Remove duplicate values from query result

Posted on

Question :

I am using Postgres 9.3.9. I have 2 tables :

f_agent :

CREATE TABLE f_agent (
  f_agent__id bigserial NOT NULL,
  f_agent__total_users integer NOT NULL,
  f_agent__dim_time_id integer NOT NULL,
  .... other unnecessary columns
);

dim_time :

CREATE TABLE dim_time (
  dim_time__id bigserial NOT NULL,
  dim_time__date date NOT NULL,
  dim_time__month_start_date date NOT NULL,
  dim_time__week_start_date date NOT NULL,
  dim_time__quarter_start_date date NOT NULL,
  dim_time__year_start_date date NOT NULL,
  dim_time__guid uuid NOT NULL DEFAULT uuid_generate_v4(),
);

Now, f_agent__dim_time_id field in f_agent is FK referencing dim_time__id of dim_time.

My dim_time table has a date and its related week_start_date, month_start_date etc. For example: if dim_time__date is 14-07-2015 then dim_time__week_start_date will be 13-07-2015 (monday is start of week) and start_of_month is 01-07-2015 (always the first).

I have to find f_agent__total_users from f_agent given a date range and a dimension (week / month / quarter).

What I’ve tried:

SELECT dim_time__week_start_date, f_agent__total_users 
FROM f_agent
JOIN dim_time 
    ON f_agent__dim_time_id = dim_time__id 
WHERE dim_time__week_start_date IN (
    SELECT generate_series('2015-07-06', '2016-07-03', '7 day'::interval)
);

The output:

 dim_time__week_start_date   f_agent__total_users
"2015-07-13";                      3
"2015-07-13";                     33
"2015-08-10";                     12

But I only need the first value in case 2 week_start_dates are same. The output I am expecting here is:

 dim_time__week_start_date   f_agent__total_users
"2015-07-13";                      3
"2015-08-10";                     12

How can I do this?

Answer :

Another way – if your version is 9.3+ – is to use the LATERAL syntax:

SELECT * 
FROM 
  generate_series('2015-07-06', '2016-07-03', '7 day'::interval) AS d (dt)
  , LATERAL  
    ( SELECT fa.*, dt.*
      FROM dim_time AS dt
        INNER JOIN f_agent AS fa
          ON fa.f_agent__dim_time_id = dt.dim_time__id
      WHERE dim_time__week_start_date = d.dt
      ORDER BY dt.dim_time__date
      LIMIT 1
    ) AS dt ;

Tested at SQLfiddle.

The super quick fix will be to apply DISTINCT ON() to your query.

DISTINCT ON

Postgres allows the use of the DISTINCT qualifier, but allows it to be applied across a subset of your returned columns from your SELECT query.

In this case, you could simply rewrite your query as

SELECT DISTINCT ON(dim_time__week_start_date) 
dim_time__week_start_date, f_agent__total_users 
FROM f_agent
JOIN dim_time 
ON f_agent__dim_time_id = dim_time__id 
WHERE dim_time__week_start_date IN (
SELECT generate_series('2015-07-06', '2016-07-03', '7 day'::interval)
);

Refer to the Postgres documentation on DISTINCT ON within SELECT for more info.

A word of caution: By applying DISTINCT ON, it will keep only the first row, regardless of how the order is returned by your query. Unless you apply some ORDER BY criteria, you may end up with inconsistent results!

In effect, you could run the query once and get f_agent__total_users as 3, but run it later and get f_agent__total_users as 33. I would highly recommend some additional predicate criteria to ensure proper and consistent results are returned.

DB design

Simplify your design radically:

CREATE TABLE dim_time (
  dim_time__id bigserial NOT NULL PRIMARY KEY,  -- assuming PK
  dim_time__date date NOT NULL,
  dim_time__guid uuid NOT NULL DEFAULT uuid_generate_v4(),
);

These columns are functionally dependent and just noise bloating your table:

dim_time__month_start_date date NOT NULL,
dim_time__week_start_date date NOT NULL,
dim_time__quarter_start_date date NOT NULL,
dim_time__year_start_date date NOT NULL,

The redundant columns wouldn’t even help performance of the query at hand. See below.
Use the function date_trunc() instead, that will be faster overall.

If you really need a table of this form add a VIEW:

CREATE OR REPLACE TEMP VIEW dim_time_plus AS
SELECT *
     , date_trunc('week',    dim_time__date)::date AS dim_time__week_start_date
     , date_trunc('month',   dim_time__date)::date AS dim_time__month_start_date
     , date_trunc('quarter', dim_time__date)::date AS dim_time__quarter_start_date
     , date_trunc('year',    dim_time__date)::date AS dim_time__year_start_date
FROM   dim_time

Or even a MATERIALIZED VIEW if you insist on having the redundant columns persisted.

I see no reason for bigserial and a guid in that table. There are not nearly enough days in known time to justify the big data types. There must be other considerations not in the question. Or use integer (serial) instead of bigint (bigserial).

It also contradicts your FK from f_agent.f_agent__dim_time_id, which is defined integer, not bigint.

From what we see here, your whole DB design might be improved.

Query

Similar to what @ypercube and @Chris provided, but based on the decluttered design and improved.

For many rows per time range

SELECT x.d AS dim_time__week_start_date, l.f_agent__total_users
FROM  (SELECT d::date FROM generate_series(date '2015-07-06'
                                         , date '2016-07-03'
                                         , interval '7 day') d) x
     , LATERAL (
   SELECT f.f_agent__total_users 
   FROM   dim_time d
   JOIN   f_agent f ON f.f_agent__dim_time_id = d.dim_time__id
   WHERE  d.dim_time__date >= x.d
   AND    d.dim_time__date <  x.d + 7  -- or + interval '1 month' for month etc
   ORDER  BY d.dim_time__date, d.dim_time__id, f.f_agent__id  -- to break ties
   LIMIT  1
   ) l;

Details for this technique:

For few rows per time range

SELECT DISTINCT ON (1) 
       date_trunc('week', d.dim_time__date) AS dim_time__week_start_date
     , f.f_agent__total_users 
FROM   f_agent f
JOIN   dim_time d ON f.f_agent__dim_time_id = d.dim_time__id
WHERE  d.dim_time__date >= date '2015-07-06'
AND    d.dim_time__date <  date '2016-07-03' + 7
ORDER  BY 1, d.dim_time__date, d.dim_time__id, f.f_agent__id;

Details for DISTINCT ON:

I added presumed PK columns of each table (d.dim_time__date, f.f_agent__id) in the ORDER BY to get a deterministic result. Replace that with what ever expression you actually need to use to break ties.

SQL Fiddle

If every row in dim_time is referenced by at least one row in f_agent, you can join to f_agent later to make it a bit faster.

Indexes

To make these queries fast add multicolumn indexes. Most likely candidates:

CREATE INDEX dim_time_foo_idx ON dim_time (dim_time__date
                                         , dim_time__id);
CREATE INDEX f_agent_foo_idx ON f_agent (f_agent__dim_time_id
                                       , f_agent__id
                                       , f_agent__total_users);

Details depend on undisclosed cardinalities and constraints and the full picture of your situation.

Leave a Reply

Your email address will not be published. Required fields are marked *