Question :
I am using Postgres 9.3.9. I have 2 tables :
f_agent :
CREATE TABLE f_agent (
f_agent__id bigserial NOT NULL,
f_agent__total_users integer NOT NULL,
f_agent__dim_time_id integer NOT NULL,
.... other unnecessary columns
);
dim_time :
CREATE TABLE dim_time (
dim_time__id bigserial NOT NULL,
dim_time__date date NOT NULL,
dim_time__month_start_date date NOT NULL,
dim_time__week_start_date date NOT NULL,
dim_time__quarter_start_date date NOT NULL,
dim_time__year_start_date date NOT NULL,
dim_time__guid uuid NOT NULL DEFAULT uuid_generate_v4(),
);
Now, f_agent__dim_time_id
field in f_agent
is FK referencing dim_time__id
of dim_time
.
My dim_time
table has a date and its related week_start_date
, month_start_date
etc. For example: if dim_time__date
is 14-07-2015
then dim_time__week_start_date
will be 13-07-2015
(monday is start of week) and start_of_month
is 01-07-2015
(always the first).
I have to find f_agent__total_users
from f_agent given a date range and a dimension (week / month / quarter).
What I’ve tried:
SELECT dim_time__week_start_date, f_agent__total_users
FROM f_agent
JOIN dim_time
ON f_agent__dim_time_id = dim_time__id
WHERE dim_time__week_start_date IN (
SELECT generate_series('2015-07-06', '2016-07-03', '7 day'::interval)
);
The output:
dim_time__week_start_date f_agent__total_users
"2015-07-13"; 3
"2015-07-13"; 33
"2015-08-10"; 12
But I only need the first value in case 2 week_start_dates
are same. The output I am expecting here is:
dim_time__week_start_date f_agent__total_users
"2015-07-13"; 3
"2015-08-10"; 12
How can I do this?
Answer :
Another way – if your version is 9.3+ – is to use the LATERAL
syntax:
SELECT *
FROM
generate_series('2015-07-06', '2016-07-03', '7 day'::interval) AS d (dt)
, LATERAL
( SELECT fa.*, dt.*
FROM dim_time AS dt
INNER JOIN f_agent AS fa
ON fa.f_agent__dim_time_id = dt.dim_time__id
WHERE dim_time__week_start_date = d.dt
ORDER BY dt.dim_time__date
LIMIT 1
) AS dt ;
Tested at SQLfiddle.
The super quick fix will be to apply DISTINCT ON()
to your query.
DISTINCT ON
Postgres allows the use of the DISTINCT
qualifier, but allows it to be applied across a subset of your returned columns from your SELECT
query.
In this case, you could simply rewrite your query as
SELECT DISTINCT ON(dim_time__week_start_date)
dim_time__week_start_date, f_agent__total_users
FROM f_agent
JOIN dim_time
ON f_agent__dim_time_id = dim_time__id
WHERE dim_time__week_start_date IN (
SELECT generate_series('2015-07-06', '2016-07-03', '7 day'::interval)
);
Refer to the Postgres documentation on DISTINCT ON
within SELECT
for more info.
A word of caution: By applying DISTINCT ON
, it will keep only the first row, regardless of how the order is returned by your query. Unless you apply some ORDER BY
criteria, you may end up with inconsistent results!
In effect, you could run the query once and get f_agent__total_users
as 3
, but run it later and get f_agent__total_users
as 33
. I would highly recommend some additional predicate criteria to ensure proper and consistent results are returned.
DB design
Simplify your design radically:
CREATE TABLE dim_time (
dim_time__id bigserial NOT NULL PRIMARY KEY, -- assuming PK
dim_time__date date NOT NULL,
dim_time__guid uuid NOT NULL DEFAULT uuid_generate_v4(),
);
These columns are functionally dependent and just noise bloating your table:
dim_time__month_start_date date NOT NULL,
dim_time__week_start_date date NOT NULL,
dim_time__quarter_start_date date NOT NULL,
dim_time__year_start_date date NOT NULL,
The redundant columns wouldn’t even help performance of the query at hand. See below.
Use the function date_trunc()
instead, that will be faster overall.
If you really need a table of this form add a VIEW
:
CREATE OR REPLACE TEMP VIEW dim_time_plus AS
SELECT *
, date_trunc('week', dim_time__date)::date AS dim_time__week_start_date
, date_trunc('month', dim_time__date)::date AS dim_time__month_start_date
, date_trunc('quarter', dim_time__date)::date AS dim_time__quarter_start_date
, date_trunc('year', dim_time__date)::date AS dim_time__year_start_date
FROM dim_time
Or even a MATERIALIZED VIEW
if you insist on having the redundant columns persisted.
I see no reason for bigserial
and a guid
in that table. There are not nearly enough days in known time to justify the big data types. There must be other considerations not in the question. Or use integer
(serial
) instead of bigint
(bigserial
).
It also contradicts your FK from f_agent.f_agent__dim_time_id
, which is defined integer
, not bigint
.
From what we see here, your whole DB design might be improved.
Query
Similar to what @ypercube and @Chris provided, but based on the decluttered design and improved.
For many rows per time range
SELECT x.d AS dim_time__week_start_date, l.f_agent__total_users
FROM (SELECT d::date FROM generate_series(date '2015-07-06'
, date '2016-07-03'
, interval '7 day') d) x
, LATERAL (
SELECT f.f_agent__total_users
FROM dim_time d
JOIN f_agent f ON f.f_agent__dim_time_id = d.dim_time__id
WHERE d.dim_time__date >= x.d
AND d.dim_time__date < x.d + 7 -- or + interval '1 month' for month etc
ORDER BY d.dim_time__date, d.dim_time__id, f.f_agent__id -- to break ties
LIMIT 1
) l;
Details for this technique:
For few rows per time range
SELECT DISTINCT ON (1)
date_trunc('week', d.dim_time__date) AS dim_time__week_start_date
, f.f_agent__total_users
FROM f_agent f
JOIN dim_time d ON f.f_agent__dim_time_id = d.dim_time__id
WHERE d.dim_time__date >= date '2015-07-06'
AND d.dim_time__date < date '2016-07-03' + 7
ORDER BY 1, d.dim_time__date, d.dim_time__id, f.f_agent__id;
Details for DISTINCT ON:
I added presumed PK columns of each table (d.dim_time__date, f.f_agent__id
) in the ORDER BY
to get a deterministic result. Replace that with what ever expression you actually need to use to break ties.
If every row in dim_time
is referenced by at least one row in f_agent
, you can join to f_agent
later to make it a bit faster.
Indexes
To make these queries fast add multicolumn indexes. Most likely candidates:
CREATE INDEX dim_time_foo_idx ON dim_time (dim_time__date
, dim_time__id);
CREATE INDEX f_agent_foo_idx ON f_agent (f_agent__dim_time_id
, f_agent__id
, f_agent__total_users);
Details depend on undisclosed cardinalities and constraints and the full picture of your situation.