Question :
I have a table with about 300 columns and about 10⁷ rows, and I have to retrieve only the rows that changed, ordered by time (not distinct rows, but changed ones). The result set may also be limited to 100 rows.
I’ve tried the following query with the lag window function:
-- Suppose we want to retrieve data from the column1, column2 and column3 fields
-- There may be other fields, though
SELECT
    w.stamp,
    w.column1,
    w.column2,
    w.column3
FROM (
    SELECT
        o.stamp,
        o.obj_id,
        o.column1,
        o.column2,
        o.column3,
        lag(o.column1) OVER (ORDER BY stamp) AS _prev_column1,
        lag(o.column2) OVER (ORDER BY stamp) AS _prev_column2,
        lag(o.column3) OVER (ORDER BY stamp) AS _prev_column3
    FROM table_name o
    WHERE o.stamp BETWEEN '01.12.2015 00:00' AND '23.01.2016 00:00'
    ORDER BY o.stamp DESC
) AS w
WHERE
    w.obj_id = 42
    AND w.stamp BETWEEN '01.12.2015 00:00' AND '23.01.2016 00:00'
    AND (w._prev_column1, w._prev_column2, w._prev_column3) IS DISTINCT FROM
        (w.column1, w.column2, w.column3)
ORDER BY w.stamp DESC
LIMIT 100;
However, it takes too much time to complete. Is it possible to optimize this query, or should the problem be solved in another way (e.g., with a custom function)?
Table definition:
CREATE TABLE table_name (
    id      BIGINT PRIMARY KEY NOT NULL DEFAULT nextval('ds_dyn_sequence'::regclass),
    obj_id  BIGINT NOT NULL,
    stamp   TIMESTAMP NOT NULL DEFAULT now(),
    column1 BIGINT,
    column2 BIGINT,
    column3 BIGINT
    -- Other fields, all BIGINT NULL
);
CREATE UNIQUE INDEX obj_id_stamp_key ON table_name USING BTREE (obj_id, stamp);
The table receives about 10⁴ rows per hour. The table will be limited to three months of data, so the total number of rows will be approximately 2·10⁷.
PostgreSQL version: 9.3
Answer :
Since your version is 9.3 (the first release with LATERAL joins), I would try this rewrite of the query:
SELECT
    w.stamp,
    w.column1,
    w.column2,
    w.column3
FROM table_name AS w
JOIN LATERAL
    -- for each row, fetch the single row immediately preceding it in time
    ( SELECT p.column1, p.column2, p.column3
      FROM table_name AS p
      WHERE p.stamp < w.stamp
        AND p.stamp >= '2015-12-01'::timestamp
      ORDER BY p.stamp DESC
      LIMIT 1
    ) AS p
    -- keep a row only if at least one of the three columns changed
    ON (w.column1, w.column2, w.column3)
       IS DISTINCT FROM
       (p.column1, p.column2, p.column3)
WHERE w.obj_id = 42
  AND w.stamp >= '2015-12-01'::timestamp
  AND w.stamp <= '2016-01-23'::timestamp
ORDER BY w.stamp DESC
LIMIT 100 ;
This should work well with the index you have, plus an additional index on (obj_id, stamp DESC).
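A minimal sketch of that second index (the index name is only illustrative):
CREATE INDEX table_name_obj_id_stamp_desc_idx
    ON table_name (obj_id, stamp DESC);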
Unrelated to efficiency: are you sure that you want to use BETWEEN with timestamps? It’s more common to work with inclusive-exclusive ranges. In other words, shouldn’t this condition:
AND w.stamp <= '2016-01-23'::timestamp
be:
AND w.stamp < '2016-01-23'::timestamp
With BETWEEN (and <=), you are searching through a range that is 53 whole days long plus 1 microsecond (or whatever the accuracy of timestamp is).
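To make the off-by-one-microsecond point concrete, BETWEEN includes both endpoints, which you can check directly:
SELECT timestamp '2016-01-23 00:00:00'
       BETWEEN timestamp '2015-12-01' AND timestamp '2016-01-23';  -- true
SELECT timestamp '2016-01-23 00:00:00' < timestamp '2016-01-23';   -- false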
There’s probably no better way to get the expected result.
Some remarks:
You don’t have to repeat the w.stamp BETWEEN '01.12.2015 00:00' AND '23.01.2016 00:00' condition in the outer query.
Are you sure that the missing w.obj_id = 42 in the derived table is correct? If it were part of the derived table, the existing index would be useful; otherwise you need one on stamp.
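If the filter stays in the outer query only, that index on stamp alone would be something like (name illustrative):
CREATE INDEX table_name_stamp_idx ON table_name (stamp);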
For this query, the index should include the three columns used in LAG to get the best performance, but [if you need additional columns] this might be too costly.
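As a sketch, such a covering index could look like this (PostgreSQL 9.3 has no INCLUDE clause, so the payload columns have to go into the key itself; the name is illustrative):
CREATE INDEX table_name_lag_idx
    ON table_name (obj_id, stamp, column1, column2, column3);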
Is there no PARTITION BY for LAG?
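If changes are meant to be tracked per object, the window definition would look like this (a sketch, assuming per-obj_id change tracking is what you want):
lag(o.column1) OVER (PARTITION BY o.obj_id ORDER BY o.stamp) AS _prev_column1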