Retrieving changed rows only from a table with 10^7 rows

Question :

I have a table with about 300 columns and about 10^7 rows, and I have to retrieve only the changed rows, ordered by time (not distinct rows, but rows whose values changed). The result set may be limited to 100 rows as well.

I’ve tried the following query with the lag() window function:

-- Suppose, we want to retrieve data from column1, column2 and column3 fields
-- There may be other fields, though
SELECT
  w."stamp",
  w."column1",
  w."column2",
  w."column3"
FROM (
       SELECT
         o.stamp,
         o.obj_id,
         "o"."column1",
         "o"."column2",
         "o"."column3",
         lag(o."column1") OVER (ORDER BY stamp) AS "_prev_column1",
         lag(o."column2") OVER (ORDER BY stamp) AS "_prev_column2",
         lag(o."column3") OVER (ORDER BY stamp) AS "_prev_column3" 
       FROM "table_name" o
       WHERE o.stamp BETWEEN '01.12.2015 00:00' AND '23.01.2016 00:00'
       ORDER BY o.stamp DESC
     ) AS w
WHERE 
  w.obj_id = 42 AND w.stamp BETWEEN '01.12.2015 00:00' AND '23.01.2016 00:00' AND 
  ("w"."_prev_column_1", "w"."_prev_column_2", "w"."_prev_column_3") IS DISTINCT FROM
  ("w"."column_1", "w"."column_2", "w"."column_3")
ORDER BY w.stamp DESC 
LIMIT 100;

However, it takes too much time to complete. Is it possible to optimize this query, or should the problem be solved in another way (e.g., with a custom function)?


Table definition:

CREATE TABLE table_name (
  id BIGINT PRIMARY KEY NOT NULL DEFAULT nextval('ds_dyn_sequence'::regclass),
  obj_id BIGINT NOT NULL,
  stamp TIMESTAMP NOT NULL DEFAULT now(),
  column1 BIGINT,
  column2 BIGINT,
  column3 BIGINT
  -- Other fields, all BIGINT NULL
);

CREATE UNIQUE INDEX obj_id_stamp_key ON table_name USING BTREE (obj_id, stamp);

The table receives about 10^4 rows per hour. It will be limited to three months of data, so the total number of rows will be approximately 2*10^7.

PostgreSQL version: 9.3

Answer :

Since the version is 9.3 (the first to support LATERAL joins), I would try this rewrite of the query:

SELECT
    w.stamp,
    w.column1,
    w.column2,
    w.column3
FROM 
    "table_name" AS w
  JOIN LATERAL
    ( SELECT p.column1, p.column2, p.column3
      FROM "table_name" AS p
      WHERE p.stamp < w.stamp
        AND p.stamp >= '2015-12-01'::timestamp
      ORDER BY p.stamp DESC
      LIMIT 1
    ) AS p
    ON (w.column1, w.column2, w.column3)
       IS DISTINCT FROM
       (p.column1, p.column2, p.column3)
WHERE w.obj_id = 42 
  AND w.stamp >= '2015-12-01'::timestamp
  AND w.stamp <= '2016-01-23'::timestamp 
ORDER BY w.stamp DESC 
LIMIT 100 ;

This should work with the index you already have, together with an additional index on (obj_id, stamp DESC).
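
A minimal sketch of that additional index, following the style of the existing one (the index name is just illustrative):

CREATE INDEX obj_id_stamp_desc_idx ON table_name USING BTREE (obj_id, stamp DESC);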


Unrelated to efficiency: are you sure that you want to use BETWEEN with timestamps? It’s more common to work with inclusive-exclusive ranges. In other words, shouldn’t this condition:

  AND w.stamp <= '2016-01-23'::timestamp 

be this instead:

  AND w.stamp  < '2016-01-23'::timestamp 

With BETWEEN (and <=), you are searching through a range that is 53 whole days long plus 1 microsecond (or whatever the precision of timestamp is).
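
To make the boundary concrete, here is a small check (plain SELECTs, runnable as-is) showing that a row stamped exactly at midnight of the upper bound is matched by <= but not by <:

SELECT '2016-01-23 00:00:00'::timestamp <= '2016-01-23'::timestamp;  -- true: included by BETWEEN / <=
SELECT '2016-01-23 00:00:00'::timestamp <  '2016-01-23'::timestamp;  -- false: excluded by the half-open range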

There’s probably no better way to get the expected result.

Some remarks:

You don’t have to repeat the w.stamp BETWEEN '01.12.2015 00:00' AND '23.01.2016 00:00' condition in the outer query; it is already applied inside the derived table.

Are you sure that the missing w.obj_id = 42 inside the derived table is correct? If that filter were part of the derived table, the existing index would be useful; otherwise you need an index on stamp alone.

For this query, the index should also include the three columns used in LAG to get the best performance, but [if you need additional columns] this might be too costly.

Is there no PARTITION BY for LAG? A sketch of what that could look like follows below.
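
For illustration only, a hedged sketch of the derived table combining the last two remarks: the obj_id filter pushed down (so the existing (obj_id, stamp) index can be used) and LAG partitioned per object. With a single obj_id the PARTITION BY is redundant, but it becomes essential as soon as several objects fall into the scanned range:

SELECT
  o.stamp,
  o.obj_id,
  o.column1,
  o.column2,
  o.column3,
  lag(o.column1) OVER (PARTITION BY o.obj_id ORDER BY o.stamp) AS _prev_column1,
  lag(o.column2) OVER (PARTITION BY o.obj_id ORDER BY o.stamp) AS _prev_column2,
  lag(o.column3) OVER (PARTITION BY o.obj_id ORDER BY o.stamp) AS _prev_column3
FROM table_name o
WHERE o.obj_id = 42
  AND o.stamp >= '2015-12-01'::timestamp
  AND o.stamp <  '2016-01-23'::timestamp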
