PostgreSQL window functions very slow on big table

Posted on

Question :

I have a PostgreSQL table constructed as

device_id | point     | dt_edit
100       | geometry  | timestamp
101       | geometry  | timestamp
100       | geometry  | timestamp
102       | geometry  | timestamp
102       | geometry  | timestamp
101       | geometry  | timestamp

I need to select the last 2 records ordered by dt_edit from unique device_id. This query works very slow, on 1 billion records – 500 seconds:

 SELECT dt.device_id,
   FROM ( SELECT gps_data.device_id,
            rank() OVER (PARTITION BY gps_data.device_id
                         ORDER BY gps_data.dt_edit DESC) AS rank
           FROM gps_data) dt
  WHERE dt.rank <= 2

Any ideas?

Answer :

If we can assume a device table holding all devices of interest. Example:

CREATE TABLE device (device_id int, device text);
INSERT INTO device (device_id, device) VALUES
  (100, 'a')
, (101, 'b')
, (102, 'c');

The query can be very simple:

SELECT d.device_id, g.point, g.dt_edit
FROM   device d
,      LATERAL (
   SELECT point, dt_edit
   FROM   gps_data
   WHERE  device_id = d.device_id
   ORDER  BY dt_edit DESC NULLS LAST -- more items?
   LIMIT  2
   ) g;

Assuming a much smaller cardinality in table device than in table gps_data, i.e. many rows per device in the latter.

If dt_edit is not UNIQUE, add more items to ORDER BY to disambiguate the sort order.

All you need for this to be fast is a multicolumn index;

CREATE INDEX gps_data_foo_idx
ON gps_data (device_id, dt_edit DESC NULLS LAST); -- more items? 

The NULLS LAST modifier is useful if the column can be NULL, but does not hurt in any case. Either way, the query must match the index.

If you don’t have a device table, I would suggest to create one (and keep it current).
But there are ways to make this fast even without device table …


Leave a Reply

Your email address will not be published. Required fields are marked *