Query optimization: how to efficiently occlude overridden records?

Posted on

Question :

Imagine a table with roughly a couple of billion rows. A simplified schema for the table would be

started_on: datetime
recorded_on: datetime
image_id: integer

It’s indexed by started_on.

One quirk of the data is that there can be many records with identical started_on values but they may have different recorded_on and image_id values. Further, for a given started_on, we only want the most recently recorded record.

Currently we get this by

1) pulling all the records within the started_on range, and then

2) using code to grab out the most recently recorded_on record for each started_on.

This works fine, but seems like a waste because we’re serializing and sending all those extra records to our app.

Is there an efficient way to accomplish the job of occluding the overridden records in our initial query?

Answer :

In Postgres, typically the most efficient way is to use distinct on ()

select distinct on (started_on) started_on, recorded_on, image_id
from the_table
order by started_on, recorded_on desc
where started_on between ...;

Leave a Reply

Your email address will not be published. Required fields are marked *