Imagine a table with roughly a couple of billion rows. A simplified schema for the table would be
started_on: datetime recorded_on: datetime image_id: integer
It’s indexed by
One quirk of the data is that there can be many records with identical
started_on values but they may have different
image_id values. Further, for a given
started_on, we only want the most recently recorded record.
Currently we get this by
1) pulling all the records within the
started_on range, and then
2) using code to grab out the most recently
recorded_on record for each
This works fine, but seems like a waste because we’re serializing and sending all those extra records to our app.
Is there an efficient way to accomplish the job of occluding the overridden records in our initial query?
In Postgres, typically the most efficient way is to use
distinct on ()
select distinct on (started_on) started_on, recorded_on, image_id from the_table order by started_on, recorded_on desc where started_on between ...;