Optimizing a query which runs through all hours in a day

Question :

I need help optimizing a query. I’m using PostgreSQL 9.3.4 currently but can upgrade to 9.4 if needed.

I have a table with 60+ million records which looks like this:

                                                          Table "public.snapshots"
   Column   |           Type           |                         Modifiers                          | Storage  | Stats target | Description 
------------+--------------------------+------------------------------------------------------------+----------+--------------+-------------
 id         | integer                  | not null default nextval('snapshots_new_id_seq'::regclass) | plain    |              | 
 camera_id  | integer                  | not null                                                   | plain    |              | 
 created_at | timestamp with time zone | not null                                                   | plain    |              | 
 notes      | text                     |                                                            | extended |              | 
 data       | bytea                    | not null                                                   | extended |              | 
 is_public  | boolean                  | not null default false                                     | plain    |              | 
Indexes:
    "snapshots_new_created_at_camera_id_index" UNIQUE, btree (created_at, camera_id)
Foreign-key constraints:
    "snapshots_new_camera_id_fkey" FOREIGN KEY (camera_id) REFERENCES cameras(id) ON DELETE CASCADE
Has OIDs: no

There can be between 0 and 3600 snapshot records in an hour. For this query I’m interested only in knowing which hours in a given day for a given camera_id have 1 or more snapshot records (the actual count is unimportant).

Currently the application is implemented to execute one query for each hour in the day, like this:

SELECT count(*) AS "count" FROM "snapshots" WHERE (("snapshots"."camera_id" = 4809) AND ("created_at" >= '2015-05-24 23:00:00 UTC') AND ("created_at" <= '2015-05-24 23:59:59 UTC'));
SELECT count(*) AS "count" FROM "snapshots" WHERE (("snapshots"."camera_id" = 4809) AND ("created_at" >= '2015-05-25 00:00:00 UTC') AND ("created_at" <= '2015-05-25 00:59:59 UTC'));
...
SELECT count(*) AS "count" FROM "snapshots" WHERE (("snapshots"."camera_id" = 4809) AND ("created_at" >= '2015-05-25 22:00:00 UTC') AND ("created_at" <= '2015-05-25 22:59:59 UTC'));

Explain analyze for one query: http://explain.depesz.com/s/9tbP

I tried optimizing this and ended up with a single query that seemed to be what I needed:

SELECT count(*) AS "count" FROM "snapshots" WHERE (("created_at" > '2015-05-06 23:00:00.000000+0000') AND ("created_at" < '2015-05-08 23:00:00.000000+0000')) GROUP BY date_trunc('hour', created_at);

Explain analyze for this query: http://explain.depesz.com/s/cVUK

But this is actually about 10 times slower than the 24 queries above combined.

What am I doing wrong? Should I forget COUNT since I’m not interested in the actual count? What would the query look like then?

Edit: Thanks for all the comments and answers; I learned multiple things from you! I wish I could accept all three answers. I chose ypercube’s as it seems the most efficient and flexible.

Answer :

A variation on @Akash’s answer which uses the LATERAL syntax and results in a better execution plan (check the Index Only Scan using idx2_snapshots on snapshots in the plan below):

SELECT 
      hour AS start_hour,
      hour + interval '1 hour' AS end_hour
FROM 
      generate_series('2015-01-01'::timestamp, 
                      '2015-01-02 23:00:00'::timestamp, 
                      '1 hour') AS hour 
    , LATERAL
        ( SELECT 1 
          FROM snapshots 
          WHERE camera_id = 3 
            AND created_at >= hour 
            AND created_at < hour + interval '1 hour' 
          LIMIT 1 
        ) AS x ;

Tested in SQLfiddle (query 2). The plan:

Nested Loop  (cost=0.43..4083.77 rows=1000 width=8)
  ->  Function Scan on generate_series hour  (cost=0.00..10.00 rows=1000 width=8)
  ->  Limit  (cost=0.43..4.05 rows=1 width=0)
        ->  Index Only Scan using idx2_snapshots on snapshots  (cost=0.43..7740.98 rows=2139 width=0)
              Index Cond: ((camera_id = 3) AND (created_at >= hour.hour) AND (created_at < (hour.hour + '01:00:00'::interval)))
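
The plan depends on an index that leads with camera_id rather than created_at (the opposite of the unique index in the question’s schema). The exact definition of idx2_snapshots isn’t shown in the post; presumably it is something along these lines:

-- Assumed definition: camera_id first, then created_at, so all rows for one
-- camera are adjacent in the index (see the column-order discussion further down).
CREATE INDEX idx2_snapshots ON snapshots (camera_id, created_at);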

Just an idea: create a table with all hours in the day (0-23).

create table hours(
  hr integer
);
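
The post doesn’t show how the table gets its rows; a minimal way to populate it with the 24 hour values would be:

-- Fill the lookup table with the hours 0 through 23.
INSERT INTO hours (hr)
SELECT generate_series(0, 23);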

Then find all hours that have snapshots for the given camera_id and date (of course, you will need to substitute your own camera_id and date into the query):

select h.hr, 1 as camera_id
from hours h
where exists (
    select 1 
    from snapshots s
    where s.camera_id = 1
    and s.created_at between to_timestamp ('2015-05-01 ' || to_char(h.hr, '00') || ':00:00', 'YYYY-MM-DD HH24:MI:SS')
                             and to_timestamp ('2015-05-01 ' || to_char(h.hr, '00') || ':59:59', 'YYYY-MM-DD HH24:MI:SS')
);

See this on SQLFiddle. To return a count(*) from the table, the database has to go through all the rows that satisfy the condition in order to count them; exists() only has to find the first row that satisfies the condition.
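
For illustration, this is how the first per-hour COUNT query from the question could be rewritten with EXISTS, reusing the camera_id and time range from above (the has_snapshots alias and the half-open range are just for this sketch):

-- Returns true/false for the 23:00 hour instead of a full count.
SELECT EXISTS (
    SELECT 1
    FROM snapshots
    WHERE camera_id = 4809
      AND created_at >= '2015-05-24 23:00:00 UTC'
      AND created_at <  '2015-05-25 00:00:00 UTC'
) AS has_snapshots;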

I tried creating a small sample of your database. As zugguy suggested, we should probably force the planner to use a nested-loop semi-join, which stops processing further rows as soon as it finds the first row for a given hour and camera_id.

You can work around the temporary table by using generate_series to produce the hours for the given datetimes.

SELECT
  start_hour,
  end_hour
FROM
  (SELECT
      hour AS start_hour,
      hour + interval '1 hour' AS end_hour
   FROM
      generate_series('2015-01-01'::timestamp, '2015-01-02 23:59:59'::timestamp, '1 hour') AS hour
  ) AS hours
WHERE
  EXISTS ( SELECT 1
           FROM snapshots
           WHERE camera_id = 3
             AND created_at >= start_hour
             AND created_at < end_hour );
The plan takes 20 ms on the sample database. Notice the sequence of columns within the index, which plays a major role: I placed camera_id first and created_at second, under the assumption that the number of distinct created_at values is far greater than the number of camera_ids, and we want all the times for a given camera to be close to each other within the index, since that is our limiting factor.
