Minimizing SELECT latency in PostgreSQL

Posted on

Question :

I am evaluating Postgres 9.4 performance on a machine with Intel i7 quad-core 3.6 GHz CPU, 8 GB ram, and 7400 rpm HDD (no RAID) running Linux Mint. The DB schema has the following table:

          Table "public.sensor_readings"
  Column  |           Type           | Modifiers
 time     | timestamp with time zone | not null
 value    | numeric                  |
 sensor_id| integer                  | not null

      "sensor_readings_pkey" PRIMARY KEY, btree (sensor_id, "time")

This table has 72 million rows and is 35 GB in size (PKEY index is 25 GB). sensor_id ranges from 0 to 5000.
I need to query sensor values for past two weeks:

FROM sensor_readings 
WHERE sensor_id IN (1,3,8,9,12) 

The problem is that the average query execution time is about 5 minutes even after a 2000-execution warm up. This is two orders of magnitude higher that what I want to achieve!

I did not change Postgres default parameters and did no optimization.

Can anyone suggest what can be wrong with my setup or schema? Are there any optimizations that I can do to minimize SELECT execution time? In particular, what is a proper ratio between table (index?) size and RAM?



Bitmap Heap Scan on sensor_readings  (cost=3535.18..405031.07 rows=113904 width=0) (actual time=5190.213..60196.531 rows=103104 loops=1)
  Recheck Cond: ((sensor_id = ANY ('{1509,1504,1503,1500,1502}'::integer[])) AND ("time" >= (('now'::cstring)::date - '14 days'::interval)) AND ("time" <= ('now'::cstring)::date))
  Heap Blocks: exact=47786
  Buffers: shared hit=12 read=48491
  ->  Bitmap Index Scan on sensor_readings_pkey  (cost=0.00..3506.70 rows=113904 width=0) (actual time=5165.932..5165.932 rows=103750 loops=1)
        Index Cond: ((sensor_id = ANY ('{1509,1504,1503,1500,1502}'::integer[])) AND ("time" >= (('now'::cstring)::date - '14 days'::interval)) AND ("time" <= ('now'::cstring)::date))
        Buffers: shared hit=12 read=705

Planning time: 24.887 ms
Execution time: 60205.108 ms

Answer :

The main problem is that you have 60 GB of data which you want to access quickly on a single slow HDD (I’ve never heard of 7400 rpm, is it 5400 or 7200?).

You can partition on the time, so the last two weeks of data are grouped together in tight set.

Or you could try clustering on the time instead of partitioning on it. You could either build on index on the date column, cluster on that, then drop the index; or you could just make a new table by create table asdfkj as select * from sensor_readings order by time.

I would expect the table to already be pretty well clustered on time, but it doesn’t seem to be. How did this table get populated originally? How does it get kept up to date now?

But all of these methods are pretty expensive (in your time) to avoid buying a better IO system.

You should try to create an index on time, and see if this improves the execution time of your query:

CREATE INDEX time_index ON sensor_readings(time);

An alternative is to partition or cluster the table on time, as described in another answer.

Leave a Reply

Your email address will not be published. Required fields are marked *