Is there any way to index the following query?
```sql
SELECT run_id, MAX(frame), MAX(time)
FROM   run.frames_stat
GROUP  BY run_id;
```
I’ve tried creating sorted (non-composite) indexes on `time`, and an index on `run_id`, but the query planner doesn’t use them.
- Unfortunately (and for reasons I won’t get into) I cannot change the query.
- The `frames_stat` table has 42 million rows.
- The table is unchanging (no further inserts/deletes will ever take place).
- The query was always slow; it’s just gotten slower because this dataset is larger than in the past.
- There are no indexes on the table.
- We are using Postgres 9.4.
- The db’s `work_mem` is 128 MB (if that’s relevant).
- Hardware: 130 GB RAM, 10-core Xeon.
```sql
CREATE TABLE run.frame_stat (
  id        bigint NOT NULL,
  run_id    bigint NOT NULL,
  frame     bigint NOT NULL,
  heap_size bigint NOT NULL,
  "time"    timestamp without time zone NOT NULL,
  CONSTRAINT frame_stat_pkey PRIMARY KEY (id)
);
```
```
HashAggregate  (cost=1086240.00..1086242.80 rows=280 width=24) (actual time=14182.426..14182.545 rows=280 loops=1)
  Group Key: run_id
  ->  Seq Scan on zulu  (cost=0.00..770880.00 rows=42048000 width=24) (actual time=0.037..4077.182 rows=42048000 loops=1)
```
If you cannot change the query at all, that’s too bad. You won’t get a good solution.
If you had not table-qualified the table (`frames_stat`), you could create a materialized view (see below) with the same name in another schema (or just a temporary one) and adapt the `search_path` (optionally just in sessions where this is desirable) – for hugely superior performance.
Here’s a recipe for such a technique:
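A minimal sketch of the technique (the schema name `mv` is made up; the MV mirrors the original column names, so the unchanged – but unqualified – query would still run against it and return the same result):

```sql
-- Hypothetical schema to shadow the original table:
CREATE SCHEMA mv;

-- Precompute the aggregates once; column names mirror the base table,
-- so the original (unqualified) query runs unchanged against the MV:
CREATE MATERIALIZED VIEW mv.frames_stat AS
SELECT run_id, max(frame) AS frame, max("time") AS "time"
FROM   run.frames_stat
GROUP  BY run_id;

-- Sessions that should read the MV resolve "frames_stat" in schema mv first:
SET search_path = mv, run, public;
```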
@Joishi’s idea with a `RULE` would be a measure of (desperate) last resort. But I would rather not go there. Too many pitfalls with unexpected behavior.
Better query / indexes
If you could change the query, you should try to emulate a loose index scan:
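One common way to emulate a loose index scan in Postgres (9.4 has no native skip scan) is a recursive CTE that hops from one `run_id` to the next, relying on an index whose leading column is `run_id`. A sketch:

```sql
WITH RECURSIVE cte AS (
   (SELECT run_id FROM run.frames_stat ORDER BY run_id LIMIT 1)
   UNION ALL
   SELECT (SELECT f.run_id
           FROM   run.frames_stat f
           WHERE  f.run_id > c.run_id
           ORDER  BY f.run_id
           LIMIT  1)
   FROM   cte c
   WHERE  c.run_id IS NOT NULL        -- stop after the last run_id
   )
SELECT run_id
     , (SELECT max(frame)  FROM run.frames_stat WHERE run_id = c.run_id) AS max_frame
     , (SELECT max("time") FROM run.frames_stat WHERE run_id = c.run_id) AS max_time
FROM   cte c
WHERE  run_id IS NOT NULL;            -- drop the terminating NULL row
```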
This is even more efficient when based on a separate table with one row per relevant `run_id` – let’s call it `run_tbl`. Create it if you don’t have it yet!
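If it doesn’t exist yet, it could be created along these lines (a sketch; without an index the `DISTINCT` is a sequential scan, which is fine for a one-off on an unchanging table):

```sql
CREATE TABLE run_tbl AS
SELECT DISTINCT run_id
FROM   run.frames_stat;

ALTER TABLE run_tbl ADD PRIMARY KEY (run_id);
```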
Implemented with correlated subqueries:
```sql
SELECT run_id
     , (SELECT frame  FROM run.frames_stat
        WHERE  run_id = r.run_id
        ORDER  BY frame DESC NULLS LAST
        LIMIT  1) AS max_frame
     , (SELECT "time" FROM run.frames_stat
        WHERE  run_id = r.run_id
        ORDER  BY "time" DESC NULLS LAST
        LIMIT  1) AS max_time
FROM   run_tbl r;
```
Create two multicolumn indexes with matching sort order for lightning-fast performance (note the two indexes need distinct names; these names are just examples):

```sql
CREATE INDEX fun_frame_idx ON run.frames_stat (run_id, frame DESC NULLS LAST);
CREATE INDEX fun_time_idx  ON run.frames_stat (run_id, "time" DESC NULLS LAST);
```
NULLS LAST is only necessary if there can be null values. But it won’t hurt either way.
With only 280 distinct `run_id` values, this will be very fast.
Or, based on these key pieces of information:

- The `frames_stat` table has 42 million rows
- `rows=280` – the number of returned rows equals the number of distinct `run_id` values
- The table is unchanging (no inserts/deletes)

… create a `MATERIALIZED VIEW`: it will be tiny (only 280 rows) and super fast.
You still need to change the query to base it on the MV instead of the table.
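Such an MV might look like this (the name `frames_stat_mv` is made up; since the table never changes, `REFRESH MATERIALIZED VIEW` is never needed):

```sql
CREATE MATERIALIZED VIEW run.frames_stat_mv AS
SELECT run_id, max(frame) AS max_frame, max("time") AS max_time
FROM   run.frames_stat
GROUP  BY run_id;
```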
Aside: never use words that are reserved in standard SQL, like `time`, as identifiers.
You could try creating an `INSTEAD OF SELECT` rule on the table … although this might break the application (depending on what all actually uses the table in question).
```sql
CREATE RULE "RETURN_MODIFIED_SELECT" AS
ON SELECT TO run.frames_stat
DO INSTEAD <MY QUERY FROM BELOW>;
```
I have not used `RULE`s that much, personally – so I may have this COMPLETELY wrong … someone please correct me in a comment if that’s the case.
Quote from the documentation:
> Presently, ON SELECT rules must be unconditional INSTEAD rules and must have actions that consist of a single SELECT command. Thus, an ON SELECT rule effectively turns the table into a view, whose visible contents are the rows returned by the rule’s SELECT command rather than whatever had been stored in the table (if anything). It is considered better style to write a CREATE VIEW command than to create a real table and define an ON SELECT rule for it.
If my query from below isn’t considered a “single SELECT command” (which it may not be, considering it uses CTEs), then you could try writing a function that encapsulates my query and having the rule select from that function.
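A function wrapper could be sketched like this (`frames_stat_agg` is a made-up name; note also that an `ON SELECT` rule must return the table’s exact row type, while this returns only three columns – yet another reason the rule approach may not fly):

```sql
-- Encapsulate the aggregate query in a set-returning SQL function:
CREATE FUNCTION frames_stat_agg()
  RETURNS TABLE (run_id bigint, max_frame bigint, max_time timestamp)
  LANGUAGE sql STABLE AS
$$
WITH max_frame AS (
   SELECT run_id, max(frame) AS max_frame
   FROM   run.frames_stat
   GROUP  BY run_id
), max_time AS (
   SELECT run_id, max("time") AS max_time
   FROM   run.frames_stat
   GROUP  BY run_id
)
SELECT a.run_id, a.max_frame, b.max_time
FROM   max_frame a
JOIN   max_time  b USING (run_id)
$$;
```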
ORIGINAL POST BELOW – It was added in the OP that the query cannot be changed … so the below will not work for the OP (but I’m leaving it in case others benefit from it).
Try splitting it up into two separate queries:

```sql
WITH max_frame AS (
   SELECT run_id, MAX(frame) AS max_frame
   FROM   run.frames_stat
   GROUP  BY run_id
), max_time AS (
   SELECT run_id, MAX(time) AS max_time
   FROM   run.frames_stat
   GROUP  BY run_id
)
SELECT a.run_id, a.max_frame, b.max_time
FROM   max_frame a
JOIN   max_time  b ON a.run_id = b.run_id;
```
In terms of indexes, an index on `run_id` MAY be enough for both queries, but if it’s not, try having two indexes: one on `(run_id, frame)` and one on `(run_id, time)`.
I BELIEVE this will help improve your query – the Postgres optimizer is probably assuming it will need to scan most of the rows of the table (even though it knows an index is available) because it needs to find the `MAX(frame)` AND the `MAX(time)` in one pass. By splitting it up as I have, each part only needs to find one MAX value (instead of two), so the planner should use the index to find it.
If that doesn’t help, then you really need to provide the data requested in the link that @a_horse_with_no_name supplied.