Question :
Is there any way to index the following query?
SELECT run_id, MAX ( frame ) , MAX ( time ) FROM run.frames_stat GROUP BY run_id;
I’ve tried creating sorted (non-composite) indexes on frame
and time
, and an index on run_id
, but the query planner doesn’t use them.
Misc info:
- Unfortunately (and for reasons I won’t get into) I cannot change the query
- The
frames_stat
table has 42 million rows - The table is unchanging (no further inserts/deletes will ever take place)
- The query was always slow, it’s just gotten slower because this dataset is larger than in the past.
- There are no indexes on the table
- We are using Postgres 9.4
- The db’s “work_mem” size is 128MB (if that’s relevant).
- Hardware: 130GB Ram, 10 core Xeon
Schema:
CREATE TABLE run.frame_stat (
id bigint NOT NULL,
run_id bigint NOT NULL,
frame bigint NOT NULL,
heap_size bigint NOT NULL,
"time" timestamp without time zone NOT NULL,
CONSTRAINT frame_stat_pkey PRIMARY KEY (id)
)
Explain analyze:
HashAggregate (cost=1086240.000..1086242.800 rows=280 width=24) (actual time=14182.426..14182.545 rows=280 loops=1)
Group Key: run_id
-> Seq Scan on zulu (cost=0.000..770880.000 rows=42048000 width=24) (actual time=0.037..4077.182 rows=42048000 loops=1)
Answer :
Too bad
If you cannot change the query at all, that’s too bad. You won’t get a good solution.
If you had not table-qualified the table (run.
frames_stat
), you could create a materialized view (see below) with the same name in another schema (or just a temporary one) and adapt the search_path
(optionally just in sessions where this is desirable) – for hugely superior performance.
Here’s a recipe for such a technique:
@Joishi’s idea with a RULE
would be a measure of (desperate) last resort. But I would rather not go there. Too many pitfalls with unexpected behavior.
Better query / indexes
If you could change the query, you should try to emulate a loose index scan:
This is even more efficient when based on a separate table with one row per relevant run_id
– let’s call it run_tbl
. Create it if you don’t have it, yet!
Implemented with correlated subqueries:
SELECT run_id
, (SELECT frame
FROM run.frames_stat
WHERE run_id = r.run_id
ORDER BY frame DESC NULLS LAST
LIMIT 1) AS max_frame
, (SELECT "time"
FROM run.frames_stat
WHERE run_id = r.run_id
ORDER BY "time" DESC NULLS LAST
LIMIT 1) AS max_time
FROM run_tbl r;
Create two multicolumn indexes with matching sort order for lightening performance:
CREATE index fun_frame_idx ON run.frames_stat (run_id, frame DESC NULLS LAST);
CREATE index fun_frame_idx ON run.frames_stat (run_id, "time" DESC NULLS LAST);
NULLS LAST
is only necessary if there can be null values. But it won’t hurt either way.
With only 280 distinct run_id
, this will be very fast.
MATERIALIZED VIEW
Or, based on these key pieces of information:
The “frames_stat” table has 42 million rows
rows=280 — number of returned rows = disctinct run_id
The table is unchanging (no inserts/deletes)
Use a MATERIALIZED VIEW
, it will be tiny (only 280 rows) and super fast.
You still need to change the query to base it on the MV instead of the table.
Aside: never use reserved words like time
(in standard SQL) as identifier.
You could try creating an INSTEAD OF SELECT
rule on the table .. although this might break the application (depending on what all actually uses the table in question)
CREATE RULE "RETURN_MODIFIED_SELECT" AS
ON SELECT TO run.frames_stat
DO INSTEAD
<MY QUERY FROM BELOW>;
I have not used RULE
s that much, personally – so I may have this COMPLETELY wrong .. someone please correct me in a comment if that’s the case.
Quote from the documentation:
Presently, ON SELECT rules must be unconditional INSTEAD rules and must have actions that consist of a single SELECT command. Thus, an ON SELECT rule effectively turns the table into a view, whose visible contents are the rows returned by the rule’s SELECT command rather than whatever had been stored in the table (if anything). It is considered better style to write a CREATE VIEW command than to create a real table and define an ON SELECT rule for it.
If my query from below isn’t considered a “single select command” (which it may not, considering it uses CTEs), then you could try writing a function that encapsulates my query and having the rule return the select of the function.
ORIGINAL POST BELOW – It was added in OP that query is not able to be changed .. so below will not work for OP (but leaving in case others will benefit from it)
Try splitting it up into two separate queries..
WITH max_frame AS (
SELECT run_id, MAX(frame) AS max_frame FROM run.frames_stat GROUP BY run_id
), max_time AS (
SELECT run_id, MAX(time) AS max_time FROM run.frames_stat GROUP BY run_id
) SELECT
a.run_id, a.max_frame, b.max_time
FROM max_frame a
JOIN max_time b ON a.run_id = b.run_id
In terms of indexes .. an index on run_id MAY be enough for both queries .. but if it’s not, try having two indexes … one on (run_id, frame) and one on (run_id, time)
I BELIEVE this will help improve your query – postgres optimizer is probably assuming it will need to scan most of the rows of the table (even though it knows an index is available) because it will need to find the MAX (frame)
AND the MAX (time)
in one pass.. By splitting it up as I have, it will know that it only needs to find one MAX
value (instead of two), so should use the index to find it.
If that doesn’t, then you really need to provide the data requested in the link that @a_horse_with_no_name supplied..