Postgres Index a query with MAX and groupBy

Posted on

Question :

Is there any way to index the following query?

SELECT run_id, MAX ( frame ) , MAX ( time ) FROM run.frames_stat GROUP BY run_id;

I’ve tried creating sorted (non-composite) indexes on frame and time, and an index on run_id, but the query planner doesn’t use them.

Misc info:

  • Unfortunately (and for reasons I won’t get into) I cannot change the query
  • The frames_stat table has 42 million rows
  • The table is unchanging (no further inserts/deletes will ever take place)
  • The query was always slow, it’s just gotten slower because this dataset is larger than in the past.
  • There are no indexes on the table
  • We are using Postgres 9.4
  • The db’s “work_mem” size is 128MB (if that’s relevant).
  • Hardware: 130GB Ram, 10 core Xeon

Schema:

CREATE TABLE run.frame_stat (
  id bigint NOT NULL,
  run_id bigint NOT NULL,
  frame bigint NOT NULL,
  heap_size bigint NOT NULL,
  "time" timestamp without time zone NOT NULL,
  CONSTRAINT frame_stat_pkey PRIMARY KEY (id)
)

Explain analyze:

HashAggregate  (cost=1086240.000..1086242.800 rows=280 width=24) (actual time=14182.426..14182.545 rows=280 loops=1)
  Group Key: run_id
  ->  Seq Scan on zulu  (cost=0.000..770880.000 rows=42048000 width=24) (actual time=0.037..4077.182 rows=42048000 loops=1)

Answer :

Too bad

If you cannot change the query at all, that’s too bad. You won’t get a good solution.
If you had not table-qualified the table (run.frames_stat), you could create a materialized view (see below) with the same name in another schema (or just a temporary one) and adapt the search_path (optionally just in sessions where this is desirable) – for hugely superior performance.

Here’s a recipe for such a technique:

@Joishi’s idea with a RULE would be a measure of (desperate) last resort. But I would rather not go there. Too many pitfalls with unexpected behavior.

Better query / indexes

If you could change the query, you should try to emulate a loose index scan:

This is even more efficient when based on a separate table with one row per relevant run_id – let’s call it run_tbl. Create it if you don’t have it, yet!
Implemented with correlated subqueries:

SELECT run_id
    , (SELECT frame
       FROM   run.frames_stat
       WHERE  run_id = r.run_id
       ORDER  BY frame DESC NULLS LAST
       LIMIT  1) AS max_frame
    , (SELECT "time"
       FROM   run.frames_stat
       WHERE  run_id = r.run_id
       ORDER  BY "time" DESC NULLS LAST
       LIMIT  1) AS max_time
FROM   run_tbl r;

Create two multicolumn indexes with matching sort order for lightening performance:

CREATE index fun_frame_idx ON run.frames_stat (run_id, frame DESC NULLS LAST);
CREATE index fun_frame_idx ON run.frames_stat (run_id, "time" DESC NULLS LAST);

NULLS LAST is only necessary if there can be null values. But it won’t hurt either way.

With only 280 distinct run_id, this will be very fast.

MATERIALIZED VIEW

Or, based on these key pieces of information:

The “frames_stat” table has 42 million rows

rows=280 — number of returned rows = disctinct run_id

The table is unchanging (no inserts/deletes)

Use a MATERIALIZED VIEW, it will be tiny (only 280 rows) and super fast.
You still need to change the query to base it on the MV instead of the table.

Aside: never use reserved words like time (in standard SQL) as identifier.

You could try creating an INSTEAD OF SELECT rule on the table .. although this might break the application (depending on what all actually uses the table in question)

CREATE RULE "RETURN_MODIFIED_SELECT" AS
    ON SELECT TO run.frames_stat
    DO INSTEAD
        <MY QUERY FROM BELOW>;

I have not used RULEs that much, personally – so I may have this COMPLETELY wrong .. someone please correct me in a comment if that’s the case.

Quote from the documentation:

Presently, ON SELECT rules must be unconditional INSTEAD rules and must have actions that consist of a single SELECT command. Thus, an ON SELECT rule effectively turns the table into a view, whose visible contents are the rows returned by the rule’s SELECT command rather than whatever had been stored in the table (if anything). It is considered better style to write a CREATE VIEW command than to create a real table and define an ON SELECT rule for it.

If my query from below isn’t considered a “single select command” (which it may not, considering it uses CTEs), then you could try writing a function that encapsulates my query and having the rule return the select of the function.

CREATE RULE documentation

ORIGINAL POST BELOW – It was added in OP that query is not able to be changed .. so below will not work for OP (but leaving in case others will benefit from it)

Try splitting it up into two separate queries..

WITH max_frame AS (
  SELECT run_id, MAX(frame) AS max_frame FROM run.frames_stat GROUP BY run_id
), max_time AS (
  SELECT run_id, MAX(time) AS max_time FROM run.frames_stat GROUP BY run_id
) SELECT
  a.run_id, a.max_frame, b.max_time
FROM max_frame a
JOIN max_time b ON a.run_id = b.run_id

In terms of indexes .. an index on run_id MAY be enough for both queries .. but if it’s not, try having two indexes … one on (run_id, frame) and one on (run_id, time)

I BELIEVE this will help improve your query – postgres optimizer is probably assuming it will need to scan most of the rows of the table (even though it knows an index is available) because it will need to find the MAX (frame) AND the MAX (time) in one pass.. By splitting it up as I have, it will know that it only needs to find one MAX value (instead of two), so should use the index to find it.

If that doesn’t, then you really need to provide the data requested in the link that @a_horse_with_no_name supplied..

Leave a Reply

Your email address will not be published. Required fields are marked *