Optimize query matching first N items of an array

Question:

I’m working with a Postgres database of chess game data, where each game is a row in a ‘records’ table. The players’ moves and the (optional) computer evaluations of those moves each have their own column and are stored as arrays.

I’ve written a query to retrieve all evaluations of a specified opening sequence of moves. (You’d think the computer evaluation would be consistent – but it’s not.) The length of the opening sequence is arbitrary – it could be one move, it could be thirty.

Here’s an example query, which finds all games starting with the same ten-move opening sequence and then, for each game with an evaluation, returns the computer’s evaluation of that point in the game –

SELECT evaluation[10]
FROM records
WHERE moves[1:10]::text[] = ARRAY['b4', 'e5', 'Bb2', 'd6', 'Nf3', 'Nf6', 'g3', 'Bg4', 'Bg2', 'h5']::text[]
AND evaluation IS NOT NULL;

I’m not certain it’s relevant, but the move data is always an alphanumeric string of 2–6 characters, and the computer evaluations are largely decimals (both positive and negative) but include the occasional special character (forced checkmates are prefixed with an octothorpe).

Here’s the relevant snippet of the table description –

     Column      |              Type              | 
-----------------+--------------------------------+-
 id              | bigint                         | 
 moves           | character varying(255)[]       |  
 evaluation      | character varying(255)[]       | 
    "records_pkey" PRIMARY KEY, btree (id)
Access method: heap

And here’s the query plan from EXPLAIN ANALYZE:

 Gather  (cost=1000.00..736354.70 rows=905 width=516) (actual time=28251.267..28253.139 rows=0 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on records  (cost=0.00..735264.20 rows=377 width=516) (actual time=28243.233..28243.234 rows=0 loops=3)
         Filter: ((evaluation IS NOT NULL) AND ((moves[1:10])::text[] = '{b4,e5,Bb2,d6,Nf3,Nf6,g3,Bg4,Bg2,h5}'::text[]))
         Rows Removed by Filter: 971361
 Planning Time: 8.275 ms
 Execution Time: 28253.915 ms

This is too slow, but I don’t know how to go about optimizing it — I’m far from an expert when it comes to Postgres and all my attempts to set up an index haven’t budged the query plan a bit.

Some additional thoughts

Since my opening sequences always start from the beginning of the game (I might want to match on moves 1 to 3 or moves 1 to 30, but never, say, moves 7 to 15), I’ve also thought about storing the move data as a space-delimited text string and matching against the start of the string. (I don’t know how to optimize that query either, but perhaps it would be easier.)

While these are currently arrays of strings, I could possibly represent both the moves and the evaluations as arrays of integers. (Not that I want to, but optimizing this query is more important, and if it helps, I’ll do it.)

What do you think? Where should I start?

Answer:

You need index support to make this fast, which is tricky without redesigning your table. The following solution should perform excellently (microseconds instead of seconds), but it requires some skill. Buckle up.

Expression index on IMMUTABLE function

Take just a few leading array elements, say 8. That prefix should already be very selective; more elements would just make the index bigger for little extra filtering.

Convert them to a string with no separator. That allows false positives, but they are unlikely enough not to matter, and we filter for exact matches in the end anyway.
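To illustrate the false-positive risk (with contrived, non-chess values): concatenating without a separator can make distinct arrays collide, which is exactly what the final exact filter weeds out:

```sql
-- Two different arrays produce the same separator-free key:
SELECT array_to_string('{Nf3,d5}'::text[], '') AS a   -- 'Nf3d5'
     , array_to_string('{Nf3d,5}'::text[], '') AS b;  -- 'Nf3d5'
```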

Only IMMUTABLE functions are allowed in index expressions. But array_to_string() is only STABLE, not IMMUTABLE, because it takes anyarray input and some element types don’t have an immutable text representation. We only deal with text (well, varchar(255) for no good reason, but all the same), whose text representation is, in fact, immutable. But array_to_string() can’t know that.
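You can verify the declared volatility in the system catalog, where provolatile is 'i' for IMMUTABLE, 's' for STABLE, and 'v' for VOLATILE:

```sql
-- array_to_string() is declared STABLE:
SELECT proname, provolatile
FROM   pg_proc
WHERE  proname = 'array_to_string';  -- provolatile = 's'
```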

So we could “fake” an immutable wrapper function. That’s possible for any user who can create functions:

CREATE OR REPLACE FUNCTION public.f_8moves(text[])
  RETURNS text
  LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT AS
$func$
SELECT array_to_string($1[1:8], '')
$func$;

Or, better yet, define an honestly IMMUTABLE variant of array_to_string() for only text[] input, directly based on the underlying C function. Faster and cleaner. Let’s call it array_to_string_immutable() to be clear. That requires superuser privileges:

-- SET ROLE postgres;  -- you must be superuser

CREATE OR REPLACE FUNCTION public.array_to_string_immutable(text[], text)
 RETURNS text
 LANGUAGE internal IMMUTABLE PARALLEL SAFE STRICT AS 'array_to_text';

The rest works without superuser privileges.

CREATE OR REPLACE FUNCTION public.f_8moves(text[])
  RETURNS text
  LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT AS
$func$
SELECT public.array_to_string_immutable($1[1:8], '')
$func$;
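A quick sanity check of the wrapper, using the opening sequence from the question (either variant of f_8moves behaves the same):

```sql
SELECT public.f_8moves('{b4,e5,Bb2,d6,Nf3,Nf6,g3,Bg4,Bg2,h5}');
-- 'b4e5Bb2d6Nf3Nf6g3Bg4'  (only the first 8 elements, no separator)
```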

Either way, we now have a function public.f_8moves(text[]) to be used in the following index:

CREATE INDEX records_8moves_idx ON records (public.f_8moves(moves) COLLATE "C");

COLLATE "C" is exactly what we need to allow left-anchored (your expressed requirement) LIKE expressions. See:

If a major percentage of rows has evaluation IS NULL, add that filter to the index definition to make it a partial index on top, which keeps the index smaller:

CREATE INDEX records_8moves_idx ON records (public.f_8moves(moves) COLLATE "C")
WHERE evaluation IS NOT NULL;

Query

For opening sequences of 8 moves (our indexed lead) or more, like the ten-move example:

SELECT evaluation[10]
FROM   records
WHERE  public.f_8moves(moves) = public.f_8moves('{b4,e5,Bb2,d6,Nf3,Nf6,g3,Bg4,Bg2,h5}') COLLATE "C"
AND    moves[1:10] = '{b4,e5,Bb2,d6,Nf3,Nf6,g3,Bg4,Bg2,h5}'
AND    evaluation IS NOT NULL;
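To confirm the index is actually used, check the query plan. With the expression and collation matching the index definition, you should see an index scan on records_8moves_idx instead of the parallel sequential scan from the question (the exact plan shape depends on your data and settings):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT evaluation[10]
FROM   records
WHERE  public.f_8moves(moves) = public.f_8moves('{b4,e5,Bb2,d6,Nf3,Nf6,g3,Bg4,Bg2,h5}') COLLATE "C"
AND    moves[1:10] = '{b4,e5,Bb2,d6,Nf3,Nf6,g3,Bg4,Bg2,h5}'
AND    evaluation IS NOT NULL;
```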

COLLATE "C" is required to match the index.

Use the expression public.f_8moves(moves) to match the functional index. The right side of the expression can be given as string directly. But I use the same function for convenience.

Then add the original exact filter to narrow the result to true matches:

AND    moves[1:10] = '{b4,e5,Bb2,d6,Nf3,Nf6,g3,Bg4,Bg2,h5}'

It looks redundant, but it isn’t quite: the index condition only covers the first 8 moves (concatenated without separators), so the exact filter weeds out rare false positives and checks moves beyond the indexed lead (moves 9 and 10 here), while the index still does the heavy lifting.

For sequences with fewer than 8 moves (our indexed lead), or as a general solution, use LIKE with a left-anchored pattern:

SELECT evaluation[10]
FROM   records
WHERE  public.f_8moves(moves) LIKE (public.f_8moves('{b4,e5}') || '%') COLLATE "C"  -- COLLATE "C" is optional for LIKE
AND    moves[1:2] = '{b4,e5}'
AND    evaluation IS NOT NULL;

Cluster table rows

It will generally help performance to physically sort your table rows by the same ordering, so that each query reads its results from one or a few data pages instead of fetching rows scattered all over the table.

Use CLUSTER, or the non-blocking alternatives pg_repack or pg_squeeze.
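For example, to cluster on the new index (note that CLUSTER rewrites the table under an exclusive lock, and the physical order is not maintained for later writes, so it may need to be repeated periodically):

```sql
CLUSTER records USING records_8moves_idx;
ANALYZE records;  -- refresh planner statistics after the rewrite
```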
