Question :
I’m working with a Postgres DB full of chess game data right now, where each game is a row in a ‘records’ table. The players’ moves and the (optional) computer evaluations of those moves each have their own column and are stored as arrays.
I’ve written a query to retrieve all evaluations of a specified opening sequence of moves. (You’d think the computer evaluation would be consistent – but it’s not.) The length of the opening sequence is arbitrary – it could be one move, it could be thirty.
Here’s an example query, which finds all games starting with the same ten-move opening sequence and then, for each game with an evaluation, returns the computer’s evaluation of that point in the game –
SELECT evaluation[10]
FROM records
WHERE moves[1:10]::text[] = ARRAY['b4', 'e5', 'Bb2', 'd6', 'Nf3', 'Nf6', 'g3', 'Bg4', 'Bg2', 'h5']::text[]
AND evaluation IS NOT NULL;
I’m not certain it’s relevant, but the move data is always an alphanumeric string from 2-6 characters, and the computer evaluations are largely decimals (both positive and negative) but do include the occasional special character (forced checkmates have an octothorpe for a prefix).
Here’s the relevant snippet of the table description –
Column | Type |
-----------------+--------------------------------+-
id | bigint |
moves | character varying(255)[] |
evaluation | character varying(255)[] |
"records_pkey" PRIMARY KEY, btree (id)
Access method: heap
And here’s the query plan from EXPLAIN ANALYZE:
Gather (cost=1000.00..736354.70 rows=905 width=516) (actual time=28251.267..28253.139 rows=0 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on records (cost=0.00..735264.20 rows=377 width=516) (actual time=28243.233..28243.234 rows=0 loops=3)
Filter: ((evaluation IS NOT NULL) AND ((moves[1:10])::text[] = '{b4,e5,Bb2,d6,Nf3,Nf6,g3,Bg4,Bg2,h5}'::text[]))
Rows Removed by Filter: 971361
Planning Time: 8.275 ms
Execution Time: 28253.915 ms
This is too slow, but I don’t know how to go about optimizing it — I’m far from an expert when it comes to Postgres and all my attempts to set up an index haven’t budged the query plan a bit.
Some additional thoughts
Since my opening sequences always start from the beginning of the game — I might want to match on moves 1 to 3 or on moves 1 to 30, but never, say, from move 7 to 15 — I’ve also thought about also storing the move data as a space-delimited text string, and matching against the start of the string. (I don’t know how to optimize that query, either, but perhaps that would be easier.)
While these are currently arrays of strings, I could possibly represent both the moves and the evaluations as arrays of integers. (Not that I want to, but optimizing this query is more important, and if it helps, I’ll do it.)
What do you think? Where should I start?
Answer :
You need index support to be fast. Tricky without re-designing your table. The following solution should perform excellently (microseconds instead of seconds), but requires some skill. Buckle up.
Expression index on IMMUTABLE
function
Just take a few leading array elements, say 8. That should be very selective already. More would just make the index bigger, for little extra filtering.
Convert to a string. No separator. That allows false positives, but unlikely enough to matter. Eventually, we filter for exact results anyway.
Only IMMUTABLE
functions are allowed in index expressions. But array_to_string()
is only STABLE
, not IMMUTABLE
, because it takes anyarray
and some element types don’t have an immutable text representation. We only deal with text
(well, varchar(255)
for no good reason, but all the same), and that is, in fact, immutable. But array_to_string()
doesn’t know that.
So we could “fake” an immutable wrapper function. That’s possible for any user who can create functions:
CREATE OR REPLACE FUNCTION public.f_8moves(text[])
RETURNS text
LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT AS
$func$
SELECT array_to_string($1[1:8], '')
$func$;
Or, better yet, define an honestly IMMUTABLE
variant of array_to_string()
for only text[]
input, directly based on the underlying C function. Faster and cleaner. Let’s call it array_to_string_immutable()
to be clear. That requires superuser privileges:
-- SET ROLE postgres; -- you must be superuser
CREATE OR REPLACE FUNCTION public.array_to_string_immutable(text[], text)
RETURNS text
LANGUAGE internal IMMUTABLE PARALLEL SAFE STRICT AS 'array_to_text';
The rest works without superuser privileges.
CREATE OR REPLACE FUNCTION public.f_8moves(text[])
RETURNS text
LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT AS
$func$
SELECT public.array_to_string_immutable($1[1:8], '')
$func$;
Related:
Either way, we now have a function public.f_8moves(text[])
to be used in the following index:
CREATE INDEX records_8moves_idx ON records (public.f_8moves(moves) COLLATE "C");
COLLATE "C"
is exactly what we need to allow left-anchored (your expressed requirement) LIKE
expressions. See:
- Is there a difference between text_pattern_ops and COLLATE “C”?
- Is unique index better than unique constraint when an index with an operator class is required
If a major percentage of rows has evaluation IS NOT NULL
, add that filter to the index, to make it a partial index on top:
CREATE INDEX records_8moves_idx ON records (public.f_8moves(moves) COLLATE "C")
WHERE evaluation IS NOT NULL;
Query
For queries with 10 array elements or more:
SELECT evaluation[10]
FROM records
WHERE public.f_8moves(moves) = public.f_8moves('{b4,e5,Bb2,d6,Nf3,Nf6,g3,Bg4,Bg2,h5}') COLLATE "C"
AND moves[1:10] = '{b4,e5,Bb2,d6,Nf3,Nf6,g3,Bg4,Bg2,h5}'
AND evaluation IS NOT NULL;
COLLATE "C"
is required to match the index.
Use the expression public.f_8moves(moves)
to match the functional index. The right side of the expression can be given as string directly. But I use the same function for convenience.
Then add original exact filters to narrow the results down to exact matches:
AND moves[1:10] = '{b4,e5,Bb2,d6,Nf3,Nf6,g3,Bg4,Bg2,h5}'
Looks redundant, is logically redundant, but allows to use the index to great effect.
For arrays with less than 8 elements (our indexed lead), or generally, use LIKE
with a left-anchored pattern:
SELECT evaluation[10]
FROM records
WHERE public.f_8moves(moves) LIKE (public.f_8moves('{b4,e5}') || '%') COLLATE "C" -- COLLATE "C" is optional for LIKE
AND moves[1:2] = '{b4,e5}'
AND evaluation IS NOT NULL;
db<>fiddle here
Similar, with more explanation:
Cluster table rows
It will generally help performance to physically sort your table rows, so that each query can read results from one or few data pages, instead of fetching from all over the place.
Use CLUSTER
or the non-blocking alternatives pg_repack
or pg_squeeze
. More in the link above.