PostgreSQL, integer arrays, index for equality

Posted on

Question :

I have a huge list of integer arrays (300,000,000 records) stored in Postgres 9.2 DB. I want to efficiently search these records for an exact match (equality only). I have heard of the intarray module and the corresponding gist-gin indexes. I would like to ask the following questions:

  • Does PostgreSQL uses a hash function for checking equality of integer arrays or does it perform a brute-force algorithm comparing one-by-one the elements of the array?
  • If PostgreSQL uses a hash function, is there some PostgreSQL function code to actually get the hash function result for a particular array?
  • Which index will be best for such a task? B-tree, or the gist – gin indexes provided by the intarray module? The dataset will be static, i.e, once all records are inserted there will no more insertions. So, building index / updating index time is not important to me.

Answer :

1) as you already have discovered, you can’t use b-tree as the index size is bigger than the page size

2) given:

As a rule of thumb, a GIN index is faster to search than a GiST index, but slower to build or update; so GIN is better suited for static data and GiST for often-updated data.

You would have to use GIN. And no, GIN doesn’t use hash functions nor a brute-force algorithm. It’s a reverse index:

A GIN index stores a set of (key, posting list) pairs, where a posting list is a set of row IDs in which the key occurs. The same row ID can appear in multiple posting lists, since an item can contain more than one key. Each key value is stored only once, so a GIN index is very compact for cases where the same key appears many times.

Internally, a GIN index contains a B-tree index constructed over keys, where each key is an element of one or more indexed items (a member of an array, for example)

Q: Does PostgreSQL uses a hash function for checking equality of integer
arrays or does it perform a brute-force algorithm comparing one-by-one
the elements of the array?

Not according to Array Functions and Operators in the doc:

Array comparisons compare the array contents element-by-element, using
the default B-tree comparison function for the element data type

No mention of hashing.

intarray provides other operators but does not replace the equality operator between int[]. The closest function _int_same() that it exposes is semantically different (the order of elements does not matter) and is implemented as sorting+sequential comparison, not hashing.


Fortunately implementing fast hash-based search at the SQL level is not hard, and in your case (large arrays, no updates, exact match) it may even be the most effective method.

Steps:

1) choose a hash function. I’d suggest md5 on the text representation of the array:

create function arr_hash(int[]) returns bytea as
$$ select digest($1::text, 'md5');$$
language sql immutable;

The function digest(text,text) is part of the pgcrypto extension. Compared to md5 it has the advantage of producing binary (16 bytes) instead of hexadecimal (32 bytes) for a leaner index.

2) create a functional index:

create index index_name on table_name(arr_hash(col_name));

It will be several orders of magnitude faster than a GIN index for the kind of dataset you have (actually I would worry about the GIN index creation taking a really unreasonable time, but do try it).

3) use it like this:

select 1 from table_name
 where arr_hash(col_name)=arr_hash('{10,20,30,...lot of values}'::int[])
 and   col_name='{10,20,30,...lot of values}'::int[];

Leave a Reply

Your email address will not be published. Required fields are marked *