Postgresql version 9.2.3!
I’m working on a database for mapping of chemical names. My main table contains aprox 91 million records and it’s indexed by gin. I want to query it with multiple names (I’m trying now with 100 entries), which I first put in a query table, create a tsquery column of the names and index it with gist.
Main table structure:
CREATE TABLE IF NOT EXISTS pubchem_compounds_synonyms_all ( cid int NOT NULL references pubchem_compounds_index(cid) ON UPDATE CASCADE ON DELETE CASCADE, name VARCHAR(2000) NOT NULL, synonym VARCHAR(2000) NOT NULL, PRIMARY KEY (cid, name, synonym) ); CREATE TEMPORARY TABLE synonyms_temp AS SELECT * FROM pubchem_compounds_synonyms_all; COPY synonyms_temp FROM '...' delimiters E't'; INSERT INTO pubchem_compounds_synonyms_all SELECT DISTINCT * FROM synonyms_temp;
I was trying different approaches, thus for testing gin index I created a clone:
CREATE TABLE synonyms_all_gin_tsvcolumn AS SELECT * FROM pubchem_compounds_synonyms_all;
ALTER TABLE synonyms_all_gin_tsvcolumn ADD COLUMN tsv_syns TSVECTOR; UPDATE synonyms_all_gin_tsvcolumn SET tsv_syns = to_tsvector(synonym); CREATE INDEX gin_tsv_colum_index ON synonyms_all_gin_tsvcolumn USING GIN (tsv_syns);
The query table is:
CREATE TABLE IF NOT EXISTS cmap_names_query ( name VARCHAR (2000) PRIMARY KEY NOT NULL );
Same as in the main table, I fill it in with COPY from via a temp table and then I add the tsquery column:
..... ALTER TABLE cmap_names_query ADD COLUMN lexemes TSQUERY; UPDATE cmap_names_query SET lexemes = plainto_tsquery(name); CREATE INDEX cmap_gist_tsquery_index ON cmap_names_query USING gist (lexemes tsquery_ops);
The query is basically a join between both tables:
SELECT DISTINCT ON (c.lexemes) c.name, s.cid, s.synonym, ts_rank(s.tsv_syns,c.lexemes) FROM synonyms_all_gin_tsvcolumn s JOIN cmap_names_query c ON c.lexemes @@ s.tsv_syns;
lexemes is the gist indexed tsquery column on my query table, whereas tsv_syns is the gin indexed tsvector column in the main names table, the one with 91 million records.
The query is intended to match names, exact matches if possible. It works very well for such a large table. Normal names, containing only characters, can be retreated even in microseconds. The problem is when the string names contains numbers. The operation tsvector and tsquery create one token for each number, and all together makes the query for this sort of entries rather slow, well, slower. Instead of a few milliseconds, they take aprox 1-2 seconds each. I would like to reduce this query time to a few milliseconds like the other entries, but I don’t know how. I have tested it with and without ts_rank to find out that ranking only add half a second to the total query, if it even makes a difference. so that’s not my problem
Some samples queries are:
Limit (cost=32.13..36.14 rows=1 width=92) (actual time=40.001..40.002 rows=1 loops=1) Buffers: shared hit=201 -> Bitmap Heap Scan on synonyms_all_gin_tsvcolumn (cost=32.13..100.24 rows=17 width=92) (actual time=39.998..39.998 rows=1 loops=1) Recheck Cond: (tsv_syns @@ plainto_tsquery('10-methoxyharmalan'::text)) Buffers: shared hit=201 -> Bitmap Index Scan on gin_tsv_colum_index (cost=0.00..32.13 rows=17 width=0) (actual time=39.984..39.984 rows=1 loops=1) Index Cond: (tsv_syns @@ plainto_tsquery('10-methoxyharmalan'::text)) Buffers: shared hit=200 Total runtime: 40.037 ms
result: (cid |name|synonym|tsv vector)
5474706 | 6-Methoxyharmalan | 10-Methoxyharmalan | '10':1 'methoxyharmalan':2
Limit (cost=32.23..36.23 rows=1 width=92) (actual time=2215.127..2215.128 rows=1 loops=1) Buffers: shared hit=10086 -> Bitmap Heap Scan on synonyms_all_gin_tsvcolumn (cost=32.23..148.34 rows=29 width=92) (actual time=2215.125..2215.125 rows=1 loops=1) Recheck Cond: (tsv_syns @@ plainto_tsquery('1,4-chrysenequinone'::text)) Buffers: shared hit=10086 -> Bitmap Index Scan on gin_tsv_colum_index (cost=0.00..32.22 rows=29 width=0) (actual time=2215.108..2215.108 rows=1 loops=1) Index Cond: (tsv_syns @@ plainto_tsquery('1,4-chrysenequinone'::text)) Buffers: shared hit=10085 Total runtime: 2215.182 ms
180933 | 1,4-Chrysenedione | 1,4-Chrysenequinone | '1':1 '4':2 'chrysenequinon':3
Limit (cost=32.00..36.02 rows=1 width=92) (actual time=2852.934..2852.936 rows=1 loops=1) Buffers: shared hit=7292 -> Bitmap Heap Scan on synonyms_all_gin_tsvcolumn (cost=32.00..36.02 rows=1 width=92) (actual time=2852.930..2852.930 rows=1 loops=1) Recheck Cond: (tsv_syns @@ plainto_tsquery('2-deoxy-D-glucose'::text)) Buffers: shared hit=7292 -> Bitmap Index Scan on gin_tsv_colum_index (cost=0.00..32.00 rows=1 width=0) (actual time=2852.897..2852.897 rows=121 loops=1) Index Cond: (tsv_syns @@ plainto_tsquery('2-deoxy-D-glucose'::text)) Buffers: shared hit=7291 Total runtime: 2852.992 ms
40 | 2-deoxy-D-glucose | 2-deoxy-D-glucose | '2':1 'd':4 'deoxi':3 'deoxy-d-glucos':2 'glucos':5
I wonder what the best way to make this last queries faster would be. I have tried with a pre-processing script that removes all the numbers, it speeds up the search up to 3 seconds in total, but I miss the exact/closest match that I was looking for in some of the cases, so that’s no use. Other approaches that came to mind where:
Sol 1: assign priority weight to word tokens in contrast to numeric ones
I think this could be a potential good solution for me, but as far as I have seen cannot be done. Tsvectors/queries can be labelled, but not token types or IS THERE A WAY TO LABEL TOKENS DIFFERENTLY WITHIN THE SAME TSVECTOR?
Sol 2: create a new configuration and delete numeric mappings with ALTER FULLTEXTMAPPING or add numbers and characters like - ( ) to the stopword list
Same as the parser, it might lead me to wrong matches, although since it keeps the positional information it me perform good. I’m not sure how i should do this though.
My postgres.conf parameters:
shared_buffers = 24GB checkpoint_segments=512 effective_cache_size=72GB work_mem = 1GB system specif. are: 4 x Intel X7550 (@2.00GHz) (total 32 Cores), 256GB Memory, CentOS 5.6 (64 bit), 18TB local disk (SATA-RAID) kernel.shmmax=64GB kernel.shmall=4GB
I have tried lower amounts of shared_buffer and effective_cache_size (16GB and 32GB respectively), no difference in performance from the current one, so I’m planing to change it back to those limits
I tried a gist index on querytree lexemes, didn’t make much difference
I’m a little bit lost and I would appreciate any ideas or possible solutions to speed up my queries.
PD: Any recommendations for nonSQL DBs that could improve performance?
Your ideal solution would probably be to write a custom, domain-data specific tsearch parser for your data, so it produced an ideal
tsquery for the input.
Alternatively, your search problem may be better suited to external tools like Apache Solr, which can work quite happily in conjunction with Pg.
When it comes to the “NoSQL” stuff (which NoSQL? Graph/tree? Key/value store? Object? Document? …) it all tends to come down to details of your data model and how willing you are to write your own distributed search algorithms. You can often gain significant scale-out performance at the cost of considerably greater implementation effort.