Is there a way to speed up JOIN ON LIKE?

Posted on

Question :

I have millions of things in a table. I’d like to assign groups to the things based on theirs prefixes.

I tried to add all the indexes but the planner still does nested loop left join.

A demo case SQL:

create table things as
    i as id, 
    left(md5(random()::text), 8) as name
from generate_series(1, 100000) as i;

create table match_group_rules as
    i as id,
    trunc(random() * 5 + 1) as group_id,
    left(md5(random()::text), 2) as rule
from generate_series(1, 100) as i;

create extension if not exists pg_trgm;
create index match_group_rules_rule on match_group_rules (rule);
create index match_group_rules_rule_pattern on match_group_rules (rule text_pattern_ops);
create index things_name_idx on things (name);
create index things_name_pattern_idx on things (name text_pattern_ops);
create index things_name_gin_trgm_idx on things using gin (name gin_trgm_ops);
create index things_name_gist_trgm_idx on things using gist (name gist_trgm_ops);

select *
from things t
left join match_group_rules r 
    on like r.rule || '%';

The demo case planner output:

Nested Loop Left Join  (cost=0.00..176543.25 rows=100000 width=57)
  Join Filter: ( ~~ (r.rule || '%'::text))
  ->  Seq Scan on things t  (cost=0.00..1541.00 rows=100000 width=13)
  ->  Materialize  (cost=0.00..2.50 rows=100 width=44)
        ->  Seq Scan on match_group_rules r  (cost=0.00..2.00 rows=100 width=44)


  1. What makes the planner ignore the indexes?
  2. Is there a quick way to compute the groups for the things?

Answer :

The initial problem is the left join. That restricts it to using things as the driving table, but once you do that there is no way to use the indexes, as pg_trgm and text_pattern_ops offer ways to index for texts matching a given pattern, not for patterns matching a given text.

In order to use an index on the left join, you would have to first do a nested loop with match_group_rules driving, memorize all the rows of things retrieved, and then go back and make up NULL rows for the things not found. There is no inherent reason that I can see that PostgreSQL wouldn’t be able to do this complex operation, it just isn’t implemented. You can do it yourself with a CTE and a UNION ALL:

with foobar as 
(select as t_id, name,, r.group_id, r.rule
   from things t
   join match_group_rules r 
     on like r.rule || '%'
select * from foobar 
    union all
select id, name, NULL, NULL, NULL from things t2 where not exists 
 (select 1 from foobar where;

Leave a Reply

Your email address will not be published. Required fields are marked *