Best approach to apply a long list of LIKE patterns?

Posted on

Question :

As a follow-up to this question, I have one of my own.

The original question concerns the use of a CASE statement with > than 100 options and the statement has to be used in 4 places – so obviously the SQL is pretty hairy. The OP’s question concerned SQL Server 2012, my question, however, is about PostgreSQL.

In my answer, I proposed the use of a VIEW as a “one-stop-shop” solution – i.e. declare the VIEW once, use it anywhere – and this applies for any query in the future also and any variant thereof.

Another poster (@AndriyM) proposed the use of a CROSS APPLY to resolve the issue which is another solution. The PostgreSQL syntax is JOIN LATERAL

I then added a CTE (Common Table Expression) to my original answer as yet another possible solution.

So, the OP now has 5 options:

  1. CASE
  2. VIEW
  3. JOIN LATERAL (CROSS APPLY for SQL Server)
  4. CTE
  5. Separate table

I excluded the option of changing the underlying data since, frequently in this forum, consultants/DBA’s/programmers are not allowed change underlying data – makes the answers more interesting also!

Obviously, a CASE expression with > 100 options (x4) is horribly cumbersome and complex – but when is it a good idea to use CASE and at what point does it become a minus rather than a plus?

In my opinion (and not just because it’s my answer!), a VIEW is the optimal solution – it’s simple, will work for all RDBMS’s and is permanent and will work for all queries now and into the future should the OP wish to modify the query.

The JOIN LATERAL construct will work also as a kind of derived table, which is pretty much what a CTE is also. They can both be used down the line in the same query.

Which of the 5 approaches is better/best and at what point does the technical (ease of use, speed, query plan optimisation) tilt in the particular solution’s favour?

Answer :

I would use a translation table in a LATERAL subquery. Demo (Postgres 10+):

CREATE TABLE ac_translate (
   ord_nr int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
 , like_pattern text NOT NULL
 , target text NOT NULL
);

INSERT INTO ac_translate(like_pattern, target) VALUES 
   ('AIR NEW Z%'       , 'AIR NEW ZEALAND')  -- rows in order of precedence!
 , ('AIR BP%'          , 'AIR BP')
 , ('ADDICTION ADVICE%', 'ADDICTION ADVICE')
 , ('AIA%'             , 'AIA')
;

See:

Manipulate ord_nr to adjust priorities.

Query:

SELECT COALESCE(act.target, ac.accountName) AS accountname
     , SUM(ac.charge_amount) AS gstexcl
FROM   account_code ac
LEFT   JOIN LATERAL (
   SELECT a1.target
   FROM   ac_translate a1
   WHERE  ac.accountname LIKE a1.like_pattern
   ORDER  BY a1.ord_nr
   LIMIT  1
   ) act ON true
GROUP BY 1;

Or with a correlated subquery:

SELECT COALESCE(
        (SELECT a1.target
         FROM   ac_translate a1
         WHERE  ac.accountname LIKE a1.like_pattern
         ORDER  BY a1.ord_nr
         LIMIT  1), ac.accountName) AS accountname
     , SUM(ac.charge_amount) AS sum_amount
FROM   account_code ac
GROUP BY 1;

This is easy to handle, keeps the long list of options out of the code and puts it into a table where it can be maintained properly. And is moderately fast.

We can’t easily use a plain LEFT JOIN ac_translate since CASE traverses patterns in order to return the single, first match. We can’t just join to a set, that might return multiple matches if one pattern is the prefix of the other, like ‘AIR%’ and ‘AIR N%’. So we use an ordering number in the translation table to prioritize matches in the subquery.

The ELSE clause in the referenced question resolves to the original value. That’s implemented with COALESCE here. Basically, this combines the virtues of the top two answers over there.

To top it off, I slipped in GROUP BY 1 as another way to avoid repeating lengthy expressions (which is not actually needed here any more). See:

Speed

Performance deteriorates with the number of rows in the translation table since Postgres is forced to walk through all of them sequentially and evaluate the LIKE expression. If that’s not fast enough any more, we need index support, but the expression is not “sargable” – the expression we need to index is to the right of the operator and there is no COMMUTATOR for LIKE. Details:

There is a workaround, though. My example requires that patterns have at least 3 leading characters (3 is my arbitrary pick). Add a CHECK constraint in the translation table to enforce this rule, and an expression index on the leading trigram:

CREATE INDEX ac_translate_left_idx ON ac_translate (left(like_pattern, 3));

Adapt the query:

SELECT COALESCE(act.target, ac.accountName) AS accountname
     , SUM(ac.charge_amount) AS gstexcl
FROM   account_code ac
LEFT   JOIN LATERAL (
   SELECT a1.target
   FROM   ac_translate a1
   WHERE  left(ac.accountname, 3) = left(a1.like_pattern, 3)
   AND    ac.accountname LIKE a1.like_pattern
   ORDER  BY a1.ord_nr
   LIMIT  1
   ) act ON true
GROUP BY 1;

With enough rows in the translation table (and favorable estimates and cost settings) Postgres will use a very fast index scan to narrow it down to the few candidates (if any) and only filter the rest with the LIKE expression. Should scale just fine. I added the EXPLAIN output to the fiddle as proof of concept:

db<>fiddle here

Leave a Reply

Your email address will not be published. Required fields are marked *