Can anyone explain how the LIKE operator is implemented in current database systems (e.g. MySQL or Postgres)? or point me to some references that explain it?
The naive approach would be to inspect each record, executing a regular expression or partial string match on the field of interest, but I have a feeling (hope) that these systems do something smarter.
No, that’s pretty much what they’re doing. Now, if there is not a leading wildcard and the field is indexed, which is the usual situation, the database engine can apply the regular expression to the index. So, for example, if you write
SELECT * FROM employees WHERE last_name LIKE 'Cav%'
the database can use the index on
LAST_NAME to find all the rows where the last name begins ‘Cav’. On the other hand, if you had something like
SELECT * FROM employees WHERE last_name LIKE '%av%'
the database would have to scan the entire table (or the entire index) and evaluate the expression against the full
LAST_NAME value. Obviously, that’s very expensive.
Most of the better relational databases have facilities to do full-text search in a more efficient manner by constructing different sorts of indexes and text catalogs but these don’t use the LIKE keyword. For example, here’s a nice article that discusses full-text search in PostgreSQL.
In addition to what Justin Cave wrote, since PostgreSQL 9.1 you can speed up any search with
~~*), and basic regular expression matches, too (
~). Use the operator classes provided by the module pg_trgm with a GIN or GiST index to speed up
LIKE expressions that are not left-anchored. To install the extension, run once per database:
CREATE EXTENSION pg_trgm;
Create an index of the form
CREATE INDEX tbl_col_gin_trgm_idx ON tbl USING gin (col gin_trgm_ops);
CREATE INDEX tbl_col_gist_trgm_idx ON tbl USING gist (col gist_trgm_ops);
Creating and maintaining a GIN or GiST index carries a cost, but if your table is not heavily written, this is a great feature for you.
GIN or GiST?
These two quotes from the manual should provide some guidance
The choice between GiST and GIN indexing depends on the relative
performance characteristics of GiST and GIN, which are discussed
elsewhere. As a rule of thumb, a GIN index is faster to search than a
GiST index, but slower to build or update; so GIN is better suited for
static data and GiST for often-updated data.
But for “nearest neighbour” type of queries with the using the distance operator
This can be implemented quite efficiently by GiST indexes, but not by
Speaking about MySQL, the position of the wild-card character (%) makes a difference. If the first part of the text is specified like
where first_name like 'Sta%', then the DB engine will search only a smaller subset of words staring with S, then going to St, and then Sta, etc. If you do something like
where first_name like '%stan%', then and entire scan of the column will be required.
You can also look into full-text indexes that also does natural language searches. Check out the MySQL docs here.