Question :
I have a table (civicrm_contact) with the following (relevant) columns:
+--------------------------------+------------------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+--------------------------------+------------------+------+-----+-------------------+-----------------------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| contact_type | varchar(64) | YES | MUL | NULL | |
| first_name | varchar(64) | YES | | NULL | |
| last_name | varchar(64) | YES | MUL | NULL | |
I’m looking to optimize a query that does a partial match comparison of the first/last name fields.
SELECT t1.id id1, t2.id id2, 2 weight
FROM civicrm_contact t1
JOIN civicrm_contact t2
ON (SUBSTR(t1.first_name, 1, 4) = SUBSTR(t2.first_name, 1, 4))
AND (SUBSTR(t1.last_name, 1, 6) = SUBSTR(t2.last_name, 1, 6))
WHERE t1.contact_type = 'Individual'
AND t2.contact_type = 'Individual'
AND t1.first_name IS NOT NULL
AND t1.first_name <> ''
AND t1.last_name IS NOT NULL
AND t1.last_name <> ''
AND t1.id < t2.id
I created an index to match the partial matches. There are other existing indexes as well.
CREATE INDEX idx_ct_last6_first4_name ON civicrm_contact (contact_type, last_name(6), first_name(4));
However, the query remains very slow, and when I EXPLAIN it, I can see that the first table does not use that index and the second table does not use any index.
+----+-------------+-------+------------+------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+---------+-------+-------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+---------+-------+-------+----------+----------------------------------------------------+
| 1 | SIMPLE | t1 | NULL | ref | PRIMARY,index_contact_type,index_first_name,index_last_name,dedupe_index_first_name_4,dedupe_index_last_name_6,idx_last6_first4_name,idx_ct_last6_first4_name | index_contact_type | 195 | const | 46968 | 25.00 | Using where |
| 1 | SIMPLE | t2 | NULL | ALL | PRIMARY,index_contact_type,idx_ct_last6_first4_name | NULL | NULL | NULL | 93936 | 16.66 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+-------+------------+------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+---------+-------+-------+----------+----------------------------------------------------+
I’m trying to understand why it’s not making better use of the indexes and what I can do to improve/optimize the query. It is intended to do a simple comparison against itself to identify duplicate values. Is the lack of index use because I’m joining the table against itself?
If so, I’m guessing my best option is to create a temp table for the second table reference so that I am joining two separate tables, which should make better use of the indexes.
Answer :
I think it cannot pick up the index because of the substr
expression. It isn’t smart enough to get, that this would exactly match the indexed keys. You’ll have the same problems with a temporary table. The problem is not, that you’re self joining civicrm_contact
.
If you’re on version 8.0 you could use stored generated columns, that generate the substrings.
ALTER TABLE civicrm_contact
ADD (last_name_6 varchar(6)
AS (substr(last_name, 1, 6))
STORED,
first_name_4 varchar(4)
AS (substr(first_name, 1, 4))
STORED);
Stored generated columns can be indexed and you can then use these columns instead of the original ones in the substr()
.
For 5.7 to support generated columns you’d need to use NDB Cluster (but I don’t think, that’s the case). Versions below 5.7 don’t support it at all. However, as a general workaround for lower versions than 8.0, you could also add regular columns and fill them with the substrings initially with an UPDATE
and subsequently with a trigger.
Then, for further improvement I’d try to move contact_type
to the end of the index as I guess it’s selectivity is comparatively low.
CREATE INDEX idx_last6_first4_name_ct
ON civicrm_contact (last_name_6,
first_name_4,
contact_type);
(Assuming, that last_name_6
and first_name_4
are the generated columns from above.)
Possibly also try to include id
at the very end as this is a ranged condition rather than a pointed one.
CREATE INDEX idx_last6_first4_name_ct_id
ON civicrm_contact (last_name_6,
first_name_4,
contact_type,
id);
Or maybe play a little more with the order of the columns and see if there’s anything better.
However one question is: Is this a one-off thing or do you plan to run this query frequently? If it’s just a one-off, the effort might outweigh the gain. Maybe just running the query and letting it take it’s time, is the overall more “efficient” way (in terms of human work) in such a case.
(contact_type, last_name(6), first_name(4))
— Don’t use prefix indexing; it almost never lets the index be used.- Don’t hide indexed columns in functions (
substr
, in your example). One workaround requires a newer version (see stickybit’s answer). Another workaround may be possible using ranges instead of equality. NULL
and<>
— Pick one; the test for both hurts performance, possibly preventing use of an index. A possible workaround:x > ''
will fail for botheNULL
and<>
, which seems to be what you desire.- Even if you get the index to be used, you have a scaling problem with
AND t1.id < t2.id
— this sounds like a quadratic (Order(N*N)) algorithm? - If this is a one-time exercise, create another table with the substrings in it, plus getting rid of
NULL
and blank entries. Also limit it to “Individuals”. Now theWHERE
is much simplified. HavePRIMARY KEY(first_prefix, last_prefix, id), INDEX(id)
for added performance. After that, we can help you further optimize the query.