Question :
I am struggling with a query performance problem that exceeds my meager DBA skills. I am querying a MySQL database, details below.
Goal
I am trying to get a count of items as filtered by alignments to certain state standards and subjects therein. The filters differ depending on the item banks included in the query. For example in the query below, I am looking for all the Math items in item bank xyz that have at least one alignment to either CA and CC standards and all the Math and Science items in item bank abc that have at least one alignment to GA standards.
Query
SELECT COUNT(DISTINCT item.id)
FROM item
JOIN alignment ON item.id = alignment.item_id
JOIN standard ON standard.guid = alignment.guid
WHERE
((item.item_bank_id = 'xyz'
AND alignment.set_name in ('CA', 'CC')
AND standard.subject IN ('Math'))
OR
(item.item_bank_id = 'abc'
AND alignment.set_name in ('GA')
AND standard.subject IN ('Math','Science'))
)
Item Table
+--+------------+
|id|item_bank_id|
+--+------------+
|1 |abc |
+--+------------+
|2 |xyz |
+--+------------+
Alignment Table
+------------+--------+
|item_id|guid|set_name|
+-------+----+--------+
|1 |efc |CC |
+-------+----+--------+
|1 |8ef |GA |
+-------+----+--------+
|2 |8ef |GA |
+-------+----+--------+
Standard Table
+--+----+--------+-------+
|id|guid|set_name|subject|
+--+----+--------+-------+
|1 |efc |CC |Math |
+--+----+--------+-------+
|2 |8ef |GA |Science|
+--+----+--------+-------+
Plan
+----+-------------+-----------+-------+------------------------------------------------+--------------------+---------+-------------------+--------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+-------+------------------------------------------------+--------------------+---------+-------------------+--------+----------+--------------------------+
| 1 | SIMPLE | item | range | PRIMARY,item_vendor_id_unique,FK317B134288FA87 | FK317B134288FA87 | 113 | NULL | 245642 | 100.00 | Using where; Using index |
| 1 | SIMPLE | alignment | ref | FK695FA1E326027BBA,GUID,INDEX_SET_NAME | FK695FA1E326027BBA | 113 | ia.item.id | 1 | 100.00 | Using where |
| 1 | SIMPLE | standard | ref | guid,SUBJECT_INDEX | guid | 768 | ia.alignment.guid | 1 | 100.00 | Using where |
+----+-------------+-----------+-------+------------------------------------------------+--------------------+---------+-------------------+--------+----------+--------------------------+
- FK317B134288FA87 is a foreign key index on item.item_bank_id
- FK695FA1E326027BBA is a foreign key index on alignment.item_id
A few notes on this:
- There are indices on all columns used in this query, either as primary key (item.id, standard.id), foreign key (alignment.item_id), or a simple index (alignment.guid, alignment.set_name, standard.guid, and standard.subject)
- I have not yet found a suitable compound index that improves performance.
- The reason the link to standard is through the guid column and not the PK is due to the fact that the data in the standard table is regularly wiped out. The guid column values are constant and unique, but because the table is wiped, we can’t use guid as a primary key.
- There are many hundreds of thousands of rows in the item table, ~100K rows in the standard table, and 1.5M rows in the alignment table.
The crux of the issue seems to be alignment table, but I’ve been unsuccessful in adding new indices or reworking the query to make it more performant. Any hints or tips would be most appreciated!
=== UPDATE ===
Based on the advice from one of the answers, I modified the query to noticeably improve performance of the query:
SELECT COUNT(DISTINCT item.id)
FROM item
WHERE
((item.item_bank_id = 'xyz'
AND EXISTS(SELECT guid
FROM alignment
WHERE item_id = item.id
AND set_name in ('CA', 'CC')
AND subject IN ('Math')))
OR
(item.item_bank_id = 'abc'
AND EXISTS(select guid
FROM alignment
WHERE item_id = item.id
AND set_name in ('GA')
AND subject IN ('Math', 'Science'))))
- I copied the subject field to the alignment table (it’s as immutable as the guid itself). This removed a table from the query entirely.
- I re-worked the query to check for the existence of an alignment that meets the criteria. Since I just have to find one instance, I used a subquery to do so instead of a join.
I haven’t been able to figure out any better compound indices to further improve performance, but this gets me to an acceptable level.
Answer :
You are joining first, and once joined, applying WHERE clause.
Try to do in the opposite order: first filter with WHERE conditions and also needed columns from each table (alignment, standard, item), and after that do the joins.
That’s why join operation multiplies row tables increasing number of operations and hence cpu-load when table dimensions increase. So, is better to filter before joining.
You need some composite indexes for that query (I’ve beautify query a little):
SELECT COUNT(DISTINCT item.id)
FROM item AS w
JOIN alignment AS z ON z.item_id = w.id
JOIN standard AS q ON q.guid = z.guid
WHERE (
( w.item_bank_id = 'xyz'
AND z.set_name IN ('CA', 'CC')
AND q.subject IN ('Math')
)
OR
( w.item_bank_id = 'abc'
AND z.set_name IN ('GA')
AND q.subject IN ('Math','Science')
)
)
;
For each table involved you need an index that mentioned all the used columns in the JOIN
clause and in the WHERE
as well:
- The
item
table need the index(id, item_bank_id)
(in case of InnoDB the singleitem_bank_id
is enough) - The
alignment
table need the index(item_id, guid, set_name)
- The
standard
table need the index(guid, subject)
Here I suppose that id
, item_id
and guid
columns contains the unique values therefore proposed indexes have that specific orders of columns. But for other cases the different indexes like (guid, set_name, item_id)
can be more efficient. The actual order of columns in the index depend on the size of table(s), type of columns and the ratio count(DISTINCT)/count(*)
value.
Also for best efficiency you have to avoid literal values and replace them by IDs from the corresponding reference tables, if possible.
Add the indexed below, then see how fast this runs:
SELECT COUNT(DISTINCT id)
FROM
( SELECT i.id
FROM item AS i
JOIN alignment AS a ON i.id = a.item_id
JOIN standard AS s ON s.guid = a.guid
WHERE i.item_bank_id = 'xyz'
AND a.set_name in ('CA', 'CC')
AND s.subject IN ('Math')
) UNION
( SELECT i.id
FROM item AS i
JOIN alignment AS a ON i.id = a.item_id
JOIN standard AS s ON s.guid = a.guid
WHERE i.item_bank_id = 'abc'
AND a.set_name in ('GA')
AND s.subject IN ('Math','Science')
)
item: INDEX(item_bank_id, id), INDEX(id, item_bank_id)
alignment: INDEX(item_id, set_name, guid), INDEX(set_name, guid, item_id)
standard: INDEX(subject, guid), INDEX(guid, subject)