Question :
table_1
id | column_1 | import_date
Indexes: multi-column key (multi_column) on import_date and column_1, in addition to an individual index on each column.
table_2
id | column_1 | column_2
Indexes: multi-column key (multi_column) on column_1 and column_2, in addition to an individual index on each column.
i.e. each table has one composite index (given the same name, multi_column, in this question) as well as a separate index on each individual column.
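In other words, the layout is roughly this (the datatypes below are placeholders only; the real ones don't matter for the question, just the columns and indexes):

CREATE TABLE table_1 (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  column_1 VARCHAR(40),        -- placeholder type
  import_date DATETIME,        -- placeholder type
  INDEX column_1 (column_1),
  INDEX import_date (import_date),
  INDEX multi_column (import_date, column_1)
) ENGINE=InnoDB;

CREATE TABLE table_2 (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  column_1 VARCHAR(40),        -- placeholder type
  column_2 VARCHAR(40),        -- placeholder type
  INDEX column_1 (column_1),
  INDEX column_2 (column_2),
  INDEX multi_column (column_1, column_2)
) ENGINE=InnoDB;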
Query
I’ve tried flipping the query around and playing with using/forcing different indexes but it didn’t seem to help.
SELECT table_1.import_date, count(*) AS count
FROM table_1
FORCE INDEX (import_date)
JOIN table_2 USE INDEX (multi_column) ON table_1.column_1 = table_2.column_1 AND column_2 = "Some String"
WHERE table_1.import_date >= '2016-05-26'
GROUP BY table_1.import_date
LIMIT 365
I tried adding in the LIMIT, and also a WHERE clause checking that the date was greater than the same date last year, just to try to prevent a large number of results, but neither seemed to improve the performance of the query; both tables have a large record set.
EXPLAIN
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra
1 | SIMPLE | table_1 | NULL | range | multi_column,import_date | import_date | 6 | NULL | 1932171 | 100.00 | Using index condition; Using where
1 | SIMPLE | table_2 | NULL | ref | multi_column | multi_column | 126 | table_1.column_1,const | 117 | 100.00 | Using index
Sorry for the obfuscated table and column names; I'm trying to keep it simple.
Original EXPLAIN, without the FORCE/USE hints:
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra
1 | SIMPLE | table_1 | NULL | index | multi_column,column_1,import_date | import_date | 69 | NULL | 365 | 100.00 | Using where; Using index
1 | SIMPLE | table_2 | NULL | ref | multi_column,column_1,column_2 | multi_column | 126 | table_1.column_1,const | 117 | 100.00 | Using index
Row counts for each table are 3,937,236 and 4,503,551 respectively.
The query currently takes around 80 s. How long should it take? I'm not sure, but I imagined that with indexes it would hopefully run in under a second.
Running locally on my development machine just now: Mac OS X El Capitan, 2.2 GHz i7, 16 GB RAM.
MySQL version 5.7.15
Both tables are using InnoDB.
This query seems to take it down to roughly 20 seconds:
select import_date, count(*) as count
from table_1
where exists (select * from table_2 where table_1.column_1 = table_2.column_1 and column_2 = 'Some String')
group by import_date
Answer :
How about changing FORCE INDEX (import_date) to FORCE INDEX (multi_column)?
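That is, the same query with only the hint on table_1 changed (a sketch, untested):

SELECT table_1.import_date, count(*) AS count
FROM table_1 FORCE INDEX (multi_column)
JOIN table_2 USE INDEX (multi_column) ON table_1.column_1 = table_2.column_1 AND column_2 = "Some String"
WHERE table_1.import_date >= '2016-05-26'
GROUP BY table_1.import_date
LIMIT 365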
That EXPLAIN shows that it is reading 1,932,171 rows from table_1, which explains why it takes a long time. What is the query plan when you don't give index hints?[1]
[1] Add this extra info to your question, along with the extra detail others have suggested, rather than in a comment.
You may be hitting the issue of MySQL only being able to use one index per table, and needing two in this case to be efficient: one to filter on the date and one to perform the JOIN operation efficiently. It may be helpful to the query planner to have an index that covers (import_date, column_1) instead of just import_date.
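Expressed as DDL, that would be something like the following (the index name is illustrative; note that the multi_column key you describe on table_1 may already be exactly this):

ALTER TABLE table_1 ADD INDEX ix_date_col1 (import_date, column_1);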
Slightly modified query
SELECT table_1.import_date, count(*) AS count
FROM table_1
JOIN table_2 ON table_1.column_1 = table_2.column_1
AND table_2.column_2 = "Some String"
WHERE table_1.import_date >= '2016-05-26'
GROUP BY table_1.import_date
ORDER BY table_1.import_date
LIMIT 365
Without an ORDER BY, you can't be sure which 365 rows you will get, so I added one.
Analysis without the hints.
- table_1: INDEX(import_date) will take care of the filtering, grouping, and sorting, thereby leading to only 365 rows of table_1 needing to be touched. INDEX(import_date, column_1), in this order, is an improvement on that by making it “covering”.
- table_2: INDEX(column_1, column_2), in either order, is then perfect for reaching table_2. Furthermore, this index is “covering” (see the DDL sketch after this list).
- EXPLAIN indicates “covering” by saying “Using index”.
- “Using index condition” (alias ICP or Index Condition Pushdown) is a different optimization, often less useful.
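For table_2, that index expressed as DDL would be along these lines (the name is illustrative; again, the existing multi_column key on table_2 already appears to cover it):

ALTER TABLE table_2 ADD INDEX ix_col1_col2 (column_1, column_2);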
Bottom line: That should have been the best way to write the query. Since it did not seem to be, let me ramble on about other possibilities.
How much I/O
If you are using InnoDB, what is the setting of innodb_buffer_pool_size? For larger machines, it should be about 70% of RAM.
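To check the current value, and (on 5.7) resize it without a restart, something like this; the 8 GB figure is only an example, and on a 16 GB dev machine that also runs other things you would pick something smaller:

SELECT @@innodb_buffer_pool_size / 1024 / 1024 AS buffer_pool_mb;
SET GLOBAL innodb_buffer_pool_size = 8 * 1024 * 1024 * 1024;  -- dynamic since 5.7.5; otherwise set it in my.cnf and restart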
Rule of Thumb: The data in a table or an index is stored in a BTree structure with about 100 ‘rows’ per 16KB leaf-node block.
A Covering index will do all the work in the index. For table_1, that is about 4 blocks to get the needed info. table_2 is messier to compute. It will probably involve 365 probes into the index, together with reading (and counting) some number of rows each.
You cannot get the query to run any faster other than with shrinking datatypes, adding RAM, using a Summary table, etc. — various things that you cannot control from the indexes or the formulation of the query.
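If you do go the Summary-table route, here is a minimal sketch (the table and column names are made up, and it assumes import_date can be treated as a daily date):

CREATE TABLE daily_counts (
  import_date DATE NOT NULL PRIMARY KEY,
  match_count INT UNSIGNED NOT NULL
) ENGINE=InnoDB;

-- Refresh off-peak (or incrementally); the report query then only has to range-scan daily_counts.
INSERT INTO daily_counts (import_date, match_count)
SELECT t1.import_date, COUNT(*)
FROM table_1 AS t1
JOIN table_2 AS t2 ON t1.column_1 = t2.column_1 AND t2.column_2 = "Some String"
GROUP BY t1.import_date
ON DUPLICATE KEY UPDATE match_count = VALUES(match_count);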
Effect of LIMIT, etc
Any of these can change the EXPLAIN. Please focus on whatever matches your real query:
- ORDER BY + LIMIT leads to a significant optimization wherein it can avoid gathering all the data, then sorting.
- Adding columns to the SELECT: this is likely to kill the “covering” that I am taking advantage of. This can lead to a significant de-optimization. It would have to bounce between the index BTree and the data BTree, usually more than doubling the work (see the sketch after this list).
- If the filtering (WHERE) hits more than an estimated 20% of the table, the Optimizer will probably ignore the index and simply do a table scan (to avoid the bouncing).
- And many more.
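For example, pulling in a column that is in neither index (some_other_column here is purely hypothetical) would force a lookup into the data BTree for each table_1 row in the date range:

SELECT table_1.import_date, MAX(table_1.some_other_column), count(*) AS count
FROM table_1
JOIN table_2 ON table_1.column_1 = table_2.column_1
           AND table_2.column_2 = "Some String"
WHERE table_1.import_date >= '2016-05-26'
GROUP BY table_1.import_date
ORDER BY table_1.import_date
LIMIT 365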
Fair timing
I brought up the buffer_pool_size and the details of the operation, because it sounded like the timings you took may not have been using a “warm” cache. Please run each query twice (with the Query cache turned off or bypassed); take the second timing. It will give you a fair comparison.
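For example (SQL_NO_CACHE is available through 5.7 and bypasses the query cache for that one statement):

SELECT SQL_NO_CACHE table_1.import_date, count(*) AS count
FROM table_1
JOIN table_2 ON table_1.column_1 = table_2.column_1
           AND table_2.column_2 = "Some String"
WHERE table_1.import_date >= '2016-05-26'
GROUP BY table_1.import_date
ORDER BY table_1.import_date
LIMIT 365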