Think I have the right indexes and query, but MySQL Query is still slow

Posted on

Question :

table_1

id | column_1 | import_date

Indexes: Mutli-column key (multi_column) of import_date and column_1 in addition to an individual index for each column.

table_2

id | column_1 | column_2

Indexes: Multi-column key (multi_column) for column_1 and column_2 in addition to an individual index for each column.

i.e. both tables have two composite indexes (of the same name in this question) as well as separate indexes.

Query

I’ve tried flipping the query around and playing with using/forcing different indexes but it didn’t seem to help.

SELECT table_1.import_date, count(*) AS count
FROM table_1
FORCE INDEX (import_date)
JOIN table_2 USE INDEX (multi_column) ON table_1.column_1 = table_2.column_1 AND column_2 = "Some String"
WHERE table_1.import_date >= '2016-05-26'
GROUP BY table_1.import_date
LIMIT 365

I tried adding in the LIMIT and a where clause to check where the date was greater than the same date last year just to try and prevent a large number of results and filter but neither seemed to improve the performance of the query, both tables have a large record set.

EXPLAIN

id  select_type table   partitions  type    possible_keys   key key_len ref rows    filtered    Extra
1   SIMPLE  table_1 NULL    range   multi_column,import_date    import_date 6   NULL    1932171 100.00  Using index condition; Using where
1   SIMPLE  table_2 NULL    ref multi_column    multi_column    126 table_1.column_1,const  117 100.00  Using index

Sorry for obfuscated table and column names, trying to keep it simple.

Original Explain without force/use:

id  select_type table   partitions  type    possible_keys   key key_len ref rows    filtered    Extra
1   SIMPLE  table_1 NULL    index   multi_column,column_1,import_date   import_date 69  NULL    365 100.00  Using where; Using index
1   SIMPLE  table_2 NULL    ref multi_column,column_1,column_2  multi_column    126 table_1.column_1,const  117 100.00  Using index

Row counts for each table are 3,937,236 and 4,503,551 respectively.

The query currently takes around 80s, how long should it take, I’m not sure but I imagined with indexes then hopefully under a second.

Running locally on my development machine just now, Mac OS X El Capitan, 2.2 GHz i7 16gb ram.

MySQL version 5.7.15

Both tables are using InnoDB.

This query seems to take it down to roughly 20 seconds:

select import_date, count(*) as count
from table_1
where exists (select * from table_2 where table_1.column_1 = table_2.column_1 and column_2 = 'Some String')
group by import_date

Answer :

How about FORCE INDEX (import_date) to FORCE INDEX (multi_column)?

That EXPLAIN shoes that it is reading 1,932,171 rows from table_1 which explains it taking a long time. What is the query plan when you don’t give index use hints?[1].

[1] Add this extra info to your question, along with the extra detail others have suggested, rather than in a comment

You may be hitting the issue of it only being able to use one index per table, and needing to use two in this case to be efficient: one to filter on date and one to perform the JOIN operation efficiently. It may be helpful to the query planner to have an index that covers import_date, column_1 instead of just import_date.

Slightly modified query

SELECT  table_1.import_date, count(*) AS count
    FROM  table_1
    JOIN  table_2  ON table_1.column_1 = table_2.column_1
                 AND  table_2.column_2 = "Some String"
    WHERE     table_1.import_date >= '2016-05-26'
    GROUP BY  table_1.import_date
    ORDER BY  table_1.import_date
    LIMIT  365

Without an ORDER BY, you can’t be sure which 365 rows you will get, so I added such.

Analysis without the hints.

  • table_1: INDEX(import_date) will take care of the filtering, grouping, and sorting, thereby leading to only 365 rows of table_1 needing to be touched.
  • INDEX(import_date, column_1), in this order, is an improvement on that by making it “covering”.
  • table_2: INDEX(column_1, column_2), in either order, is then perfect for reaching table_2. Furthermore, this index is “covering”
  • EXPLAIN indicates “covering” by saying “Using index”.
  • “Using index condition” (alias ICP or Index Condition Pushdown) is a different optimization, often less useful.

Bottom line: That should have been the best way to write the query. Since it did not seem to be, let me ramble on about other possibilities.

How much I/O

If you are using InnoDB, what is the setting of innodb_buffer_pool_size. For larger machines, it should be about 70% or RAM.

Rule of Thumb: The data in a table or an index is stored in a BTree structure with about 100 ‘rows’ per 16KB leaf-node block.

A Covering index will do all the work in the index. For table_1, that is about 4 blocks to get the needed info. table_2 is messier to compute. It will probably involve 365 probes into the index, together with reading (and counting) some number of rows each.

You cannot get the query to run any faster other than with shrinking datatypes, adding RAM, using a Summary table, etc. — various things that you cannot control from the indexes or the formulation of the query.

Effect of LIMIT, etc

Any of these can change the EXPLAIN. Please focus on whatever matches your real query:

  • ORDER BY + LIMIT leads to a significant optimization wherein it can avoid gathering all the data, then sorting.
  • Adding columns to the SELECT — This is likely to kill the “covering” that I am taking advantage of. This can lead to a significant de-optimization. It would have to bounce between the index BTree and the data BTree, usually more than doubling the work.
  • If the filtering (WHERE) hits more than an estimated 20% of the table, the Optimizer will probably ignore the index and simply do a table scan (to avoid the bouncing).
  • And many more.

Fair timing

I brought up the buffer_pool_size and the details of the operation, because it sounded like the timings you took may not have been using a “warm” cache. Please run each query twice (with the Query cache turned off or bypassed); take the second timing. It will give you a fair comparison.

Leave a Reply

Your email address will not be published. Required fields are marked *