Why is MySQL not using the index with the higher cardinality?

Posted on

Question :

I have an index called source with cardinality of 1122 and a second index called state with cardinality of 22. MySQL keeps using the state index when when query SELECT C1 FROM tbl WHERE source = 'x' and state = 'y'. Is this normal? If not, what could be going wrong?

Answer :

Cardinality

Even with high cardinality, the tipping point used by the MySQL Query Optimizer is either the key distribution or the storage engine.

Back on November 13, 2012, I discussed how lopsided keys can make the Query Optimizer choose different indexes (sometime not choose and index at all) : Must an index cover all selected columns for it to be used for ORDER BY?

In that post, I loaded a table call mf (Male Female) and stored a gender M or F. I inserted 37 Ms and 3Fs. I then ran explain plans on both Male and Female values crossed with MyISAM and InnoDB. Those choices of keys and storage engines yielded some interesting results.

Your Query

What you need to do is examine how distributed (or lopsided) your keys are.

You need to run the following two queries

SELECT IFNULL(source,'Total'),COUNT(1) RowCount
FROM tbl GROUP BY source WITH ROLLUP;

SELECT IFNULL(state,'Total'),COUNT(1) RowCount
FROM tbl GROUP BY state WITH ROLLUP;

This will give you a count of each value along with a total row count.

In my earlier post, I said I use a 5% rule-of-thumb. Look at the rowcounts for each value. If a rowcount for any particular source or state exceed 5% of the table’s rowcount, the Query Optimizer throws one index under the bus and tries another index. In rare cases, it may just a do a full table scan.

If any key with a low rowcount does not yield a good index choice, you may need to recompute the index statistics for the MyISAM table (especially if the MyISAM table experienced lots of INSERTs, UPDATEs, and DELETEs, thus making the index statistics stale). Just run this:

ANALYZE TABLE tbl;

or to defragment the MyISAM table and recompute the index statistics, run

OPTIMIZE TABLE tbl;

Give it a Try !!!

UPDATE 2014-04-12 18:06 EDT

I like the answer posted by Alexandros (+1 for you).

I would like to update and expand my explanation by means of his answer.

You mentioned the following cardinalities

  • state has 22
  • source has 1122

That being the case, you need to run the following

ALTER TABLE tbl DROP INDEX state;
ALTER TABLE tbl ADD INDEX state_source_index (state,source);

Do not reverse the order of the columns. You should always index lower cardinality columns first.

OK, creating that index will make life better for your query. Or, will it ???

After creating the index, please run this query

SELECT
    IF(ISNULL(state)=1,'Total',
    CONCAT('Total Sources for ',state)) Statistic,RowCount
FROM
(
    SELECT state,source COUNT(1) RowCount
    FROM tbl GROUP BY state,source WITH ROLLUP
) A;

This will give you a count of all (state,source) key combinations in the table and subtotals per state. Any RowCount of any key combination that exceeds 5% of the table will most likely result in a table scan.

UPDATE 2014-04-15 16:43 EDT

Your last comment was

I am learning a lot from your answer. Thank you. What strategy would you suggest when when the key and multi-column key are lopsided? I don’t understand why I should index lower cardinality columns first?

The order of the columns will help depending on the type of query you need.

If you run queries like these

  • SELECT ... FROM tbl ORDER BY state,source;
  • SELECT ... FROM tbl WHERE state='NY' ORDER BY source;
  • SELECT source,COUNT(1) RowCount FROM tbl WHERE state='NY' GROUP BY source;

Then, the index should be (state,source)

If you run queries like these

  • SELECT ... FROM tbl ORDER BY source,state;
  • SELECT ... FROM tbl WHERE source='blog' ORDER BY state;
  • SELECT state,COUNT(1) RowCount FROM tbl WHERE source='blog' GROUP BY state;

Then, the index should be (source,state).

I usually line of columns with lower cardinality first to make index scans a little smoother for queries where the lower cardinality is involved in the ORDER BY or GROUP BY along with other columns.

If you are not comfortable with the index being (state,source), you coudl just make both

ALTER TABLE tbl ADD INDEX state_source_index (state,source);
ALTER TABLE tbl ADD INDEX source_state_index (source,state);

Run your EXPLAIN plans on your queries and see which index get used more often.

Whatever key combination are lopsided and whatever order you have the columns in the compound index, evaluating the range and order of data is faster if columns being used are indexed in groups.

UPDATE 2014-04-16 12:17 EDT

If each (source,state) key combination needs to be ordered by date, you need to incorporate the date column in the index as the last column.

ALTER TABLE tbl ADD INDEX state_source_date_index (state,source,date);
ALTER TABLE tbl ADD INDEX source_state_date_index (source,state,date);

That way the ORDER BY does not have to trigger a sort.

If you execute frequently those kind of queries:

SELECT C1 FROM tbl WHERE source = 'x' and state = 'y'

Then it might be advisable to build an additional index on both source and state. Maybe this question Should I create an index for non key columns? might help you to understand this concept.

Leave a Reply

Your email address will not be published. Required fields are marked *