I’m using MySQL 5.6 and my storage engine is InnoDB.
I have a table with 1 million rows containing with the columns:
- ID (primary key)
- foreign_key_id (foreign key, NOT NULL)
- foreign_key_id2 (another foreign key, default NULL)
The rows are seperated under:
- 25% with foreign_key_id value 1 and foreign_key_id2 NULL
- 25% with foreign_key_id value 1 and foreign_key_id2 NOT NULL
- 25% with foreign_key_id value 2 and foreign_key_id2 NULL
- 25% with foreign_key_id value 2 and foreign_key_id2 NOT NULL
With the following indexes:
- index foreign_key_idx on foreign_key_id
- index foreign_key_2_idx on and foreign_key_id2
- composite index foreign_key_comp_idx on (foreign_key_idx, foreign_key_2_idx)
I perform the following queries:
Query 1 – without indexes:
SELECT * FROM table tbl IGNORE INDEX(foreign_key_idx, foreign_key_2_idx, foreign_key_comp_idx) WHERE tbl.foreign_key_id = 1 AND tbl.foreign_key_id2 IS NOT NULL
Query 2 – with indexes (no composite index):
SELECT * FROM table tbl IGNORE INDEX(foreign_key_comp_idx) WHERE tbl.foreign_key_id = 1 AND tbl.foreign_key_id2 IS NOT NULL
Query 3 – with composite index (no other indexes):
SELECT * FROM table tbl IGNORE INDEX(foreign_key_idx, foreign_key_2_idx) WHERE tbl.foreign_key_id = 1 AND tbl.foreign_key_id2 IS NOT NULL
Query 1 (no indexes) performs a full table scan and uses 1 million records with a
total duration of 0.37 seconds.
Query 2 (indexes, no composite index) performs a non-unique key lookup on foreign_key_idx index and
uses 500K records with a total duration of 0.6 seconds.
Query 3 (composite index only) performs an index range scan on composite index and uses 480K
records with a total duration of 0.13 seconds.
What I really don’t understand is: why is query 2 (with indexes) always performing slower than query 1 (without indexes)? I’m really really stuck and need some help…
I’ve tested the queries above with different amount of rows, like 1k, 10k, 20k, 50k, 100k, 200k, 250k, 500k, 1M etc, always with the same ratio (25%), and the results where the same (query 2 always performing slow)
Thank you in advance, really appreciate any kind of input!
Edit (2 May 2016)
SHOW CREATE TABLE COMMAND:
CREATE TABLE `table` ( `ID` int(11) NOT NULL AUTO_INCREMENT, `FirstName` varchar(255) NOT NULL, `LastName` varchar(255) NOT NULL, `foreign_key_id` int(11) NOT NULL, `foreign_key_id2` int(11) DEFAULT NULL, PRIMARY KEY (`ID`), KEY `foreign_key_idx` (`foreign_key_id`), KEY `foreign_key_2_idx` (`foreign_key_id2`), KEY `foreign_key_comp_idx ` (`foreign_key_id`,`foreign_key_id2`), CONSTRAINT `foreign_key_idx` FOREIGN KEY (`foreign_key_id`) REFERENCES `table2` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION, CONSTRAINT `foreign_key_2_idx` FOREIGN KEY (`foreign_key_id2`) REFERENCES `table3` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION, ) ENGINE=InnoDB AUTO_INCREMENT=1515998 DEFAULT CHARSET=latin1
Not sure if important, but table2 has 20 records and table3 also 1 million.
The table contains 1M rows of which 250k are returned by the query.
There are 500k rows with
foreign_key_id = 1 and 500k rows with
af.foreign_key_id2 IS NOT NULL.
The query using full table scan (actually doing full index scan on the PRIMARY key in InnoDB) will read all 1M rows sequentially and check each of them for the conditions.
The query using
foreign_key_idx (and it should be the same if it used ‘foreign_key_2_idx’) has to read 500k rows by “random” order (supposing the rows were inserted or assigned the ID randomly) and check the other condition on them. That means the query reads 500k rows by their primary key from the table – but that means it will probably access 100% of data pages of the table. So in the end the query reads half of the index and all of the table – more data to go through in total and the table is accessed by random seeks, not sequentialy. It should be no surprise that such query is slower than full table scan.
The query going by
foreign_key_comp_idx will find the part of the index for
foreign_key_id = 1 by ref access method and then fetch its part satisfying
af.foreign_key_id2 IS NOT NULL by range access – finding 250k rows which it will then read from the table. Again there is a high probability that it will have to read more than 50% of the table pages but it may just be lucky with the data distribution that it is less than 100% and as a plus it does not have to check the conditions again as they are assured by the index.
What surprises me: usual understanding is that the optimizer will not even want to use an index if it will lead to fetching ~20% of the table rows, prefering full table scan instead as it is usually faster. Your
IGNORE INDEX() hints should not change that, thats what
FORCE INDEX() hint is for (telling optimizer that the table scan is extremely costly compared to the suggested index).
But maybe if the insert order was not totally random, the statistics and index dives show that (even when the expected rows numbers do not).
I know I am late to the party here.
The Cardinality of the Foreign Keys is 2, meaning you only have 2 options being indexed – not very useful. If you had 2000 possibilities for those FK’s it would be useful to index.
Every insert, update is requiring an update to the index and in this case is not necessary – every query has an extra step – again not necessary .
Think of Indexing like grouping and sorting, a pointer if you will .. if you only have two options you are grouped in large groups and not sorted much.
I can not say that this is how it is working in the db engine but this is how I like to think about indexes at this point.
If you had 2,000 possible FK values – this is where the advantage would be as the possible shrinkage (segmentation of data you are looking at could be 500 – rather than 500,000).
That is a simplified explanation – but you will notice indexes get updated on inserts (they work really great on sequences especially when you don’t insert in the middle of the sequence but at the end ) .