Question :
I have the SQL statement:
SELECT erp_orders_id
FROM erp_orders o
WHERE o.orders_export_time >= '2015-09-20'
ORDER BY o.erp_orders_id ASC
LIMIT 1
where erp_orders_id
is primary key of type int. and orders_export_time
is timestamp
.
I’m trying to optimize it by creating an index on both orders_export_time
and erp_orders_id
:
ALTER TABLE `erp_orders`
ADD KEY `IDX_ORDERID_EXPORTTIME`(`erp_orders_id`,`orders_export_time`) USING BTREE;
However, EXPLAIN
shows it chooses PRIMARY
to filter data:
+--------------+-----------------------+-----------------+----------------+-------------------------+---------------+-------------------+---------------+----------------+-----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+--------------+-----------------------+-----------------+----------------+-------------------------+---------------+-------------------+---------------+----------------+-----------------+
| 1 | SIMPLE | o | index | orders_export_time | PRIMARY | 4 | | 5 | Using where |
+--------------+-----------------------+-----------------+----------------+-------------------------+---------------+-------------------+---------------+----------------+-----------------+
QUESTIONS
1 : I don’t understand why primary is chosen here?
So I tried to force using index IDX_ORDERID_EXPORTTIME
and get the following plan which seems good to me:
+--------------+-----------------------+-----------------+----------------+-------------------------+------------------------+-------------------+---------------+----------------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+--------------+-----------------------+-----------------+----------------+-------------------------+------------------------+-------------------+---------------+----------------+--------------------------+
| 1 | SIMPLE | o | index | | IDX_ORDERID_EXPORTTIME | 8 | | 1 | Using where; Using index |
+--------------+-----------------------+-----------------+----------------+-------------------------+------------------------+-------------------+---------------+----------------+--------------------------+
With the plan above, I assume it would be much faster than the first query. However, when I execute it, it runs pretty slow, even slower than the first one. So
2 : Why is this query so slow even with execution plan indicates that it scanned only 1 row?
3 : How am I supposed to optimize this query?
Answer :
No single index (composite or not) will be optimal
WHERE x > ...
ORDER BY y
You can provide both of these, and let the Optimizer pick:
INDEX(x) -- filters out unwanted rows, but you are left with a sort
INDEX(y) -- avoids sort, but does no filtering
On the other hand, this would be a “covering” index for that specific query:
INDEX(orders_export_time, erp_orders_id)
and might be better than putting them in the opposite order.
Q1: I don’t understand why primary is chosen here?
The index you made helps neither the WHERE
or the ORDER BY
due to the columns in the wrong order.
Since you selected only the erp_orders_id
, the PRIMARY KEY
was chosen.
Q2: Why is this query so slow even with execution plan indicates that it scanned only 1 row?
Since you forced the index, the entire index was scanned until one row was found that matched o.orders_export_time >= '2015-09-20'
. If you have rows from earlier months, even earlier years, those rows would be read until '2015-09-20'
was finally found.
Q3 how am I supposed to optimize this query?
As @dnoeth said in the comment, reverse the column order in the index.
ALTER TABLE `erp_orders`
ADD KEY `IDX_ORDERID_EXPORTTIME`(`orders_export_time`,`erp_orders_id`) USING BTREE;
Effective indexes have in the INDEX KEY COLUMNS
the column names that are used for row restriction and processing; those columns in the WHERE, JOIN, ORDER BY, GROUP BY
:
This would be from your example orders_export_time
.
If more than 1 column in needed here, they should be added in order from top to bottom by those that will be the MOST restrictive – the column that will reduce your overall results by the largest amount of records. For example, if you had columns FIRST_NAME
and LAST_NAME
in your key columns, LAST_NAM
E would most likely go first as it would reduce your results more than a first name would.
Then, in the INCLUDED COLUMNS
tab, you add your column names that are in your SELECT. (This is why trying to index a SELECT * FROM
could never truly be indexed well. It’d just be too large and inefficient).
This would be from your example erp_orders_id
.
Always keep in mind though that the PRIMARY KEY CLUSTERED INDEX
on your table is always already automatically included by default in every other non-clustered index’s INCLUDED COLUMNS
, so other than helping to visualize all pieces of an index, if “erp_orders_id
” is your PRIMARY CLUSTERED INDEX
on the table already, putting it into the include is not necessary. Adding your PRIMARY CLUSTERED INDEX
column name into another non-clustered index’s INDEX KEY COLUMNS
is not recommended, bloats your table, and increases your index/statistics maintenance.
Index FILLFACTOR
is another topic, but should be considered when creating any new index.
CREATE NONCLUSTERED INDEX [ name your index here ] ON [dbo].[erp_orders]
(
[orders_export_time] ASC
)
INCLUDE ([erp_orders_id])
WITH(FILLFACTOR = ?)