How to optimize SQL with WHERE and ORDER BY

Posted on

Question :

I have the SQL statement:

SELECT erp_orders_id
FROM erp_orders o
WHERE o.orders_export_time >= '2015-09-20'
ORDER BY o.erp_orders_id ASC
LIMIT 1

where erp_orders_id is primary key of type int. and orders_export_time is timestamp.
I’m trying to optimize it by creating an index on both orders_export_time and erp_orders_id:

ALTER TABLE `erp_orders` 
ADD KEY `IDX_ORDERID_EXPORTTIME`(`erp_orders_id`,`orders_export_time`) USING BTREE;

However, EXPLAIN shows it chooses PRIMARY to filter data:

+--------------+-----------------------+-----------------+----------------+-------------------------+---------------+-------------------+---------------+----------------+-----------------+
| id           | select_type           | table           | type           | possible_keys           | key           | key_len           | ref           | rows           | Extra           |
+--------------+-----------------------+-----------------+----------------+-------------------------+---------------+-------------------+---------------+----------------+-----------------+
| 1            | SIMPLE                | o               | index          | orders_export_time      | PRIMARY       | 4                 |               | 5              | Using where     |
+--------------+-----------------------+-----------------+----------------+-------------------------+---------------+-------------------+---------------+----------------+-----------------+

QUESTIONS

1 : I don’t understand why primary is chosen here?

So I tried to force using index IDX_ORDERID_EXPORTTIME and get the following plan which seems good to me:

+--------------+-----------------------+-----------------+----------------+-------------------------+------------------------+-------------------+---------------+----------------+--------------------------+
| id           | select_type           | table           | type           | possible_keys           | key                    | key_len           | ref           | rows           | Extra                    |
+--------------+-----------------------+-----------------+----------------+-------------------------+------------------------+-------------------+---------------+----------------+--------------------------+
| 1            | SIMPLE                | o               | index          |                         | IDX_ORDERID_EXPORTTIME | 8                 |               | 1              | Using where; Using index |
+--------------+-----------------------+-----------------+----------------+-------------------------+------------------------+-------------------+---------------+----------------+--------------------------+

With the plan above, I assume it would be much faster than the first query. However, when I execute it, it runs pretty slow, even slower than the first one. So

2 : Why is this query so slow even with execution plan indicates that it scanned only 1 row?

3 : How am I supposed to optimize this query?

Answer :

No single index (composite or not) will be optimal

WHERE x > ...
ORDER BY y

You can provide both of these, and let the Optimizer pick:

INDEX(x) -- filters out unwanted rows, but you are left with a sort
INDEX(y) -- avoids sort, but does no filtering

On the other hand, this would be a “covering” index for that specific query:

INDEX(orders_export_time, erp_orders_id)

and might be better than putting them in the opposite order.

Q1: I don’t understand why primary is chosen here?

The index you made helps neither the WHERE or the ORDER BY due to the columns in the wrong order.

Since you selected only the erp_orders_id, the PRIMARY KEY was chosen.

Q2: Why is this query so slow even with execution plan indicates that it scanned only 1 row?

Since you forced the index, the entire index was scanned until one row was found that matched o.orders_export_time >= '2015-09-20'. If you have rows from earlier months, even earlier years, those rows would be read until '2015-09-20' was finally found.

Q3 how am I supposed to optimize this query?

As @dnoeth said in the comment, reverse the column order in the index.

ALTER TABLE `erp_orders` 
ADD KEY `IDX_ORDERID_EXPORTTIME`(`orders_export_time`,`erp_orders_id`) USING BTREE;

Effective indexes have in the INDEX KEY COLUMNS the column names that are used for row restriction and processing; those columns in the WHERE, JOIN, ORDER BY, GROUP BY:

This would be from your example orders_export_time.

If more than 1 column in needed here, they should be added in order from top to bottom by those that will be the MOST restrictive – the column that will reduce your overall results by the largest amount of records. For example, if you had columns FIRST_NAME and LAST_NAME in your key columns, LAST_NAME would most likely go first as it would reduce your results more than a first name would.

Then, in the INCLUDED COLUMNS tab, you add your column names that are in your SELECT. (This is why trying to index a SELECT * FROM could never truly be indexed well. It’d just be too large and inefficient).

This would be from your example erp_orders_id.

Always keep in mind though that the PRIMARY KEY CLUSTERED INDEX on your table is always already automatically included by default in every other non-clustered index’s INCLUDED COLUMNS, so other than helping to visualize all pieces of an index, if “erp_orders_id” is your PRIMARY CLUSTERED INDEX on the table already, putting it into the include is not necessary. Adding your PRIMARY CLUSTERED INDEX column name into another non-clustered index’s INDEX KEY COLUMNS is not recommended, bloats your table, and increases your index/statistics maintenance.

Index FILLFACTOR is another topic, but should be considered when creating any new index.

CREATE NONCLUSTERED INDEX [ name your index here ] ON [dbo].[erp_orders]
(
    [orders_export_time] ASC
)
INCLUDE ([erp_orders_id]) 
WITH(FILLFACTOR = ?)

Leave a Reply

Your email address will not be published. Required fields are marked *