Question :
Table in question:
CREATE TABLE `observations` (
`time_of_observation` datetime NOT NULL,
`station` varchar(5) NOT NULL,
`number` varchar(20) NOT NULL,
`destination` varchar(100) NOT NULL,
`type` varchar(20) NOT NULL,
`departure_time` datetime NOT NULL,
`owner` varchar(20) NOT NULL,
`altered` tinyint(1) NOT NULL DEFAULT '0',
`delay` bigint(20) NOT NULL DEFAULT '0',
`notes` text CHARACTER SET latin1,
`route_text` text,
`info` text CHARACTER SET latin1,
`delay_text` text CHARACTER SET latin1,
KEY `uindex2` (`time_of_observation`,`departure_time`,`number`,`delay`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
With index:
+--------------+------------+----------+--------------+---------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+--------------+------------+----------+--------------+---------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| observations | 1 | uindex2 | 1 | time_of_observation | A | 58588 | NULL | NULL | | BTREE | | |
| observations | 1 | uindex2 | 2 | departure_time | A | 5507365 | NULL | NULL | | BTREE | | |
| observations | 1 | uindex2 | 3 | number | A | 11014731 | NULL | NULL | | BTREE | | |
| observations | 1 | uindex2 | 4 | delay | A | 11014731 | NULL | NULL | | BTREE | | |
+--------------+------------+----------+--------------+---------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
Query:
SELECT time_of_observation as last_seen_on, number, departure_time,
departure_delay as last_known_delay
FROM observations as o1
WHERE NOT EXISTS (
SELECT 1
FROM observations o2
WHERE o1.number = o2.number
AND o1.departure_time = o2.departure_time
AND o1.time_of_observation < o2.time_of_observation
);`
number
and departure_time
uniquely identify an observed entity. Goal of this query is to find for each entity the last observation and the delay seen on this last observation.
This query is currently running for 140 hours (and has not finished) with the table containing ~11 million rows.
See EXPLAIN:
+------+--------------------+-------+-------+---------------+---------+---------+------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+--------------------+-------+-------+---------------+---------+---------+------+----------+--------------------------+
| 1 | PRIMARY | o1 | index | NULL | uindex2 | 86 | NULL | 11014731 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | o2 | index | uindex2 | uindex2 | 86 | NULL | 11014731 | Using where; Using index |
+------+--------------------+-------+-------+---------------+---------+---------+------+----------+--------------------------+
Am I doing something wrong here? Is it possible to optimize this query? Should it be this slow?
Answer :
First of all you need to get rid of the dependent subquery.
For this you could rewrite your query like this:
SELECT o.time_of_observation as last_seen_on, o.number,
o.departure_time, o.departure_delay as last_known_delay
FROM observations as o
JOIN ( SELECT
number,
departure_time
MIN(time_of_observation) AS minobstime
FROM observations
GROUP BY
number,
departure_time
) m ON m.number = o.number
AND m.departure_time = o.departure_time
AND m.minobstime = o.time_of_observation;
or this:
SELECT o1.time_of_observation as last_seen_on, o1.number,
o1.departure_time, o1.departure_delay as last_known_delay
FROM observations as o1
LEFT JOIN observations as o2 ON o1.number = o1.number
AND o1.departure_time = o2.departure_time
AND o1.time_of_observation < o2.time_of_observation
WHERE o2.<any column, preferably your primary key> IS NULL;
Then it would be a good idea if you had a primary key. It can be an auto_increment column, too.
The columns in the index you have are in the wrong order. time_of_observation
should be the last column in the index. First should be the columns where you make an equality comparison, then the ones where you scan a range.
Create an index like (number, departure_time, time_of_observation)
and try one of the two queries from above.
Maybe your server version allows
SELECT MAX(time_of_observation) as last_seen_on,
number,
departure_time,
FIRST_VALUE(departure_delay)
OVER (PARTITION BY number, departure_time
ORDER BY time_of_observation DESC /* , departure_delay DESC */ ) as last_known_delay
FROM observations
GROUP BY number, departure_time
?
This is a “groupwise-max” problem. Your approach will do about 60 trillion operations!
Every InnoDB table should have an explicit PRIMARY KEY
. If this is UNIQUE, then make it the PK:
(`time_of_observation`,`departure_time`,`number`,`delay`)
But, that is not the optimal order, especially for this query (even after fixed by tombom’s first suggestion). It needs an index starting with (number, departure_time, time_of_observation, ...)
. If the above 4 columns are unique, then do
PRIMARY KEY (number, departure_time, time_of_observation, delay)
If not unique, then add an id:
id INT UNSIGNED AUTO_INCREMENT NOT NULL,
...
PRIMARY KEY (number, departure_time, time_of_observation, delay, id),
INDEX(id)