Is it possible to optimize this query? Currently taking days (no sign of stopping) on ~11 million records

Posted on

Question :

Table in question:

CREATE TABLE `observations` (
  `time_of_observation` datetime NOT NULL,
  `station` varchar(5) NOT NULL,
  `number` varchar(20) NOT NULL,
  `destination` varchar(100) NOT NULL,
  `type` varchar(20) NOT NULL,
  `departure_time` datetime NOT NULL,
  `owner` varchar(20) NOT NULL,
  `altered` tinyint(1) NOT NULL DEFAULT '0',
  `delay` bigint(20) NOT NULL DEFAULT '0',
  `notes` text CHARACTER SET latin1,
  `route_text` text,
  `info` text CHARACTER SET latin1,
  `delay_text` text CHARACTER SET latin1,
  KEY `uindex2` (`time_of_observation`,`departure_time`,`number`,`delay`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

With index:

+--------------+------------+----------+--------------+---------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table        | Non_unique | Key_name | Seq_in_index | Column_name         | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+--------------+------------+----------+--------------+---------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| observations |          1 | uindex2  |            1 | time_of_observation | A         |       58588 |     NULL | NULL   |      | BTREE      |         |               |
| observations |          1 | uindex2  |            2 | departure_time      | A         |     5507365 |     NULL | NULL   |      | BTREE      |         |               |
| observations |          1 | uindex2  |            3 | number              | A         |    11014731 |     NULL | NULL   |      | BTREE      |         |               |
| observations |          1 | uindex2  |            4 | delay               | A         |    11014731 |     NULL | NULL   |      | BTREE      |         |               |
+--------------+------------+----------+--------------+---------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+

Query:

SELECT time_of_observation as last_seen_on, number, departure_time,
       departure_delay as last_known_delay
FROM observations as o1
WHERE NOT EXISTS (
  SELECT 1 
  FROM observations o2
  WHERE o1.number = o2.number
  AND o1.departure_time = o2.departure_time
  AND o1.time_of_observation < o2.time_of_observation
);`

number and departure_time uniquely identify an observed entity. Goal of this query is to find for each entity the last observation and the delay seen on this last observation.

This query is currently running for 140 hours (and has not finished) with the table containing ~11 million rows.

See EXPLAIN:

+------+--------------------+-------+-------+---------------+---------+---------+------+----------+--------------------------+
| id   | select_type        | table | type  | possible_keys | key     | key_len | ref  | rows     | Extra                    |
+------+--------------------+-------+-------+---------------+---------+---------+------+----------+--------------------------+
|    1 | PRIMARY            | o1    | index | NULL          | uindex2 | 86      | NULL | 11014731 | Using where; Using index |
|    2 | DEPENDENT SUBQUERY | o2    | index | uindex2       | uindex2 | 86      | NULL | 11014731 | Using where; Using index |
+------+--------------------+-------+-------+---------------+---------+---------+------+----------+--------------------------+

Am I doing something wrong here? Is it possible to optimize this query? Should it be this slow?

Answer :

First of all you need to get rid of the dependent subquery.
For this you could rewrite your query like this:

SELECT o.time_of_observation as last_seen_on, o.number,
       o.departure_time, o.departure_delay as last_known_delay
FROM observations as o
JOIN (  SELECT
        number,
        departure_time
        MIN(time_of_observation) AS minobstime
        FROM observations
        GROUP BY 
        number,
        departure_time
) m ON m.number = o.number
   AND m.departure_time = o.departure_time
   AND m.minobstime = o.time_of_observation;

or this:

SELECT o1.time_of_observation as last_seen_on, o1.number,
       o1.departure_time, o1.departure_delay as last_known_delay
FROM observations as o1
LEFT JOIN observations as o2 ON o1.number = o1.number
  AND o1.departure_time = o2.departure_time
  AND o1.time_of_observation < o2.time_of_observation
WHERE o2.<any column, preferably your primary key> IS NULL;

Then it would be a good idea if you had a primary key. It can be an auto_increment column, too.

The columns in the index you have are in the wrong order. time_of_observation should be the last column in the index. First should be the columns where you make an equality comparison, then the ones where you scan a range.

Create an index like (number, departure_time, time_of_observation) and try one of the two queries from above.

Maybe your server version allows

SELECT MAX(time_of_observation) as last_seen_on, 
       number, 
       departure_time, 
       FIRST_VALUE(departure_delay) 
             OVER (PARTITION BY number, departure_time 
                   ORDER BY time_of_observation DESC /* , departure_delay DESC */ ) as last_known_delay
FROM observations
GROUP BY number, departure_time

?

This is a “groupwise-max” problem. Your approach will do about 60 trillion operations!

Every InnoDB table should have an explicit PRIMARY KEY. If this is UNIQUE, then make it the PK:

(`time_of_observation`,`departure_time`,`number`,`delay`)

But, that is not the optimal order, especially for this query (even after fixed by tombom’s first suggestion). It needs an index starting with (number, departure_time, time_of_observation, ...). If the above 4 columns are unique, then do

PRIMARY KEY (number, departure_time, time_of_observation, delay)

If not unique, then add an id:

id  INT UNSIGNED  AUTO_INCREMENT  NOT NULL,
...
PRIMARY KEY (number, departure_time, time_of_observation, delay, id),
INDEX(id)

Leave a Reply

Your email address will not be published. Required fields are marked *