Joining two tables by the nearest match

Posted on

Question :

General problem: I want to do a left join on two huge tables, but there is no matching key – I want to join the left table with the “nearest” row in the right table. In this example I want to join using a timestamp, but I think one would have a similar (though more complex) problem with geographical coordinates.

To simplify a lot, I have one transaction table with a primary key id, timestamp, currency and amount – and one currency rate table with id, timestamp, currency and rate. Those tables are independent from each other, and they contain a lot of rows; assume that for some periods there are many transactions between each update of the currency rate, while in other periods there are many updates of the currency rates for each transaction. I want to join them so that there is one row for each transaction, with the closest matching currency rate. I have indexes on the timestamps.

So, here is a brief example of what I have and what I need:

transaction:
+---------------------+--------+
| timestamp           | amount |
+---------------------+--------+
| 2017-07-31 00:07:05 | 173.17 |
| 2017-08-02 15:29:11 | 136.57 |
| 2017-08-05 03:45:24 |  81.27 |
| 2017-08-05 03:46:47 |   48.1 |
| 2017-08-05 03:47:38 |  35.21 |
+---------------------+--------+

rate:
+---------------------+-------------------+
| timestamp           | rate              |
+---------------------+-------------------+
| 2017-07-31 00:04:49 |            9.2923 |
| 2017-07-31 01:37:59 | 9.313423370522607 |
| 2017-08-01 08:07:59 |             9.325 |
| 2017-08-01 16:52:23 |            9.3542 |
| 2017-08-01 21:07:09 | 9.357076262185192 |
| 2017-08-02 15:07:33 |  9.34936993421895 |
| 2017-08-02 17:48:45 | 9.357217848393876 |
| 2017-08-04 04:33:51 |  9.38690807898736 |
| 2017-08-04 08:13:01 | 9.383765889641701 |
| 2017-08-06 03:45:03 | 9.118193727124817 |
| 2017-08-06 04:15:01 | 9.353042966450854 |
| 2017-08-06 05:23:29 | 9.353042966450854 |
+---------------------+-------------------+

Want something like this:
+---------------------+--------+--------+
| timestamp           | amount | rate   |
+---------------------+--------+--------+
| 2017-07-31 00:07:05 | 173.17 | 9.2923 |
| 2017-08-02 15:29:11 | 136.57 | 9.3493 |
| 2017-08-05 03:45:24 |  81.27 | 9.3869 |
| 2017-08-05 03:46:47 |   48.1 | 9.3869 |
| 2017-08-05 03:47:38 |  35.21 | 9.3869 |
+---------------------+--------+--------+

(The production data is much more dense than this)

Post-edit: I did try to use a subquery, it failed spectacularly but probably because of a mistake from my side. Subquery seems to be the best way to solve this, so I’ve moved it from the question side into an answer.

Answer :

Create a table with the UNION the two tables; ORDER BY the timestamp; create an AUTO_INCREMENT.

Now “nearby” has become “adjacent”.

Use a self join to compare id to id+1 to look at adjacent items.

I want to join the left table with the “nearest” row in the right table.

This is a KNN problem. For it you’ll likely want a real database. I highly suggest you migrate to PostgreSQL and use PostGIS, where this can be solved using an index. Your answer has 0 to do with your problem. Your question is,

by the nearest match

Your answer is only looking at one way directionality, and it’s only doing it for one row not a set: it’s a correlated sub-query. You simply need to change course, or bounty a developer to include KNN in MySQL. MySQL is using Boost for GIS, so maybe someday someone will jump nearest(Point const &, unsigned) in the code base. I checked for ya, it’s not there and there is no professional way to do what you want at any scale.

MySQL’s GIS offering is the worst out of all the majors players. No one uses it. To give you an idea of scale,

  • MySQL has 225 questions on GIS.SE.
  • PostGIS has 5,771

It’s a marketing point to say it supports a spatial index. See also

BTW, PostGIS also supports KNN on timestampg using btree_gist

Think I figured out of this – it seems quite easy to fix this through a subquery:

SELECT
    transaction.*,
(SELECT rate
 FROM rate
 WHERE rate.timestamp<transaction.timestamp and rate.currency=transaction.currency
 ORDER by created_at DESC
 LIMIT 1) as rate
FROM transaction;

This is simplified, in the production table I also use a currency code in the join. At first attempt, this abruptly grinded to a halt because the transaction table contained currencies not covered in the rates table.

It should also be trivial to fix this by making a loop in an external script, i.e. the transaction table could contain an extra redundant helper column containing the primary key in the rate table, this could be inserted as the transactions are made or by some batch job. (credits to @steve)

Leave a Reply

Your email address will not be published. Required fields are marked *