Question :
I want to JOIN two tables and get the latest result from each one of those two in a single table. I’m kind of concerned about speed too since tables are growing kinda fast. Close to 60-70k records a day. Later on i will go into partitioning but that is another issue. Now i have a main table with the devices
info.
+--------+-----------+---------+
| id | Name | type |
+--------+-----------+---------+
| 1 | Oh | A |
| 2 | This | A |
| 3 | Is | B |
| 4 | Hard | A |
+--------+-----------+---------+
According to the type they have some data in different tables
Type A is
+--------+-----------+------------------+---------+---------+
| id | device_id | stats_time | status | noise |
+--------+-----------+------------------+---------+---------+
| 1 | 1 | 2012-10-23 07:50 | foo | 10 |
| 2 | 1 | 2012-10-23 16:59 | bar | 12 |
| 3 | 2 | 2012-10-23 15:11 | bar | 0 |
| 4 | 4 | 2012-10-23 23:23 | foo | 25 |
+--------+-----------+------------------+---------+---------+
Type B is
+--------+-----------+------------------+---------+---------+
| id | device_id | stats_time | status | signal |
+--------+-----------+------------------+---------+---------+
| 1 | 3 | 2012-10-23 04:50 | foo | 1000 |
| 2 | 3 | 2012-10-23 05:59 | bar | 450 |
| 3 | 3 | 2012-10-23 09:11 | bar | 980 |
| 4 | 3 | 2012-10-23 10:23 | foo | 0 |
+--------+-----------+------------------+---------+---------+
I’ve been busting my head for a query to end up with something like this
+--------+-----------+------------------+---------+---------+---------+
| id | device_id | stats_time | status | signal | noise |
+--------+-----------+------------------+---------+---------+---------+
| 1 | 1 | 2012-10-23 16:59 | bar | 12 | |
| 2 | 2 | 2012-10-23 15:11 | bar | 0 | |
| 3 | 3 | 2012-10-23 10:23 | foo | | 0 |
| 4 | 4 | 2012-10-23 23:23 | foo | 25 | |
+--------+-----------+------------------+---------+---------+---------+
Using the below query is not good since i get two columns of stats_time
SELECT devices.id AS id, A.stats_time , B.stats_time
FROM devices
LEFT JOIN A ON devices.id = A.device_id
LEFT JOIN B ON devices.id = B.device_id
GROUP BY devices.id
Before i ended up using different tables for the device types i used to get the results with the following but ended up going real slow
SELECT *
FROM (
SELECT *
FROM A
ORDER BY stats_time DESC, id ASC
) AS d
RIGHT JOIN devices ON A.device_id = devices.id
GROUP BY devices.id
Answer :
I see it as 2 steps:
- Build tables with just the latest signal (or noise) for each device
JOIN
orUNION
the two tables.
Step 1 is a variant of groupwise max:
SELECT device_id, stats_time, status, noise -- The desired columns
FROM ( SELECT @prev := '' ) init
JOIN ( SELECT
device_id != @prev AS first, -- `device_id` is the 'GROUP BY'
@prev := device_id, -- the 'GROUP BY'
device_id, stats_time, status, noise -- Also the desired columns
FROM TableA -- The table
ORDER BY device_id DESC, -- The 'GROUP BY'
stats_time DESC -- to get latest
) x
WHERE first;
This may be beneficial to performance:
INDEX(device_id, stats_time)
Ditto for TableB
and signal
. Manually run them to see if I got them right.
Your example does not show a case where both signal
and noise
exist for the same device_id
. I will assume that is really the case, hence UNION
:
Step 2:
SELECT device_id, stats_time, status, signal, noise
FROM
( SELECT device_id, stats_time, status, signal, '' AS noise
... (the rest of the signal query)
)
UNION ALL
( SELECT device_id, stats_time, status, '' AS signal, noise
... (the rest of the noise query)
);
I stated in the above comments that this seems to be a case of a supertype-subtype relationship but, since changing your database structure may be a decision that lies outside of the scope of this question, I am going to focus on offering a solution for your current situation.
Then, after a couple of edits, I decided to include some DDL
statements and a brief description of my assumptions about your database structure, hoping that in this way my answer and its involved queries will be more easily comprehended.
Understanding of the situation and hypothetical DDL
As I percive the scenario, type_a
and type_b
may be two different kinds of reading
or measurement
that you are collecting for each device.
This way, although I am not completely sure of the meaning of the type_a.type_a_id
and type_b.type_b_id
columns, I assume that they are some sort of sequential_number
or row_number
or record_identifier
in each respective table. Likewise, the columns type_a.stats_time
and type_b.stats_time
are the exact points in time when a device
posts some sort of reading
.
I have also assumed that type_a.device_id
and type_b.device_id
are FOREIGN KEYS that make reference to the device
table that, in turn, has the column device.device_id
being used as some kind of sequential_number
or row_number
or record_identifier
that you have defined as PRIMARY KEY.
Having described my understanding of the state of affairs, please note that I am not suggesting that this is the optimal structure (since, naturally, I am not familiar with the real scenario, it may lack normalization, integrity, etc.), I am just making some assumptions based on the data samples and queries that you have presented, in order to deliver a possible solution for your specific situation. So, here is the speculative DDL
:
CREATE TABLE device
(
device_id INT NOT NULL AUTO_INCREMENT,
name CHAR(30) NOT NULL,
type CHAR(1) NOT NULL,
PRIMARY KEY (device_id),
UNIQUE INDEX uix_name (name)
);
CREATE TABLE type_a
(
type_a_id INT NOT NULL AUTO_INCREMENT,
device_id INT NOT NULL ,
stats_time DATETIME NOT NULL,
`status` CHAR(10) NOT NULL,
noise INT NOT NULL,
PRIMARY KEY (type_a_id),
CONSTRAINT FOREIGN KEY fk_type_a_device (device_id)
REFERENCES device(device_id),
UNIQUE INDEX `uix_device_id_and_stats_time` (device_id, stats_time)
);
CREATE TABLE type_b
(
type_b_id INT NOT NULL AUTO_INCREMENT,
device_id INT NOT NULL,
stats_time DATETIME NOT NULL,
`status` CHAR(10) NOT NULL,
`signal` INT NOT NULL,
PRIMARY KEY (type_b_id),
CONSTRAINT FOREIGN KEY fk_type_b_device (device_id)
REFERENCES device(device_id),
UNIQUE INDEX `uix_device_id_and_stats_time` (device_id, stats_time)
);
-- Some ‘device’ INSERTS...
INSERT INTO device (name, type) VALUES ('First device', 'A');
INSERT INTO device (name, type) VALUES ('Second device', 'A');
-- ... And then, some ‘type_a’ and ‘type_b’ INSERTS
-- in order to have some sample data for retrieving.
INSERT INTO type_a (device_id, stats_time, `status`, noise)
VALUES (1, STR_TO_DATE('06/01/2015 08:10:01 AM', '%c/%e/%Y %r'), 'Foo', 123);
INSERT INTO type_a (device_id, stats_time, `status`, noise)
VALUES (1, STR_TO_DATE('04/04/2015 03:07:34 PM', '%c/%e/%Y %r'), 'Bar', 456);
INSERT INTO type_b (device_id, stats_time, `status`, `signal`)
VALUES (2, STR_TO_DATE('03/04/2015 02:08:15 PM', '%c/%e/%Y %r'), 'Boo', 789);
INSERT INTO type_b (device_id, stats_time, `status`, `signal`)
VALUES (2, STR_TO_DATE('05/07/2015 04:03:12 PM', '%c/%e/%Y %r'), 'Far', 852);
Initial proposal
And then, here is my first submited query which, while keeping the original idea, has been reformatted and adapted to the DDL
structure that was added later:
SELECT DE.device_id,
COALESCE(TA.type_a_id, TB.type_b_id) AS type_id,
DE.name,
DE.type,
COALESCE(TA.stats_time, TB.stats_time) AS stats_time,
COALESCE(TA.status, TB.status) AS `status`,
COALESCE(TA.noise, 0) AS noise,
COALESCE(TB.signal, 0) AS `signal`
FROM device DE
LEFT OUTER JOIN type_a TA
ON TA.device_id = DE.device_id
LEFT OUTER JOIN type_b TB
ON TB.device_id = DE.device_id
ORDER BY stats_time DESC;
As you can see, I am using the COALESCE() function that only if the column type_a.stats_time
contains a NULL
value, then ‘prints’ the value of the column type_b.stats_time
, and works in a similar way for the type_a.status
and type_b.status
columns.
Subsequent proposal
First method
Now, having reread your question and after some comment interactions, I understand that what you actually want is to obtain a single result set containing only two rows —one row including columns corresponding to the latest type_a.stats_time
value and one row holding columns related to the latest type_b.stats_time
value—. Therefore I suggest you the following query that also retrieves data from the DDL
proposed above:
(SELECT DE.device_id,
TA.type_a_id AS type_id,
DE.name,
DE.type,
TA.stats_time AS stats_time,
TA.status,
TA.noise,
NULL AS `signal`
FROM device DE
JOIN type_a TA
ON TA.device_id = DE.device_id
WHERE TA.stats_time = (SELECT MAX(stats_time)
FROM type_a))
UNION
(SELECT DE.device_id,
TB.type_b_id AS type_id,
DE.name,
DE.type,
TB.stats_time AS stats_time,
TB.status,
NULL AS noise,
TB.signal
FROM device DE
JOIN type_b TB
ON TB.device_id = DE.device_id
WHERE TB.stats_time = (SELECT MAX(stats_time)
FROM type_b))
ORDER BY stats_time DESC;
Note the use of the UNION operator, which in serves the purpose of combining in one single result set the latest row of type_a
(based on the newest stats_time
column value, obtained via the MAX() function within a subquery in the WHERE clause) with the latest row in type_b
(likewise, based on the newest stats_time
column value, also obtained through the MAX() function within a subquery in the WHERE clause).
Second method
You can also try with this alternate query, which sorts each one of the combined result sets based on the stats_time
column in each respective SELECT statement using the ORDER BY and LIMIT clauses.
(SELECT DE.device_id,
TA.type_a_id AS type_id,
DE.name,
DE.type,
TA.stats_time AS stats_time,
TA.status,
TA.noise,
NULL AS `signal`
FROM device DE
JOIN type_a TA
ON TA.device_id = DE.device_id
ORDER BY TA.stats_time DESC LIMIT 1)
UNION
(SELECT DE.device_id,
TB.type_b_id AS type_id,
DE.name,
DE.type,
TB.stats_time AS stats_time,
TB.status,
NULL AS noise,
TB.signal
FROM device DE
JOIN type_b TB
ON TB.device_id = DE.device_id
ORDER BY TB.stats_time DESC LIMIT 1)
ORDER BY stats_time DESC;
Once you compare the performance of all the suggested methods, it will be easy to define which one is the best for your needs. Also, if one of them solves your problem, then you can set it up as a VIEW
, this way future data retrieval will be easier to obtain.
About the speed performance aspect of your queries, you may start refining such matters by paying special attention to the indexes definition in the apropiate columns, e.g. type_a.stats_time
and type_b.stats_time
.