Question :
I have a subquery with a column named code
that only has 1 and 0 values and an ID
column because I query a lot of codes for each ID. I want to group by said column (and by ID
) and by ocurrence but want to ignore groups with less than 3 records, so that bigger groups get merged.
I use ROW_NUMBER() - ROW_NUMBER() over (partition ...)
to generate a group number which I use to group records by, but sometimes the records get mixed up caused by a coincidence in the sequence of the row number and the partitioned row number. This is an example
code | row_number | row_number_partition_by_code | group_number
------------------------------------------------------------------
0 | 1 | 1 | 0
0 | 2 | 2 | 0
1 | 3 | 1 | 2
1 | 4 | 2 | 2
0 | 5 | 3 | 2
1 | 6 | 3 | 3
0 | 7 | 4 | 3
0 | 8 | 5 | 3
0 | 9 | 6 | 3
0 | 10 | 7 | 3
1 | 11 | 4 | 7
0 | 12 | 1 | 0
As you can see, some groups contain both codes!
After this, I do
GROUP BY group_number
HAVING COUNT(1) > 2
to eliminate small groups, and generate another group number to merge adjacent groups.
How can I group these values homogeneously?
More Details
All records are ordered by ID
and by date (the later in descending order) in every ROW_NUMBER()
.
The final result I expect is the time in minutes of the latest code group for each ID
, and merging groups that are separated by less than 3 minutes.
I’m more interested in the groups with the value 1, so if the latest code group is a 0 ( considering it is not a couple minutes long ) then ignore that ID
completely.
Usually, there is one record per minute. Sometimes it can take hours without a record for any ID
Answer :
I’ve set up next example according to your data and your explanations:
All records are ordered by ID and by date (the later in descending order)
CREATE TABLE tbl (id int, code int, dt datetime);
INSERT INTO tbl VALUES
(1, 0, '20180112 10:10:40'),
(1, 0, '20180112 10:09:10'),
(1, 1, '20180112 10:08:15'),
(1, 1, '20180112 10:06:00'),
(1, 0, '20180112 10:05:00'),
(1, 1, '20180112 10:04:55'),
(1, 0, '20180112 10:03:40'),
(1, 0, '20180112 10:03:30'),
(1, 0, '20180112 10:03:20'),
(1, 0, '20180112 10:02:10'),
(1, 1, '20180112 10:01:30'),
(1, 0, '20180112 10:01:15');
IMHO this is more about GROUPING AND WINDOW than GAPS AND ISLANDS. You can set the groups in this way:
SELECT id, code, dt,
SUM(rst) OVER (PARTITION BY id ORDER BY id, dt DESC) grp
FROM (SELECT id, code, dt,
IIF(COALESCE(LAG(code) OVER (PARTITION BY id ORDER BY id, dt DESC), -1) <> code, 1, 0) rst
FROM tbl) x;
id | code | dt | grp -: | ---: | :------------------ | --: 1 | 0 | 12/01/2018 10:10:40 | 1 1 | 0 | 12/01/2018 10:09:10 | 1 1 | 1 | 12/01/2018 10:08:15 | 2 1 | 1 | 12/01/2018 10:06:00 | 2 1 | 0 | 12/01/2018 10:05:00 | 3 1 | 1 | 12/01/2018 10:04:55 | 4 1 | 0 | 12/01/2018 10:03:40 | 5 1 | 0 | 12/01/2018 10:03:30 | 5 1 | 0 | 12/01/2018 10:03:20 | 5 1 | 0 | 12/01/2018 10:02:10 | 5 1 | 1 | 12/01/2018 10:01:30 | 6 1 | 0 | 12/01/2018 10:01:15 | 7
Once you have the groups you can calculate time difference of each group:
SELECT id, code, grp, MAX(dt) maxDT, MIN(dt) minDT,
DATEDIFF(minute, MIN(dt), MAX(dt)) as minutes
FROM (
SELECT id, code, dt,
SUM(rst) OVER (PARTITION BY id ORDER BY id, dt DESC) grp
FROM (SELECT id, code, dt,
IIF(COALESCE(LAG(code) OVER (PARTITION BY id ORDER BY id, dt DESC), -1) <> code, 1, 0) rst
FROM tbl) x
) y
GROUP BY id, code, grp
ORDER BY id, grp;
id | code | grp | maxDT | minDT | minutes -: | ---: | --: | :------------------ | :------------------ | ------: 1 | 0 | 1 | 12/01/2018 10:10:40 | 12/01/2018 10:09:10 | 1 1 | 1 | 2 | 12/01/2018 10:08:15 | 12/01/2018 10:06:00 | 2 1 | 0 | 3 | 12/01/2018 10:05:00 | 12/01/2018 10:05:00 | 0 1 | 1 | 4 | 12/01/2018 10:04:55 | 12/01/2018 10:04:55 | 0 1 | 0 | 5 | 12/01/2018 10:03:40 | 12/01/2018 10:02:10 | 1 1 | 1 | 6 | 12/01/2018 10:01:30 | 12/01/2018 10:01:30 | 0 1 | 0 | 7 | 12/01/2018 10:01:15 | 12/01/2018 10:01:15 | 0
Then if I understand your comment correctly:
I’m more interested in the groups with the value 1, so if the latest code group is a 0 ( considering it is not a couple minutes long ) then ignore that ID completely.
Rows can be filtered and get last row using:
SELECT TOP 1
id, code, grp, MAX(dt) maxDT, MIN(dt) minDT,
DATEDIFF(minute, MIN(dt), MAX(dt)) as minutes
FROM (
SELECT id, code, dt,
SUM(rst) OVER (PARTITION BY id ORDER BY id, dt DESC) grp
FROM (SELECT id, code, dt,
IIF(COALESCE(LAG(code) OVER (PARTITION BY id ORDER BY id, dt DESC), -1) <> code, 1, 0) rst
FROM tbl) x
) y
GROUP BY id, code, grp
HAVING (code = 0 AND DATEDIFF(minute, MIN(dt), MAX(dt)) >= 2)
OR code = 1
ORDER BY id, grp DESC;
id | code | grp | maxDT | minDT | minutes -: | ---: | --: | :------------------ | :------------------ | ------: 1 | 1 | 6 | 12/01/2018 10:01:30 | 12/01/2018 10:01:30 | 0
dbfiddle here