Group data by non-unique keys by distinct time range

Posted on

Question :

I have data in an Oracle table like the following:

PROJ_NBR     STATUS    START_DT        
AL20         AC        1/14/2010 4:31        
AL20         AC        1/14/2010 4:32        
AL20         AC        1/14/2010 4:32        
AL20         DE        1/14/2010 4:34        
AL20         DE        1/14/2010 4:46        
AL20         DE        1/14/2010 4:46        
AL20         DE        1/14/2010 4:46        
AL20         DE        1/14/2010 4:46        
AL20         DE        1/14/2010 4:46        
AL20         DE        1/14/2010 4:46        
AL20         DE        1/14/2010 4:46        
AL20         DE        1/14/2010 4:46        
AL20         DE        1/14/2010 4:46        
AL20         DE        1/14/2010 4:46        
AL20         AC        3/4/2010 4:31        
AL20         AC        3/4/2010 4:39        
AL20         AC        3/21/2010 13:24        
AL20         AC        2/4/2011 13:54        --year changes here
AL20         AC        2/4/2011 14:14        
AL20         AC        2/4/2011 14:27        
AL20         DE        2/11/2011 16:24        

Our table has 2,091,348 rows.

I want data like following…

PROJ_NBR     STATUS    BEGIN                END
AL20         AC        1/14/2010 4:31       1/14/2010 4:32
AL20         DE        1/14/2010 4:34       1/14/2010 4:46
AL20         AC        3/4/2010 4:31        2/4/2011 14:27  --Second time range for AL20, AC
AL20         DE        2/11/2011 16:24      2/11/2011 16:24

Answer :

Below a complete solution for sql-server.

I was trying analytical functions, but didn’t succeed. I’m tempted to repost the question to SO requiring a solution using analytical functions. I think there must be one.

And here is my working solution. I’m proud not to have used cursors.

set language us_english
Set Nocount ON

create table #t( PROJ_NBR char(4),  STATUS char(2), START_DT datetime);

insert into #t values ('AL20',     'AC', '1/14/2010 4:31');  
insert into #t values ('AL20',     'AC', '1/14/2010 4:32');  
insert into #t values ('AL20',     'AC', '1/14/2010 4:32');  
insert into #t values ('AL20',     'DE', '1/14/2010 4:34');  
insert into #t values ('AL20',     'DE', '1/14/2010 4:46');  
insert into #t values ('AL20',     'DE', '1/14/2010 4:46');  
insert into #t values ('AL20',     'DE', '1/14/2010 4:46');  
insert into #t values ('AL20',     'DE', '1/14/2010 4:46');  
insert into #t values ('AL20',     'DE', '1/14/2010 4:46');  
insert into #t values ('AL20',     'DE', '1/14/2010 4:46');  
insert into #t values ('AL20',     'DE', '1/14/2010 4:46');  
insert into #t values ('AL20',     'DE', '1/14/2010 4:46');  
insert into #t values ('AL20',     'DE', '1/14/2010 4:46');  
insert into #t values ('AL20',     'DE', '1/14/2010 4:46');  
insert into #t values ('AL20',     'AC', '3/4/2010 4:31');   
insert into #t values ('AL20',     'AC', '3/4/2010 4:39');   
insert into #t values ('AL20',     'AC', '3/21/2010 13:24'); 
insert into #t values ('AL20',     'AC', '2/4/2011 13:54');  
insert into #t values ('AL20',     'AC', '2/4/2011 14:14');  
insert into #t values ('AL20',     'AC', '2/4/2011 14:27');  
insert into #t values ('AL20',     'DE', '2/11/2011 16:24'); 


select
    distinct PROJ_NBR, STATUS, [start], [end]
from(
select 
*,
(select MIN(START_DT) 
    from #t t1 
    where t.STATUS = t1.STATUS 
    and t.START_DT >= t1.START_DT 
    and not exists ( select * from #t t2 
                    where t.STATUS <> t2.STATUS  
                    and  t.START_DT >= t2.START_DT
                    and t1.START_DT <= t2.START_DT)
) [start],
(select Max(START_DT) 
    from #t t1 
    where t.STATUS = t1.STATUS 
    and t.START_DT <= t1.START_DT 
    and not exists ( select * from #t t2 
                    where t.STATUS <> t2.STATUS  
                    and  t.START_DT <= t2.START_DT
                    and t1.START_DT >= t2.START_DT)
) [end]

from #t t
) t1
order by [start]

I think besides column and table names this can easily adapted to Oracle too.

Oracle Analytics Solution:

SELECT Proj_Nbr, Status, MIN(Start_Dt) AS "BEGIN", MAX(Start_Dt) AS "END"
FROM (
    SELECT Proj_Nbr, Status, Start_Dt, 
       MAX(SectionID) OVER (ORDER BY Start_Dt ROWS UNBOUNDED PRECEDING) SectionID
    FROM
    (
        SELECT Proj_Nbr, Status, Start_Dt, 
           CASE WHEN Proj_Nbr = Lag(Proj_Nbr) OVER (ORDER BY start_dt)
              AND Status = Lag(Status) OVER (ORDER BY start_dt)
                 THEN null
              ELSE rowid
           END SectionID  
        FROM t1
    )
)
GROUP BY Proj_Nbr, Status, SectionID;

Data Generation:

CREATE TABLE t1 (PROJ_NBR VARCHAR2(4),  STATUS VARCHAR2(2), START_DT DATE);

INSERT INTO t1 VALUES ('AL20', 'AC', to_date('01/14/2010 04:31', 'MM/DD/YYYY HH24:MI'));
INSERT INTO t1 VALUES ('AL20', 'AC', to_date('01/14/2010 04:32', 'MM/DD/YYYY HH24:MI'));
INSERT INTO t1 VALUES ('AL20', 'AC', to_date('01/14/2010 04:32', 'MM/DD/YYYY HH24:MI'));
INSERT INTO t1 VALUES ('AL20', 'DE', to_date('01/14/2010 04:34', 'MM/DD/YYYY HH24:MI'));
INSERT INTO t1 VALUES ('AL20', 'DE', to_date('01/14/2010 04:46', 'MM/DD/YYYY HH24:MI'));
INSERT INTO t1 VALUES ('AL20', 'DE', to_date('01/14/2010 04:46', 'MM/DD/YYYY HH24:MI'));
INSERT INTO t1 VALUES ('AL20', 'DE', to_date('01/14/2010 04:46', 'MM/DD/YYYY HH24:MI'));
INSERT INTO t1 VALUES ('AL20', 'DE', to_date('01/14/2010 04:46', 'MM/DD/YYYY HH24:MI'));
INSERT INTO t1 VALUES ('AL20', 'DE', to_date('01/14/2010 04:46', 'MM/DD/YYYY HH24:MI'));
INSERT INTO t1 VALUES ('AL20', 'DE', to_date('01/14/2010 04:46', 'MM/DD/YYYY HH24:MI'));
INSERT INTO t1 VALUES ('AL20', 'DE', to_date('01/14/2010 04:46', 'MM/DD/YYYY HH24:MI'));
INSERT INTO t1 VALUES ('AL20', 'DE', to_date('01/14/2010 04:46', 'MM/DD/YYYY HH24:MI'));
INSERT INTO t1 VALUES ('AL20', 'DE', to_date('01/14/2010 04:46', 'MM/DD/YYYY HH24:MI'));
INSERT INTO t1 VALUES ('AL20', 'DE', to_date('01/14/2010 04:46', 'MM/DD/YYYY HH24:MI'));
INSERT INTO t1 VALUES ('AL20', 'AC', to_date('03/04/2010 04:31', 'MM/DD/YYYY HH24:MI')); 
INSERT INTO t1 VALUES ('AL20', 'AC', to_date('03/04/2010 04:39', 'MM/DD/YYYY HH24:MI')); 
INSERT INTO t1 VALUES ('AL20', 'AC', to_date('03/21/2010 13:24', 'MM/DD/YYYY HH24:MI')); 
INSERT INTO t1 VALUES ('AL20', 'AC', to_date('02/04/2011 13:54', 'MM/DD/YYYY HH24:MI'));
INSERT INTO t1 VALUES ('AL20', 'AC', to_date('02/04/2011 14:14', 'MM/DD/YYYY HH24:MI'));
INSERT INTO t1 VALUES ('AL20', 'AC', to_date('02/04/2011 14:27', 'MM/DD/YYYY HH24:MI'));
INSERT INTO t1 VALUES ('AL20', 'DE', to_date('02/11/2011 16:24', 'MM/DD/YYYY HH24:MI')); 

Here’s my take: (Thanks to Leigh for providing the data generation script)

select proj_nbr, status, min(start_dt) start_dt, end_dt 
from (
    select a.*, (
        select max(p.start_dt) 
        from t1 p 
        where P.PROJ_NBR = a.proj_nbr and p.status = a.status 
        and p.start_dt >= a.start_dt
        and not exists (
            select null from t1 q where q.PROJ_NBR = a.proj_nbr 
            and q.status <> p.status 
            and q.start_dt < p.start_dt and q.start_dt > a.start_dt)
    ) end_dt 
    from t1 a
)
group by proj_nbr, status, end_dt

I think adding the following index will improve performance:

create index on t1 (proj_nbr, status, start_date)

SELECT PROJ_NBR, STATUS, MIN(START_DT) AS _BEGIN, MAX(START_DT) AS _END 
FROM MYTABLE GROUP BY PROJ_NBR, STATUS;

Why the _? Because BEGIN and END are reserved words.

Leave a Reply

Your email address will not be published. Required fields are marked *