The most efficient way to join huge tables

Posted on

Question :

I have a table with 20M rows, and each row has 3 columns: time, id, and value. For each id and time, there is a value for the status. I want to know the values of the last and the next periods for a specific time and id, and have the following query to get the values:

update a1
set  a1.value_last = b1.value,   
     a1.value_next = c1.value
from tab1 a1
left join tab1 b1
on a1.id = b1.id
and a1.period = b1.period + 1
left join tab1 c1
on a1.id = c1.id
and a1.period = c1.period - 1

It seems that the query takes forever and the log file increased by more than 10 GB. I’m wondering what’s the most efficient way to write this query? I know using index will speed up the joining process, but how can I reduce the logging?

I’m using SQL Server 2016 on Win10 64bit.

Answer :

update a1
set  a1.value_last = LAG(value, 1,0)  OVER (partition by id ORDER BY period)
  ,  a1.value_next = LEAD(value, 1,0) OVER (partition by id ORDER BY period)
from tab1 a1

index on id, period

or just use a view – that might surprise you

CREATE VIEW tab1LastNext  
AS  
select a1.id, a1.period, a1.value
     , LAG(value, 1,0)  OVER (partition by id ORDER BY period) as value_last
     , LEAD(value, 1,0) OVER (partition by id ORDER BY period) as value_next
from tab1 a1;

for logging
need to hope value is not null or this gets messy

select 1;
while @@rowcount > 0
begin 
  update top (10000) a1
  set a1.value_last = LAG(value, 1,0) OVER (partition by id ORDER BY period)
  from tab1 a1 
  where LAG(value, 1,0) OVER (partition by id ORDER BY period) is not null 
    and LAG(value, 1,0) OVER (partition by id ORDER BY period) != a1.value_last
end 
select 1;
while @@rowcount > 0
begin 
  update top (10000) a1
  set a1.value_next = LEAD(value, 1,0) OVER (partition by id ORDER BY period)
  from tab1 a1 
  where LEAD(value, 1,0) OVER (partition by id ORDER BY period) is not null 
    and LEAD(value, 1,0) OVER (partition by id ORDER BY period) != a1.value_next
end 

If I understand the question, you don’t need to update anything — it’s just the means you’re using to support a SELECT statement that isn’t shown here.

In that case, this should get you there without needing an UPDATE:

SELECT id, period, value
, LAG([value], 1, NULL) OVER (PARTITION BY id ORDER BY id, [period]) 
    AS value_last
, LEAD([value], 1, NULL) OVER (PARTITION BY id ORDER BY id, [period]) 
    AS value_next
FROM tab1

This code will work on SQL Server 2012 (Compatibility Level 110) and higher.

With the PARTITION BY idas part of the LAG and LEAD functions, keep in mind the first row in the partition will have NULL for its value_last (because there is no previous row in the partition), and the last row in the partition will have NULL for its value_next (because there’s no next row in the partition). If you want to change those to some other value — maybe zero — that value goes where the NULL is in the LAG([value], 1, NULL) part of the code.

By using these window functions (like @Jorriss suggests), you’re eliminating the need for an UPDATE statement, and the logging that comes with it.

Take a look at SQL Server Window Functions, specifically LAG and LEAD. These functions allow the query to get data in the resultset without self-joins. Here’s an example from the StackOverflow data dump.

SELECT p.ID,
       LAG (p.ID, 1) OVER (ORDER BY p.OwnerUserId DESC) AS PreviousID,
       LEAD (p.ID, 1) OVER (ORDER BY p.OwnerUserId DESC) AS NextID
FROM   Posts p
JOIN   Users u ON u.Id = p.OwnerUserId
WHERE  DisplayName = 'CecilPhillip'
AND    p.PostTypeId = 1
AND    p.OwnerUserId = 333082 

LAG (p.ID, 1) will look for the ID in the previous row.
LEAD (p.ID, 1) will look for the ID in the next row.

ID       | PreviousID | NextID
----------------------------------
15231460 |       NULL |   3037486
3037486  |   15231460 |  10018831
10018831 |    3037486 |  11043750
11043750 |   10018831 |   9577679
 9577679 |   11043750 |   9597766
 9597766 |    9577679 |  24660128
24660128 |    9597766 |   5722741
 5722741 |   24660128 |  18713274
18713274 |    5722741 |   3679697
 3679697 |   18713274 |      NULL

You’ll notice that if there isn’t a value found it will return a NULL.

Leave a Reply

Your email address will not be published. Required fields are marked *