Question :
I have a table with 20M rows, and each row has 3 columns: time
, id
, and value
. For each id
and time
, there is a value
for the status. I want to know the values of the last and the next periods for a specific time
and id
, and have the following query to get the values:
update a1
set a1.value_last = b1.value,
a1.value_next = c1.value
from tab1 a1
left join tab1 b1
on a1.id = b1.id
and a1.period = b1.period + 1
left join tab1 c1
on a1.id = c1.id
and a1.period = c1.period - 1
It seems that the query takes forever and the log file increased by more than 10 GB. I’m wondering what’s the most efficient way to write this query? I know using index will speed up the joining process, but how can I reduce the logging?
I’m using SQL Server 2016 on Win10 64bit.
Answer :
update a1
set a1.value_last = LAG(value, 1,0) OVER (partition by id ORDER BY period)
, a1.value_next = LEAD(value, 1,0) OVER (partition by id ORDER BY period)
from tab1 a1
index on id, period
or just use a view – that might surprise you
CREATE VIEW tab1LastNext
AS
select a1.id, a1.period, a1.value
, LAG(value, 1,0) OVER (partition by id ORDER BY period) as value_last
, LEAD(value, 1,0) OVER (partition by id ORDER BY period) as value_next
from tab1 a1;
for logging
need to hope value is not null or this gets messy
select 1;
while @@rowcount > 0
begin
update top (10000) a1
set a1.value_last = LAG(value, 1,0) OVER (partition by id ORDER BY period)
from tab1 a1
where LAG(value, 1,0) OVER (partition by id ORDER BY period) is not null
and LAG(value, 1,0) OVER (partition by id ORDER BY period) != a1.value_last
end
select 1;
while @@rowcount > 0
begin
update top (10000) a1
set a1.value_next = LEAD(value, 1,0) OVER (partition by id ORDER BY period)
from tab1 a1
where LEAD(value, 1,0) OVER (partition by id ORDER BY period) is not null
and LEAD(value, 1,0) OVER (partition by id ORDER BY period) != a1.value_next
end
If I understand the question, you don’t need to update anything — it’s just the means you’re using to support a SELECT statement that isn’t shown here.
In that case, this should get you there without needing an UPDATE:
SELECT id, period, value
, LAG([value], 1, NULL) OVER (PARTITION BY id ORDER BY id, [period])
AS value_last
, LEAD([value], 1, NULL) OVER (PARTITION BY id ORDER BY id, [period])
AS value_next
FROM tab1
This code will work on SQL Server 2012 (Compatibility Level 110) and higher.
With the PARTITION BY id
as part of the LAG and LEAD functions, keep in mind the first row in the partition will have NULL for its value_last
(because there is no previous row in the partition), and the last row in the partition will have NULL for its value_next
(because there’s no next row in the partition). If you want to change those to some other value — maybe zero — that value goes where the NULL is in the LAG([value], 1, NULL)
part of the code.
By using these window functions (like @Jorriss suggests), you’re eliminating the need for an UPDATE statement, and the logging that comes with it.
Take a look at SQL Server Window Functions, specifically LAG
and LEAD
. These functions allow the query to get data in the resultset without self-joins. Here’s an example from the StackOverflow data dump.
SELECT p.ID,
LAG (p.ID, 1) OVER (ORDER BY p.OwnerUserId DESC) AS PreviousID,
LEAD (p.ID, 1) OVER (ORDER BY p.OwnerUserId DESC) AS NextID
FROM Posts p
JOIN Users u ON u.Id = p.OwnerUserId
WHERE DisplayName = 'CecilPhillip'
AND p.PostTypeId = 1
AND p.OwnerUserId = 333082
LAG (p.ID, 1)
will look for the ID
in the previous row.
LEAD (p.ID, 1)
will look for the ID
in the next row.
ID | PreviousID | NextID
----------------------------------
15231460 | NULL | 3037486
3037486 | 15231460 | 10018831
10018831 | 3037486 | 11043750
11043750 | 10018831 | 9577679
9577679 | 11043750 | 9597766
9597766 | 9577679 | 24660128
24660128 | 9597766 | 5722741
5722741 | 24660128 | 18713274
18713274 | 5722741 | 3679697
3679697 | 18713274 | NULL
You’ll notice that if there isn’t a value found it will return a NULL
.