Question:
I got a bit stuck on performance, so I thought I would ask for some help. I have a working query, but unfortunately it feels rather slow and I know it is not as performant as it could (easily?) be made. The task is to delete rows from a table containing around one million rows by joining to another table that likewise holds around a million rows of data (the two tables have roughly the same number of rows). After removing rows from SomeList there are about 1 % fewer rows left.
The table structure is as follows
CREATE TABLE SomeList
(
Id INT IDENTITY(1,1) PRIMARY KEY,
TimeData DateTime NOT NULL
);
CREATE TABLE SomeListAuxData
(
Id INT IDENTITY(1,1) PRIMARY KEY,
CountData INT NOT NULL,
SomeListId INT FOREIGN KEY REFERENCES SomeList(Id) ON DELETE CASCADE
);
And the delete query is as follows:
DELETE
FROM SomeList
WHERE TimeData < @someTime AND Id = @someListId AND
(
SELECT COUNT(*)
FROM SomeListAuxData
WHERE SomeList.Id = SomeListAuxData.SomeListId
) > 0
AND
(
SELECT MAX(CountData)
FROM SomeListAuxData
WHERE SomeList.Id = SomeListAuxData.SomeListId
) < @someValue;
Specifically, this seems wasteful since I'm doing two subqueries, but I'm not sure how to handle the COUNT(*) and MAX(CountData) parts if I tried a join instead, for instance.
This query is issued from program code in which the someListIds are looped over with the same @someValue (e.g. 2000) and @someTime (e.g. '2013-11-07 09:00:00.000') values, so that too is rather slow. The query could be done for all the SomeList rows at once.
<edit: Also, I just learned that the calling code has a list of pairs of the form (@someListId, @someDate), which means the date isn't a constant. The input comes from the user interface, and typically there may be even hundreds of such pairs.
Also, maybe scheduling a job at night would do the trick. Though I'll need to check Craig's advice, which seems to be valid too. 🙂
<edit: I clarified what I meant by the "1 %" in my question. The point I wanted to make was that I don't think the data warrants collecting the non-removable rows into a #temptable, truncating SomeList and then moving the data from the #temptable back to SomeList. Craig's good point (partially) answered a different question, but it may nevertheless be applicable. Sorry for my sloppy writing.
<edit 2013-11-08: The join conditions corrected as suggested in the comments.
Answer:
You can create an inline function that returns only those rows which are to be deleted:
CREATE FUNCTION [uf_deletable_SomeLists]
(
@someTime DATETIME,
@someValue INT
)
RETURNS TABLE
AS
RETURN
(
SELECT sl.[Id]
FROM [SomeList] AS sl
INNER JOIN [SomeListAuxData] AS sla
ON sl.[Id] = sla.[SomeListId]
AND sl.[TimeData] < @someTime
GROUP BY sl.[Id]
HAVING MAX(sla.[CountData]) < @someValue
)
If you call the function on the following table setup
SELECT * FROM [SomeList]
Id TimeData
----------- -----------------------
2 2013-11-06 16:55:50.280
3 2013-11-06 16:55:56.667
4 2013-11-06 16:56:24.217
SELECT * FROM [SomeListAuxData]
Id CountData SomeListId
----------- ----------- -----------
2 1 2
4 5 2
3 5 4
5 10 4
the result would look like this
SELECT * FROM [uf_deletable_SomeLists]('20140101', 10)
Id
-----------
2
Now you can combine the inline function with a delete statement using a join
DELETE [SomeList]
FROM [SomeList] AS sl
INNER JOIN [uf_deletable_SomeLists](@someTime, @someValue) AS d
ON sl.[Id] = d.[Id]
AND sl.[Id] = @someListId
If you want, you can remove the @someListId filter; the delete then covers the whole SomeList table.
Or you can add a second join to a temporary table containing the SomeList Ids you want to delete, as sketched below.
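A minimal sketch of that second option, assuming a hypothetical #SomeListIds temp table that the calling code fills with the candidate Ids (the parameter values are just the examples from the question):
-- Assumed parameter values; in practice these come from the caller
DECLARE @someTime DATETIME = '2013-11-07 09:00:00.000';
DECLARE @someValue INT = 2000;

-- Hypothetical temp table filled by the calling code with the Ids to consider
CREATE TABLE #SomeListIds (Id INT PRIMARY KEY);
INSERT INTO #SomeListIds (Id) VALUES (2), (4);

-- Delete only the rows that both qualify via the function and appear in #SomeListIds
DELETE sl
FROM [SomeList] AS sl
INNER JOIN [uf_deletable_SomeLists](@someTime, @someValue) AS d
    ON sl.[Id] = d.[Id]
INNER JOIN #SomeListIds AS ids
    ON sl.[Id] = ids.[Id];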
You can include the percentage statement mentioned by Craig Efrein as well…
DELETE TOP(1) PERCENT [SomeList]
FROM [SomeList] AS sl (...)
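If the elided part is filled in with the same join as above, the combined statement might look something like this (a sketch; @someTime and @someValue are assumed to be declared, and 1 percent is just the example figure):
-- Delete only 1 percent of the qualifying rows per execution
DELETE TOP (1) PERCENT sl
FROM [SomeList] AS sl
INNER JOIN [uf_deletable_SomeLists](@someTime, @someValue) AS d
    ON sl.[Id] = d.[Id];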
To speed up the select, you could consider adding an index.
On my machine the execution plan was a bit better with the following index.
CREATE NONCLUSTERED INDEX [ix_SomeListAuxData_SomeListId] ON [SomeListAuxData]
( [SomeListId] ASC )
INCLUDE ( [CountData] )
UPDATE:
You can always take the query out of the function and use it directly as a nested query in the delete statement. But creating the function gives you the option to reuse the query, for example to check beforehand which rows would be deleted if you ran the delete. Incidentally, the inline function is basically just a server-side stored parameterised query (I would have used a view if @someTime and @someValue had been constants). The benefits of inline functions have been discussed in other posts and pages, e.g. see here.
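For instance, a quick way to preview the rows the delete would remove (using the example values from the question) could look like this:
-- Preview which SomeList rows the delete would remove
SELECT sl.[Id], sl.[TimeData]
FROM [SomeList] AS sl
INNER JOIN [uf_deletable_SomeLists]('2013-11-07 09:00:00.000', 2000) AS d
    ON sl.[Id] = d.[Id];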
Since you are already referencing the SomeList table from the SomeListAuxData table, why not add "ON DELETE CASCADE" to the foreign key constraint definition on SomeListAuxData?
CREATE TABLE SomeListAuxData
(
Id INT IDENTITY(1,1) PRIMARY KEY,
CountData INT NOT NULL,
SomeListId INT FOREIGN KEY REFERENCES SomeList(Id) ON DELETE CASCADE
);
This way, if you delete the record with an Id of 1 in the SomeList table, any record in SomeListAuxData referencing record 1 will be deleted as well.
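For example, with the cascading constraint in place a single delete against the parent table is enough:
-- Deleting the parent row...
DELETE FROM SomeList WHERE Id = 1;
-- ...also removes every SomeListAuxData row with SomeListId = 1,
-- because of ON DELETE CASCADE.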
As to deleting a % of rows at a time, I like to use a loop. This one deletes 10,000 records at a time, until there are none left.
-- Initial batch; @@ROWCOUNT afterwards seeds the loop condition
DELETE TOP (10000) FROM SomeList WHERE TimeData < @someTime AND Id = @someListId;

WHILE @@ROWCOUNT > 0
BEGIN
    WAITFOR DELAY '00:00:05'; -- optional pause between batches
    DELETE TOP (10000) FROM SomeList WHERE TimeData < @someTime AND Id = @someListId;
END
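If you need to cover the whole list of (@someListId, @someTime) pairs mentioned in the question instead of a single Id, one option (a sketch, assuming a hypothetical table variable the calling code fills) is to drive the same batched loop from that list:
-- Hypothetical table variable holding the (Id, cut-off time) pairs from the caller
DECLARE @pairs TABLE (SomeListId INT PRIMARY KEY, SomeTime DATETIME NOT NULL);
INSERT INTO @pairs (SomeListId, SomeTime)
VALUES (2, '2013-11-07 09:00:00.000'), (4, '2013-11-07 09:00:00.000');

-- First batch
DELETE TOP (10000) sl
FROM SomeList AS sl
INNER JOIN @pairs AS p
    ON sl.Id = p.SomeListId
   AND sl.TimeData < p.SomeTime;

WHILE @@ROWCOUNT > 0
BEGIN
    WAITFOR DELAY '00:00:05'; -- optional pause between batches
    DELETE TOP (10000) sl
    FROM SomeList AS sl
    INNER JOIN @pairs AS p
        ON sl.Id = p.SomeListId
       AND sl.TimeData < p.SomeTime;
END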
Update 14:09
Using the TOP operator, you can also specify a percent of rows.
DELETE TOP(1) PERCENT
FROM SomeList
WHERE TimeData < @someTime AND Id = @someListId AND
(
SELECT COUNT(*)
FROM SomeListAuxData
WHERE SomeList.Id = SomeListAuxData.SomeListId
) > 0
AND
(
SELECT MAX(CountData)
FROM SomeListAuxData
WHERE SomeList.Id = SomeListAuxData.SomeListId
) < @someValue;