I have a simple table in SQL Server 2012 which implements a processing queue. As data has been inserted, the query to retrieve the next item has gone from under 100 ms to a constant 5-6 seconds. I would be hugely grateful if someone could point me to the cause of this sudden dip in performance (it seems to have been an almost overnight drop).
Here’s the table definition:
```sql
CREATE TABLE [dbo].[messagequeue]
(
    [id]             INT IDENTITY (1, 1) NOT NULL,
    [testrunident]   VARCHAR (255) NOT NULL,
    [filesequence]   INT NOT NULL,
    [processed]      BIT NOT NULL,
    [dateentered]    DATETIME NULL,
    [filedata]       VARBINARY (MAX) NULL,
    [retries]        INT NOT NULL,
    [failed]         BIT NOT NULL,
    [msgobject]      VARBINARY (MAX) NULL,
    [errortext]      VARCHAR (MAX) NULL,
    [sourcefilename] VARCHAR (MAX) NULL,
    [xmlsource]      VARCHAR (MAX) NULL,
    [messagetype]    VARCHAR (255) NULL
);

CREATE NONCLUSTERED INDEX [messagequeue_sequenc_failed_idx]
ON [dbo].[messagequeue] ([processed] ASC, [failed] ASC)
INCLUDE ([id], [testrunident], [filesequence]);

CREATE NONCLUSTERED INDEX [messagequeue_sequence_idx]
ON [dbo].[messagequeue] ([testrunident] ASC, [processed] ASC)
INCLUDE ([filesequence]);

CREATE UNIQUE NONCLUSTERED INDEX [IXd_testrun_sequence]
ON [dbo].[messagequeue] ([testrunident] ASC, [filesequence] ASC);
```
And here’s the query used to retrieve the next row to be processed:
```sql
select
    messagequeue.id,
    messagequeue.testrunident,
    messagequeue.filesequence,
    messagequeue.processed,
    messagequeue.filedata,
    messagequeue.retries,
    messagequeue.failed,
    messagequeue.msgobject,
    messagequeue.xmlsource
from messagequeue
where id =
(
    select top 1 id
    from messagequeue mqouter
    where processed = 0
    AND failed = 0
    AND
    (
        filesequence = 0
        OR filesequence =
        (
            select max(filesequence) + 1
            from messagequeue mqinner
            where mqinner.testrunident = mqouter.testrunident
            and mqinner.processed = 1
        )
    )
    order by testrunident, filesequence
)
```
There are multiple rows with the same `testrunident`; each has a `filesequence` which should be sequential. However, some sequence numbers may be missing, so the query should only return the next row where the previous row has `processed = 1`, or where `filesequence = 0`, which denotes the first row within a test run.
Here’s an SQLFiddle to give an idea: SQL Fiddle
Query plan: Query Plan XML
Is there a better way to write the query?
EDIT 1 – Example of ensuring previous row was processed before selecting a row:
Where `id` = testrunident and `fs` = filesequence:

```text
id | fs | processed
 1 |  0 | 1
 1 |  1 | 1
 1 |  2 | 1
 1 |  4 | 0   -- this shouldn't be next as no row with sequence = 3 and processed = 1
 2 |  0 | 0   -- this should be the next row
 2 |  1 | 0
```
Taking just the core part that identifies the `id` of the row to return, the following query encapsulates the logic needed:
```sql
SELECT TOP (1)
    MQ.id
FROM dbo.messagequeue AS MQ
WHERE
    -- Current row
    MQ.processed = 0
    AND MQ.failed = 0
    AND
    (
        EXISTS
        (
            -- Previous row in strict sequence
            SELECT *
            FROM dbo.messagequeue AS MQ2
            WHERE
                MQ2.testrunident = MQ.testrunident
                AND MQ2.processed = 1
                AND MQ2.failed = 0
                AND MQ2.filesequence = MQ.filesequence - 1
        )
        OR MQ.filesequence = 0
    )
ORDER BY
    MQ.testrunident ASC,
    MQ.filesequence ASC;
```
Executing this query efficiently needs a small change to an existing index, whose definition is currently:
```sql
CREATE NONCLUSTERED INDEX [messagequeue_sequenc_failed_idx]
ON [dbo].[messagequeue] ([processed] ASC, [failed] ASC)
INCLUDE ([id], [testrunident], [filesequence]);
```
The change involves moving `testrunident` and `filesequence` from the `INCLUDE` list to the index keys. The new index still supports all the queries the old one did, and as a side-effect of the change, the redefined index can now be marked `UNIQUE`. The following script will perform this change (you can perform this operation `ONLINE` if you are running Enterprise Edition):
```sql
CREATE UNIQUE NONCLUSTERED INDEX [messagequeue_sequenc_failed_idx]
ON [dbo].[messagequeue]
(
    [processed] ASC,
    [failed] ASC,
    [testrunident] ASC,
    [filesequence] ASC
)
INCLUDE ([id])
WITH
(
    DROP_EXISTING = ON
    --, ONLINE = ON
);
```
With this index in place, the execution plan for the revised query is:
To return data from the identified row, the final query is a simple extension:
```sql
SELECT
    MQ3.id,
    MQ3.testrunident,
    MQ3.filesequence,
    MQ3.processed,
    MQ3.filedata,
    MQ3.retries,
    MQ3.failed,
    MQ3.msgobject,
    MQ3.xmlsource
FROM dbo.messagequeue AS MQ3
WHERE MQ3.id =
(
    SELECT TOP (1)
        MQ.id
    FROM dbo.messagequeue AS MQ
    WHERE
        MQ.processed = 0
        AND MQ.failed = 0
        AND
        (
            EXISTS
            (
                SELECT *
                FROM dbo.messagequeue AS MQ2
                WHERE
                    MQ2.testrunident = MQ.testrunident
                    AND MQ2.filesequence = MQ.filesequence - 1
                    AND MQ2.processed = 1
                    AND MQ2.failed = 0
            )
            OR MQ.filesequence = 0
        )
    ORDER BY
        MQ.testrunident ASC,
        MQ.filesequence ASC
);
```
There is another option because you are using SQL Server 2012, which introduced the `LAG` and `LEAD` window functions:
```sql
SELECT TOP (1)
    ML.id
FROM
(
    SELECT
        M.id,
        M.testrunident,
        M.filesequence,
        M.processed,
        M.failed,
        PreviousProcessed = LAG(M.processed) OVER (
            ORDER BY M.testrunident, M.filesequence),
        PreviousFailed = LAG(M.failed) OVER (
            ORDER BY M.testrunident, M.filesequence),
        PreviousFileSequence = LAG(M.filesequence) OVER (
            ORDER BY M.testrunident, M.filesequence)
    FROM dbo.messagequeue AS M
) AS ML
WHERE
    -- Current row
    ML.processed = 0
    AND ML.failed = 0
    AND
    (
        -- Previous row in strict order
        (
            ML.PreviousProcessed = 1
            AND ML.PreviousFailed = 0
            AND ML.PreviousFileSequence = ML.filesequence - 1
        )
        -- First row of a test run
        OR ML.filesequence = 0
    )
ORDER BY
    ML.testrunident,
    ML.filesequence;
```
This query also needs an adjustment to an existing index, this time by adding `processed` and `failed` as included columns:
```sql
CREATE UNIQUE NONCLUSTERED INDEX [IXd_testrun_sequence]
ON [dbo].[messagequeue]
(
    [testrunident] ASC,
    [filesequence] ASC
)
INCLUDE ([processed], [failed])
WITH
(
    DROP_EXISTING = ON
    --, ONLINE = ON
);
```
With that index in place, the execution plan is:
I should also mention that your general approach would be unsafe if more than one process were ever working on the same queue at once. For more information about this and general queue table design see Remus Rusanu’s excellent article Using Tables As Queues.
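As a minimal sketch of the safer pattern described in that article, a dequeue can claim and return a row in a single atomic statement. This is illustrative only: it assumes that setting `processed = 1` is an acceptable way to mark a row as claimed, and it omits the strict-sequence check for brevity.

```sql
-- Sketch of an atomic dequeue-and-claim, after Remus Rusanu's pattern.
-- READPAST lets concurrent workers skip rows already locked by others;
-- UPDLOCK + ROWLOCK claim the candidate row; OUTPUT returns its columns
-- in the same statement, so there is no SELECT-then-UPDATE race window.
WITH NextMessage AS
(
    SELECT TOP (1) MQ.*
    FROM dbo.messagequeue AS MQ WITH (ROWLOCK, UPDLOCK, READPAST)
    WHERE MQ.processed = 0 AND MQ.failed = 0
    ORDER BY MQ.testrunident, MQ.filesequence
)
UPDATE NextMessage
SET processed = 1
OUTPUT inserted.id, inserted.testrunident, inserted.filesequence;
```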
The original execution plan showed a big difference between the number of rows SQL Server expected to process, versus the number actually encountered during execution. Opening the plan with SQL Sentry Plan Explorer shows these differences clearly:
SQL Server chose an execution strategy that would have worked well if the number of rows really had been as small as it had estimated:
Unfortunately, the chosen strategy did not scale well when the estimates turned out to be far too low. Parts of the execution plan were actually executed 458,260 times, which is not a recipe for instant results. Had the SQL Server optimizer known the true numbers, it likely would have chosen a different strategy.
It is tempting to think that a discrepancy between estimated and actual row counts must be due to inaccurate statistical information, but chances are you have reasonably up-to-date single-column automatic statistics on these tables. Improvements might be made by providing additional statistical information, but the root cause of the inaccurate estimates is something quite different in this case.
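For completeness, "additional statistical information" of the kind alluded to would look something like the following sketch; the column combination shown is illustrative, not a recommendation specific to this plan.

```sql
-- Sketch: a multi-column statistics object over the predicate columns.
-- This gives the optimizer density information about the combination
-- (processed, failed) rather than about each column in isolation.
CREATE STATISTICS [stats_messagequeue_processed_failed]
ON [dbo].[messagequeue] ([processed], [failed])
WITH FULLSCAN;
```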
By default, when SQL Server chooses an execution plan for a query it assumes all potential result rows will be returned to the client. However, when SQL Server sees a `TOP` operator, it rightly takes the number of rows specified into account when estimating cardinality. This behaviour is known as setting a row goal.
Essentially, a row goal means the optimizer scales back estimates under the Top operator to reflect the fact that fewer rows will be needed than the full potential size of the result set. This scaling implicitly assumes that values of interest are evenly distributed within the set.
For example, say you have a table with 1000 rows, of which 10 satisfy some query predicate. Using statistics, SQL Server knows that 1% of the rows meet the criteria. If you write a query to return the first row that matches, SQL Server assumes that it will need to read 1% of the table (= 10 rows) in order to find that first match. In the worst case, the ten rows that meet the criteria might appear last (in whatever order they are searched) so 990 rows that do not match will be read before SQL Server encounters the one you want.
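When row-goal scaling backfires like this, one mitigation available on SQL Server 2012 is documented trace flag 4138, which tells the optimizer to ignore row goals and cost the plan as if every qualifying row were needed. This is a sketch only: `QUERYTRACEON` requires elevated permissions (unless applied via a plan guide), and the effect should be verified against the actual execution plan.

```sql
-- Sketch: compile this query without row-goal scaling, so the TOP (1)
-- is costed as though the full qualifying set had to be produced.
SELECT TOP (1) MQ.id
FROM dbo.messagequeue AS MQ
WHERE MQ.processed = 0 AND MQ.failed = 0
ORDER BY MQ.testrunident, MQ.filesequence
OPTION (QUERYTRACEON 4138);
```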
The improved query (using the improved index) still suffers from this problem to a certain extent, but the effects are much less marked:
The fundamental question we are asking of the optimizer's row goal logic here is: how many rows do we need to check before we find the first one (according to the `ORDER BY` clause specification) that is in sequence, unprocessed, and not marked as failed? This is a tough question to answer based on the limited statistics SQL Server keeps.
In fact there is very little we can do to communicate all nuances of this complex problem to the optimizer effectively. The best we can do is to provide it with an efficient index that we expect to locate the qualifying row as early as possible.
In this case, that means providing an index that returns unprocessed, un-failed entries in `ORDER BY` clause order. We expect to have to check relatively few of these before finding the first one that is in sequence (i.e. the preceding row exists and is marked as processed and not marked as failed).
Both solutions shown above eliminate the Key Lookup operations seen in the original query because the new index now includes (covers) all needed columns. Also, the new index is likely to find the target row more quickly. The original execution plan scanned index `IXd_testrun_sequence` in (`testrunident`, `filesequence`) order, meaning it would encounter old test runs first, most of which will be marked processed. We are looking for unprocessed rows in the outer query, so this strategy was inefficient (we end up performing the sequence check on 458,260 rows).
Finally, checking for a particular sequence value is much more efficient than finding the maximum of a potentially large set. This is the difference I have highlighted in the previous code as looking for the previous row in strict sequence. There is a semantic difference between these two queries and the solution using `MAX` shown in the question: my understanding is that you are interested in the first matching row, and that row need not have the highest `filesequence` for the test run.