I hope you guys can help me here.
Our application polls a message table every 3 seconds looking for notifications to send out. This works well on all our Customers (single-tenant DB) except one. They will have no activity for 23hours a day and then load thousands of messages all at once (3000+). In other cases, this volume is nothing and we can easily deal with it, except in this case, the SQL query below takes approximately 30 seconds to run and gets worse as the queue backs up as it does an update, require an exclusive lock and therefore blocking all the other queries and so the issues causes all kinds of havoc.
This is all due to a bad query plan.
We have daily reindexing that runs every morning at 5am (reorganise < 30%, rebuild > 30%, ignore <5%) as well as updates statistics. These are both from the Ola Hallengren maintenance solution.
We are also on SQL Server 2016 and are fully up to date (13.0.5492.2)
I don’t have the 2 plans handy, but basically the bad plan goes and does a full table scan of the MessagesSent table (3.5m rows).
My theory is that because the query returns nothing all day, certain parts are not executed and therefore the bad query is the most efficient query for SQL.
This will continue after flushing the plan for the query as it just generates the same plan, however when I UPDATE STATISTICS on the MessagesSent table, the good plan is created and all is healthy, with the query executing in about 10-30ms.
Does anyone know how I can fine-tune this to always use the better plan even if no data exsists for the query to return?
As a hotfix, we have added option recompile to the application but I don’t feel that’s the ideal solution for a query that is executing ever 3 seconds.
Here is the query :
WITH TopMessage AS ( SELECT TOP 1 ID, BatchID FROM MessagesSent JOIN Units ON Unit = idUnit WHERE MessageDate <= GETDATE() AND Active = 'True' AND Status = 'Queued' AND NOT(DialString = 'null') AND Unit = ('29') AND System in ('SystemName', 'Q1', '') ORDER BY CASE WHEN QPriority IS NULL THEN CASE WHEN DefaultPriority IS NULL THEN 999999 ELSE DefaultPriority END ELSE QPriority END ASC, Retries ASC, MessageDate ASC, ID ASC ), BatchCalls AS ( SELECT * FROM MessagesSent WHERE ( (LEN(BatchID) > 0 AND BatchID = (SELECT TOP 1 BatchID FROM TopMessage) ) OR ID = (SELECT TOP 1 ID FROM TopMessage) ) AND Status = 'Queued' AND Active = 'True' ) UPDATE BatchCalls SET LastUpdated = @dtNow OUTPUT INSERTED.* WHERE Status = 'Queued'
Thank you very much for your time and for looking.
You could use a manual plan guide to enforce the desired plan.
Alternatively, you could use Query Store to enforce plan guides via the GUI.
Also, you could run an update statistics job after the big load of messages. I wrote a post on my blog showing an easy way to do that.
Since you have some common filtering between those queries, you might be able to speed things up, and reduce the amount of locking / blocking going on, by adding a filtered index on the “Active” and “Status” columns.
It’s difficult to tell which columns belong to which tables without the schema, but the index would look something like this:
CREATE NONCLUSTERED INDEX IX_BatchID_Filtered ON dbo.MessagesSent (BatchID, Active, Status) WHERE Active = 'True' AND Status = 'Queued';
Note: there might be a better leading column than BatchID, and you might want to include other columns from the table (like DialString, System, QPriority, etc) – I just wasn’t which columns belonged to which tables, and what their data types are, etc.
This index would only include the (hopefully small) subset of rows that meet those criteria, and provide more predictable performance in the face of somewhat out-of-date statistics.