About six months ago, we upgraded a large .NET project from SQL Server 2000 to SQL Server 2012. (I know it's two versions behind already, don't get me started!)
One of the stored procedures in the overnight batch process ran a query that had been taking 55 minutes since the upgrade, but jumped up to 3 hours last week. Prior to the upgrade, the query took 1 minute. I added an index on the table as suggested by the execution plan, and that brought the run time down to 10 minutes. I'm not a DBA, so I'm hoping someone can explain why, as it doesn't make sense to me.
This was the query that caused the problems:
DELETE @ExcludedPersons
FROM @ExcludedPersons
WHERE personID NOT IN
(
    SELECT personID
    FROM PersonStudy AS ps
    WHERE ps.StudyTypeID = 1
    AND ps.StudyClosedDate IS NULL
)
There are 2 indexes on the PersonStudy table:
- A clustered index on StudyID which is the primary key.
- A non-clustered index on PersonID which is also unique.
These indexes were just carried forward from SQL 2000. There are about 60,000 rows in the table. There are about 500,000 rows in the @ExcludedPersons table variable (which has only the one column, personID).
For the last few months, the subquery in the WHERE clause returned 1 row, and the query was taking about 55 minutes. Last week, the final person had a date added to StudyClosedDate, so the subquery returned 0 rows, and that was when the time jumped up to 3 hours.
The index the execution plan suggested had keys of StudyTypeID and StudyClosedDate, with PersonID as an included column; it was this that brought the processing time down to 10 minutes.
So, my question is this:
Why is there this much processing time on a subquery that returns 0 rows and why does the index make such a big difference?
And why the jump in processing time between the subquery returning 1 row and 0 rows? Is it to do with the different way indexes are handled in SQL 2000 and 2012?
The unstated assumption in the question is that the subquery is executed first, and then the outer DELETE is processed. That is not how things work. People write queries that express a logical requirement, and the SQL Server query optimizer tries to find an efficient physical implementation.
The optimizer’s decisions are driven by cost estimates for the various possible physical options it explores.
Garbage In, Garbage Out
By using a table variable, the current arrangement deprives the optimizer of two important pieces of information: the number of rows in the table (cardinality); and the distribution of those values (statistics).
In most cases, the optimizer is unable to see the cardinality of a table variable, and guesses at one row. The physical execution strategy it chooses on the basis that there is one row in the table variable is very likely suboptimal for half a million rows.
Assuming one row, the optimizer may well decide that repeatedly scanning the PersonStudy table looking for matches is a good enough strategy.
In practice, this plan results in the PersonStudy table being scanned 500,000 times at runtime (once per row in the table variable). That is potentially 60,000 * 500,000 = 30 billion rows. No wonder it takes a while.
Given incorrect or incomplete information about the data, the chances are pretty high that the optimizer will deliver a poor execution plan.
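You can see the misestimate for yourself by comparing estimated and actual row counts. A minimal sketch (the table variable name and population method here are illustrative, not from the question):

```sql
-- Sketch: demonstrate the 1-row cardinality guess for a table variable.
-- On SQL Server 2012 (without OPTION (RECOMPILE) or trace flag 2453),
-- the scan of @t in the actual execution plan shows Estimated Number
-- of Rows = 1 against an Actual Number of Rows = 500,000.
DECLARE @t TABLE (personID int NOT NULL);

INSERT @t (personID)
SELECT TOP (500000)
    ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_columns AS a
CROSS JOIN sys.all_columns AS b;

-- Run with "Include Actual Execution Plan" enabled and inspect the
-- table scan operator on @t.
SELECT COUNT_BIG(*) FROM @t;
```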
- Use a temporary table (e.g. #ExcludedPersons) instead of a table variable. This provides accurate cardinality information, and allows SQL Server to automatically create statistics.
- Constrain the personID columns to be NOT NULL. This gives the optimizer useful information and allows it to avoid a common problem with NOT IN over nullable columns (a single NULL in the subquery result means NOT IN matches no rows at all).
- Make the personID column in the temporary table (or table variable) the PRIMARY KEY. Again, this provides useful information to the optimizer (uniqueness, ordering, not null).
- Provide a useful index on the PersonStudy table. The suggested index is a reasonable choice, but there may be better options. A good index provides the optimizer with a more efficient data access path.
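Putting those points together, a sketch of the rewritten step might look like this (table and column names follow the question; the index name is invented):

```sql
-- Temporary table instead of a table variable: accurate cardinality
-- plus automatically created statistics.
CREATE TABLE #ExcludedPersons
(
    personID int NOT NULL PRIMARY KEY  -- uniqueness, ordering, NOT NULL
);

-- ...populate #ExcludedPersons as the batch currently populates
-- the table variable...

-- Index along the lines the execution plan suggested
-- (the name IX_PersonStudy_StudyType_Closed is illustrative).
CREATE NONCLUSTERED INDEX IX_PersonStudy_StudyType_Closed
ON dbo.PersonStudy (StudyTypeID, StudyClosedDate)
INCLUDE (personID);

DELETE #ExcludedPersons
WHERE personID NOT IN
(
    SELECT ps.personID
    FROM dbo.PersonStudy AS ps
    WHERE ps.StudyTypeID = 1
    AND ps.StudyClosedDate IS NULL
);
```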
Especially if you are unable to switch to using a temporary table, test the following (but still add the constraints and indexes/keys mentioned above):
- Add an OPTION (RECOMPILE) hint to the query. This allows the optimizer to see the cardinality of the table variable (though not its statistical distribution) at runtime.
- Or: use an OPTION (HASH JOIN) hint. Hash join scales better than nested loops driving a table scan. The hash join may well spill and reverse roles at runtime, but this should still be very significantly better than what you have right now.
- Or: If your workload often uses table variables with a significant number of rows, test the impact of enabling trace flag 2453. This will expose cardinality as above, without the (typically small) overhead of a plan recompilation.
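If the table variable has to stay, the two hinted variants of the query from the question would look like this (a sketch, not a guarantee of the resulting plan shape):

```sql
-- Variant 1: recompile at each execution, so the optimizer sees the
-- table variable's real row count (500,000) instead of guessing 1.
DELETE @ExcludedPersons
WHERE personID NOT IN
(
    SELECT ps.personID
    FROM dbo.PersonStudy AS ps
    WHERE ps.StudyTypeID = 1 AND ps.StudyClosedDate IS NULL
)
OPTION (RECOMPILE);

-- Variant 2: force hash joins, avoiding the 500,000 repeated scans
-- a nested loops plan would perform.
DELETE @ExcludedPersons
WHERE personID NOT IN
(
    SELECT ps.personID
    FROM dbo.PersonStudy AS ps
    WHERE ps.StudyTypeID = 1 AND ps.StudyClosedDate IS NULL
)
OPTION (HASH JOIN);
```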
The reason the index helps your subquery is that the engine can seek directly to the values you are specifying. Without the index, the engine is most likely scanning through the entire table before returning any rows.
Off the top of my head, I can't think of why it would take 3 hours when it isn't deleting a single row, but I wouldn't trust a table variable for this.
Lastly, would you get better performance by selecting with an inner join (into a new temp table) and then replacing the existing one, rather than loading and then deleting a bunch of records?
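Something along these lines (a sketch, assuming a temp table rather than the table variable; the #StillExcluded name is invented):

```sql
-- Keep only the rows that would have survived the DELETE
-- (those whose personID matches the subquery), instead of
-- deleting everything else.
SELECT ep.personID
INTO #StillExcluded
FROM #ExcludedPersons AS ep
INNER JOIN dbo.PersonStudy AS ps
    ON ps.personID = ep.personID
WHERE ps.StudyTypeID = 1
  AND ps.StudyClosedDate IS NULL;

DROP TABLE #ExcludedPersons;
-- Use #StillExcluded in place of #ExcludedPersons from here on.
```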