Incorrect Statistics EQ_ROWS Causing Parallel Table Scans and Deadlocks

Posted on

Question :

On one customer’s system, one of our database’s tables has about 123 million rows. This table is a transaction table that is grouped by a batch key. The clustered index is 4 columns: batch key, transaction version, shift number, transaction number. One batch can have any number of transactions in it, but a query to count by batch shows the maximum number of transactions per batch is 764, and the count drops sharply from that peak.

Some of our queries that join to this transaction table are randomly getting bad query plans that choose to perform parallel index scans on the transaction table. Many of these queries run in separate connections. When this occurs, the parallel scans are causing constant deadlocking on this table.

The query plan shows a very large estimate for the number of rows in a batch– about 20,000. I already know the maximum per batch is 764, so I was puzzled where this estimate comes from.

I ran DBCC SHOW_STATISTICS ('MyTable', 'MyTableClusteredIndex'). The statistics histogram that came back looks weird, because the EQ_ROWS numbers are incorrect. For example, here are the first 3 entries in the histogram. The EQ_ROWS numbers are very different than the actual number of rows found with a query using the same key (shown in the right-most column):

58287 0 1 0 1 121
78045 664341.4 14512.91 1308 507.9888 134
86487 569858.3 14512.91 1208 471.8449 118
94282 574411.7 9070.57 538 1067.721 128

Since this is a production server, I do not want to try to mess with the statistics during business hours.

We have an internal dev server that we use for performance testing. The dev machine has 2 10-core processors on it and is using about 500 GB of memory.
On this server, the transaction table has about 1.1 billion rows. The dev server shows a larger disparity between EQ_ROWS and the actual number of rows for each key. For example, some EQ_ROWS numbers are over 60,000, but the actual number of rows for those keys are between 130 to 160 rows.

If I run a manual UPDATE STATISTICS MyTable on our dev server, the statistics do not change. The EQ_ROWS are still the same incorrect numbers.

So far, I have found the same strange EQ_ROWS behavior on SQL 2014 SP 2 and SQL 2012 SP 3.

Our database servers run with the following trace flags enabled: 1222, 2312, 2453, 4199. I verified that disabling these trace flags globally does not change the behavior when updating statistics.

I verified that the UPDATE STATISTICS MyTable is not running in parallel. When it runs, its degree of parallelism is 1, so it does not look like parallelism is causing some kind of statistics estimation bug.

Aren’t the EQ_ROWS values supposed to be the exact number of rows for that key value? If so, why are they incorrect? Does this look like a SQL Server bug?

(Note that I have ignored parameter sniffing, because this issue looks related to the statistics estimates. The query plan does not look related to parameter sniffing to me except for randomly optimizing the query for one of these keys.)

Answer :

As pointed out by sp_BlitzErik (thank you!), the problem here was the sampling rate was too low for the billion row transaction table. The more rows sampled by the statistics update, the more accurate the estimate, but the longer it takes for SQL Server to generate the statistics. There is a tradeoff between database maintenance time and statistics accuracy.

A simple `update statistics MyTable’ was sampling 0.13% of the rows. This generated a very poor estimate for the “EQ_ROWS” column that was usually hundreds of times worse than the actual row counts. For example, some EQ_ROWS estimates were over 60,000 when the actual counts were about 150. On the other hand, the default stats update completed in 25 seconds– pretty fast for a billion rows.

A full scan generated perfect EQ_ROWS statistics. However, the full scan required about 5 hours to complete. That would not be acceptable for the customer’s database maintenance window.

Below is the chart of samples vs. accuracy. We’re going to recommend the customer adjusts the sample size manually for the largest tables in the system. I expect we’ll go with a sample size of about 10%. The Avg Stat Diff column is calculated as ABS(EQ_ROWS – ActualRowCount).

Sample Size Avg Stat Diff Time to Create Stats
=============== ============= ====================
Default (0.13%) 42,755 00:00:25
1% 4,880 00:02:41
5% 940 00:13:47
10% 443 00:27:45
20% 209 00:56:28
50% 67 02:24:17
100% (fullscan) 0 ~5 hours

Leave a Reply

Your email address will not be published. Required fields are marked *