Is the Query Store’s Overall Resource Consumption report telling me my database is doing terribly or are the numbers just whacked?

Posted on

Question :

Recently we upgraded one of our production servers from SQL Server 2016 to SQL Server 2019 (CU 15).
This was a great opportunity for me to enable Query Store on our main application database.
It’s been running for a couple days now and this is what the Overall Resources Consumption Report is showing:

Overall Resources Consumption Report

In the screenshot, I cherry picked a few numbers that looked insane (for the first day the Query Store was enabled) and normalized them to easier units of measure to discuss. Things like ~183 TB of data consumed by Logical Reads, or ~5 TB of data consumed by Memory, in a single day, seem almost impossible on this server.

This database is the John Smith of databases, at only 100 GB big for the Data file and 200 GB for the Log file. At most, maybe 100 different users connect to it throughout the day, and there’s not a ton of transactions being created in a single day. The server itself only has 32 GB of Memory provisioned for it. For 5 TB of Memory to be consumed, the allocated Memory would need to be filled up over 150 times throughout the day.

The only other potentially relevant information I can think to add is after the upgrade, we immediately set the “Compatibility Level” of this database to 150 (SQL Server 2019) and left the “Legacy Cardinality Estimation” setting off. I know it’s not ideal, and better to let the dust settle while gathering baseline metrics, but part of the reason for the upgrade was to fix some urgent performance issues which these combination of settings actually worked the best for from our testing (and still seem to be working great).

Some of the previous performance issues we were having were due to insane Cardinality Estimates, which if the Query Store was using the Estimated data points for, then I could actually see this report’s numbers being correlative, but I’d have to imagine the report is using Actual data points? Though it would be interesting if this was another sign of something fundamentally wrong with how my production server / database is configured on my continuous conquest to slay Cardinality Estimate issues.

Am I reading these numbers wrong, is Query Store bugging out, or is my server toast?

Answer :

My workstation can do about 100K-150K LIO/CPU Second. A Logical IOs (LIO) is reading a single 8KB page from page cache. Ie when running a big parallel scan of a cached table with a trivial query plan I get IO stats like:

Table ‘frs_big’. Scan count 9, logical reads 519631, physical
reads 0, page server reads 0, read-ahead reads 0, page server
read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob
page server reads 0, lob read-ahead reads 0, lob page server
read-ahead reads 0.

(1 row affected)

SQL Server Execution Times: CPU time = 5109 ms, elapsed time
= 655 ms.

And with MAXDOP 1 the same scan takes only 3625 CPU ms. And it’s running Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz.

Over the period you’ve got 22,902,891,616KB of Logical IO which, each LIO is 8KB, so 22902891616/8 LIO, and 20,393,171ms of CPU time. So

with q as
(
  select 22902891616/8 lio, 20393171 cpu 
)
select 1000*lio/cpu lio_per_cpu_sec, lio/(24*60*60) lio_per_sec
from q 

or

lio_per_cpu_sec                         lio_per_sec
--------------------------------------- ---------------------------------------
140,383                                 33,134

140,383 LIO/CPU Second. So the number of logical IOs roughly corresponds to the total CPU usage, and you’re averaging 33,000 LIO/Sec over the 24 hour period.

Given that you upgraded to 2019 with all the newer optimizer behavior you probably have some bad plans that are causing excessive scanning, but the speed of the server, and the moderate size is keeping your cache hit ratio very high and keeping the perceived performance acceptable.

I’m not sure what that memory metric means. This is the query behind the report:

exec sp_executesql N'WITH DateGenerator AS
(
SELECT CAST(@interval_start_time AS DATETIME) DatePlaceHolder
UNION ALL
SELECT  DATEADD(d, 1, DatePlaceHolder)
FROM    DateGenerator
WHERE   DATEADD(d, 1, DatePlaceHolder) < @interval_end_time
), WaitStats AS
(
SELECT
    ROUND(CONVERT(float, SUM(ws.total_query_wait_time_ms))*1,2) total_query_wait_time
FROM sys.query_store_wait_stats ws
    JOIN sys.query_store_runtime_stats_interval itvl ON itvl.runtime_stats_interval_id = ws.runtime_stats_interval_id
WHERE NOT (itvl.start_time > @interval_end_time OR itvl.end_time < @interval_start_time)
GROUP BY DATEDIFF(d, 0, itvl.end_time)
),
UnionAll AS
(
SELECT
    CONVERT(float, SUM(rs.count_executions)) as total_count_executions,
    ROUND(CONVERT(float, SUM(rs.avg_duration*rs.count_executions))*0.001,2) as total_duration,
    ROUND(CONVERT(float, SUM(rs.avg_cpu_time*rs.count_executions))*0.001,2) as total_cpu_time,
    ROUND(CONVERT(float, SUM(rs.avg_logical_io_reads*rs.count_executions))*8,2) as total_logical_io_reads,
    ROUND(CONVERT(float, SUM(rs.avg_logical_io_writes*rs.count_executions))*8,2) as total_logical_io_writes,
    ROUND(CONVERT(float, SUM(rs.avg_physical_io_reads*rs.count_executions))*8,2) as total_physical_io_reads,
    ROUND(CONVERT(float, SUM(rs.avg_clr_time*rs.count_executions))*0.001,2) as total_clr_time,
    ROUND(CONVERT(float, SUM(rs.avg_dop*rs.count_executions))*1,0) as total_dop,
    ROUND(CONVERT(float, SUM(rs.avg_query_max_used_memory*rs.count_executions))*8,2) as total_query_max_used_memory,
    ROUND(CONVERT(float, SUM(rs.avg_rowcount*rs.count_executions))*1,0) as total_rowcount,
    ROUND(CONVERT(float, SUM(rs.avg_log_bytes_used*rs.count_executions))*0.0009765625,2) as total_log_bytes_used,
    ROUND(CONVERT(float, SUM(rs.avg_tempdb_space_used*rs.count_executions))*8,2) as total_tempdb_space_used,
    DATEADD(d, ((DATEDIFF(d, 0, rs.last_execution_time))),0 ) as bucket_start,
    DATEADD(d, (1 + (DATEDIFF(d, 0, rs.last_execution_time))), 0) as bucket_end
FROM sys.query_store_runtime_stats rs
WHERE NOT (rs.first_execution_time > @interval_end_time OR rs.last_execution_time < @interval_start_time)
GROUP BY DATEDIFF(d, 0, rs.last_execution_time)
)
SELECT 
    total_count_executions,
    total_duration,
    total_cpu_time,
    total_logical_io_reads,
    total_logical_io_writes,
    total_physical_io_reads,
    total_clr_time,
    total_dop,
    total_query_max_used_memory,
    total_rowcount,
    total_log_bytes_used,
    total_tempdb_space_used,
    total_query_wait_time,
    SWITCHOFFSET(bucket_start, DATEPART(tz, @interval_start_time)) , SWITCHOFFSET(bucket_end, DATEPART(tz, @interval_start_time))
FROM
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY bucket_start ORDER BY bucket_start, total_duration DESC) AS RowNumber
FROM UnionAll , WaitStats
) as UnionAllResults
WHERE UnionAllResults.RowNumber = 1
OPTION (MAXRECURSION 0)',N'@interval_start_time datetimeoffset(7),@interval_end_time datetimeoffset(7)',@interval_start_time='2022-01-03 15:57:04.2571919 -06:00',@interval_end_time='2022-02-03 15:57:04.2571919 -06:00'

and

ROUND(CONVERT(float, SUM(rs.avg_query_max_used_memory*rs.count_executions))*8,2) as total_query_max_used_memory,

doesn’t seem to me to be a terribly useful metric.

As a general remark I don’t think anyone here can answer your question with 100% confidence. We have no baseline. Nevertheless here are some ideas on what you could do:

Do you have any metrics collected on both servers? Can you compare them?

The numbers in the popup seem to be cumulative for the whole database for all queries. If my math is right, the average query duration is 400ms. Is this comparable to the old server?

Compare CPU usage. For one day it’s 5,5h on the new server. Assuming 1 logical CPU it’s ~25% CPU average usage. Asuming 2 logical CPUs it’s 12,5% etc. We don’t have the numbers for either server, but knowing average CPU usage and logical processor count on both you should be able to compare CPU time these two. Assuming 2 logical CPUs I would say the usage is low. The wait stats are also comparatively low (I assume Wait Time in the screenshot means waits). So unless you have poison waits you’re fine in my opinion.

In general I would compare wait stats between old and new server. If the new server waits less, then most probably you are fine.

Finally: are the users complaining that some parts of the application are slower now? That would be a strong hint that some queries regressed.

Leave a Reply

Your email address will not be published.