Reduce rows read by the database engine for this aggregate query

Question :

I would like to tune this query – it is reading more rows than it should. I’ve added an index as suggested by the Database Tuning Advisor.

The query goes like this:

SET STATISTICS TIME, IO ON;

SELECT TOP 1000 project,
    Max(timestamp) timestamp
FROM   the_table
WHERE  deviceid = 581
GROUP  BY project

The IO and time statistics look like this:

Table 'the_table'. Scan count 1, logical reads 2373, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row affected)

 SQL Server Execution Times:
   CPU time = 390 ms,  elapsed time = 829 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

The execution plan can be found here.

The dta_index this query seeks on consists of projectnumber and timestamp. The table has roughly 4,500,000 rows, consuming approximately 350 MB, and each deviceid has roughly the same number of records (i.e. a uniform distribution, with roughly 700K records per device).
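For reference, the DTA index presumably looks something like the following sketch (its exact name and definition aren’t shown in the question; what matters is that it is seekable on DeviceID and covers the other two columns):

-- Sketch of the DTA-suggested index (name and exact shape assumed):
CREATE NONCLUSTERED INDEX dta_index
    ON dbo.the_table (DeviceID)
    INCLUDE (Project, TimeStamp);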

My problem: even though I am requesting only the first 1,000 rows from the table, no matter what index I create or how I rewrite the query, I always end up reading +/- 700K rows (i.e. all the rows for the deviceid selected in the WHERE clause). I would be grateful for any tips on how to tune this query.

I don’t think there is a way to narrow this query down faster than touching the approximately 700K rows for a particular deviceid with the first operator of the execution plan.

Here’s a minimal reproducible example, including the table definition and insertion of two sample rows:

CREATE TABLE [dbo].[the_table]
(
    [primary_key_col] [int] IDENTITY(1,1) NOT NULL,
    [Project] [int] NULL,
    [DeviceID] [int] NULL,
    [TimeStamp] [datetime] NULL,
    [Quantity] [int] NOT NULL,
    PRIMARY KEY CLUSTERED 
    (
        [primary_key_col] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY];

SET IDENTITY_INSERT [dbo].[the_table] ON;
INSERT [dbo].[the_table] ([primary_key_col], [Project], [DeviceID], [TimeStamp], [Quantity]) VALUES (4825107, 1232, 587, CAST(N'2021-10-23T05:48:48.000' AS DateTime), 1);
INSERT [dbo].[the_table] ([primary_key_col], [Project], [DeviceID], [TimeStamp], [Quantity]) VALUES (4825108, 6458, 581, CAST(N'2021-10-23T05:49:17.000' AS DateTime), 1);
SET IDENTITY_INSERT [dbo].[the_table] OFF;
primary_key_col  Project  DeviceID  TimeStamp                Quantity
4825107          1232     587       2021-10-23 05:48:48.000  1
4825108          6458     581       2021-10-23 05:49:17.000  1

The indexes are deliberately not part of the question; no matter which index I tried, I always hit 700K rows for a particular deviceID.

Thanks.

Answer :

“My problem: even though I am requesting only the first 1,000 rows from the table”

You are not. You are reading the first 1000 results (without any ordering) of the grouped query – TOP 1000 is applied last in the logical processing order.
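Spelled out, the logical processing order of the original query looks like this (an annotated sketch):

-- Logical processing order of the original query:
-- 1. FROM   the_table
-- 2. WHERE  deviceid = 581          -- all ~700K rows for the device qualify here
-- 3. GROUP  BY project              -- collapsed to one row per project
-- 4. SELECT project, MAX(timestamp)
-- 5. TOP 1000                       -- applied last, to the grouped result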

If you want to take a “random” 1000 rows that match the deviceid predicate, then you could use a subquery:

select project, max(timestamp) timestamp
from (select top 1000
             project, timestamp
      from   the_table
      where  deviceid = 581
     ) sq
group by project
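Note that without an ORDER BY inside the subquery those 1000 rows are an arbitrary set. If the intent is a deterministic set – say, the 1000 most recent readings for the device (an assumption about intent) – you could order the inner query:

select project, max(timestamp) timestamp
from (select top 1000
             project, timestamp
      from   the_table
      where  deviceid = 581
      order  by timestamp desc
     ) sq
group by project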

Sorry, but from my point of view you should rethink your query logic, because it gets all rows matching the deviceid and takes the TOP 1000 from those matching rows, not the TOP 1000 from the table. You can see this in your execution plan as well: on the right-hand side is a seek for deviceid, and TOP is the last operator on the left side. That’s why you read this many rows.
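As a side note: if the number of distinct projects per device is small, an index skip scan can be simulated with a recursive CTE plus CROSS APPLY, so that only a handful of rows per project are touched instead of all 700K. This is only a sketch: it assumes an index keyed on (DeviceID, Project, TimeStamp), non-NULL Project values, and the index and CTE names here are made up:

-- Assumed supporting index (key order matters):
-- CREATE INDEX ix_device_project_ts ON dbo.the_table (DeviceID, Project, TimeStamp);
WITH DistinctProjects AS
(
    -- anchor: the lowest project for the device (one index seek)
    SELECT project
    FROM   dbo.the_table
    WHERE  deviceid = 581
      AND  project IS NOT NULL      -- NULL projects would need separate handling
    ORDER  BY project
    OFFSET 0 ROWS FETCH FIRST 1 ROW ONLY

    UNION ALL

    -- recursive step: the next higher project (one seek per distinct project)
    SELECT r.project
    FROM  (SELECT t.project,
                  rn = ROW_NUMBER() OVER (ORDER BY t.project)
           FROM   dbo.the_table t
           JOIN   DistinctProjects d
             ON   t.project > d.project
           WHERE  t.deviceid = 581) r
    WHERE  r.rn = 1
)
SELECT TOP 1000
       d.project,
       ts.timestamp
FROM   DistinctProjects d
CROSS APPLY (SELECT TOP 1 t.timestamp      -- the MAX(timestamp) for this project
             FROM   dbo.the_table t
             WHERE  t.deviceid = 581
               AND  t.project = d.project
             ORDER  BY t.timestamp DESC) ts
OPTION (MAXRECURSION 0);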
