Question :
When using a subquery to find the total count of all prior records with a matching field, the performance is terrible on a table with as little as 50k records. Without the subquery, the query executes in a few milliseconds. With the subquery, the execution time is upwards of a minute.
For this query, the result must:
- Include only those records within a given date range.
- Include a count of all prior records, not including the current record, regardless of date range.
Basic Table Schema
Activity
======================
Id int Identifier
Address varchar(25)
ActionDate datetime2
Process varchar(50)
-- 7 other columns
Example Data
Id Address ActionDate (Time part excluded for simplicity)
===========================
99 000 2017-05-30
98 111 2017-05-30
97 000 2017-05-29
96 000 2017-05-28
95 111 2017-05-19
94 222 2017-05-30
Expected Results
For the date range of 2017-05-29
to 2017-05-30
Id Address ActionDate PriorCount
=========================================
99 000 2017-05-30 2 (3 total, 2 prior to ActionDate)
98 111 2017-05-30 1 (2 total, 1 prior to ActionDate)
94 222 2017-05-30 0 (1 total, 0 prior to ActionDate)
97 000 2017-05-29 1 (3 total, 1 prior to ActionDate)
Records 96 and 95 are excluded from the result, but are included in the PriorCount
subquery
Current Query
select
*.a
, ( select count(*)
from Activity
where
Activity.Address = a.Address
and Activity.ActionDate < a.ActionDate
) as PriorCount
from Activity a
where a.ActionDate between '2017-05-29' and '2017-05-30'
order by a.ActionDate desc
Current Index
CREATE NONCLUSTERED INDEX [IDX_my_nme] ON [dbo].[Activity]
(
[ActionDate] ASC
)
INCLUDE ([Address]) WITH (
PAD_INDEX = OFF,
STATISTICS_NORECOMPUTE = OFF,
SORT_IN_TEMPDB = OFF,
DROP_EXISTING = OFF,
ONLINE = OFF,
ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON
)
Question
- What strategies could be used to improve the performance of this query?
Edit 1
In reply to the question of what I can modify on the DB: I can modify the indexes, just not the table structure.
Edit 2
I have now added a basic index on the Address
column, but that didn’t seem to improve much. I am currently finding much better performance with creating a temp table and inserting the values without the PriorCount
and then updating each row with their specific counts.
Edit 3
The Index Spool Joe Obbish (accepted answer) found was the issue. Once I added in a new nonclustered index [xyz] on [Activity] (Address) include (ActionDate)
, the query times went down from upwards of a minute to less than a second without using a temp table (see edit 2).
Answer :
With the index definition that you have for IDX_my_nme
, SQL Server will be able to seek using the ActionDate
column but not with the Address
column. The index contains all of the columns needed to cover the subquery but it likely isn’t very selective for that subquery. Suppose that almost all of the data in the table has an ActionDate
value of earlier than '2017-05-30'
. A seek of ActionDate < '2017-05-30'
will return almost all of the rows from the index, which are further filtered down after the row is fetched from the index. If your query returns 200 rows then you would probably doing almost 200 full index scans on IDX_my_nme
, which means you will read around 50000 * 200 = 10 million rows from the index.
It’s likely that seeking on Address
will be far more selective for your subquery, although you haven’t given us full statistical information about the query so that’s an assumption on my part. However, suppose that you create an index on just Address
and your table has 10k unique values for Address
. With the new index, SQL Server will only need to seek 5 rows from the index for each execution of the subquery, so you’ll read around 200 * 5 = 1000 rows from the index.
I’m testing against SQL Server 2016 so there might be some minor syntax differences. Below is some sample data in which I made similar assumptions to the above for data distribution:
CREATE TABLE #Activity (
Id int NOT NULL,
[Address] varchar(25) NULL,
ActionDate datetime2 NULL,
FILLER varchar(100),
PRIMARY KEY (Id)
);
INSERT INTO #Activity WITH (TABLOCK)
SELECT TOP (50000) -- 50k total rows
x.RN
, x.RN % 10000 -- 10k unique addresses
, DATEADD(DAY, x.RN / 100, '20160201') -- 100 rows per day
, REPLICATE('Z', 100)
FROM
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
) x;
CREATE NONCLUSTERED INDEX [IDX_my_nme] ON #Activity
([ActionDate] ASC) INCLUDE ([Address]);
I’ve created your index as described in the question. I’m testing against this query which returns the same data as the one in the question:
select
a.*
, ( select count(*)
from #Activity Activity
where
Activity.[Address] = a.[Address]
and Activity.ActionDate < a.ActionDate
) as PriorCount
from #Activity a
where a.ActionDate between '2017-05-29' and '2017-05-30'
order by a.ActionDate desc;
I get an index spool. What that means at a basic level is that the query optimizer build a temporary index on-the-fly because none of the existing indexes against the table were suitable.
The query still finishes quickly for me. Perhaps you aren’t getting the index spool optimization on your system or there’s something different about the table definition or the query. For educational purposes I can use an undocumented feature OPTION (QUERYRULEOFF BuildSpool)
to disable the index spool. Here’s what the plan looks like:
Don’t be fooled by the appearance of a simple index seek. SQL Server reads nearly 10 million rows from the index:
If I’m going to be running the query more than once then it probably doesn’t make sense for the query optimizer to create an index each time it runs. I could create an index upfront that would be more selective for this query:
CREATE NONCLUSTERED INDEX [IDX_my_nme_2] ON #Activity
([Address] ASC) INCLUDE (ActionDate);
The plan is similar to before:
However, with the new index SQL Server only reads 1000 rows from the index. 800 of the rows are returned to be counted. The index could be defined to be more selective but this could be good enough depending on your data distribution.
If you aren’t able to define any additional indexes on the table I would consider using window functions. The following appears to work:
SELECT t.*
FROM
(
select
a.*
, -1 + ROW_NUMBER() OVER (PARTITION BY [Address] ORDER BY ActionDate) PriorCount
from #Activity a
) t
where t.ActionDate between '2017-05-29' and '2017-05-30'
order by t.ActionDate desc;
That query does a single scan of the data but does an expensive sort and calculates the ROW_NUMBER()
function for every row in the table, so it feels like there’s some extra work done here:
However, if you really like that code pattern you could define an index to make it more efficient:
CREATE NONCLUSTERED INDEX [IDX_my_nme] ON #Activity
([Address], [ActionDate]) INCLUDE (FILLER);
That moves the sort towards the end which will be much less expensive:
If none of this helps then you’ll need to add more information to the question, preferably including actual execution plans.