I know that when SQL Server chooses between using an index and a table scan, it uses statistics to estimate which one is better.
I have a table with 20 million rows. I have an index on (SnapshotKey, Measure) and this query:
select Measure, SnapshotKey, MeasureBand
from t1
where Measure = 'FinanceFICOScore'
group by Measure, SnapshotKey, MeasureBand
The query returns about 500k rows, so it selects only 2.5% of the table's rows.
The question is: why does SQL Server use a table scan instead of the nonclustered index I have?
Statistics are updated.
It's worth mentioning that query performance is good, though.
CREATE TABLE [t1](
    [SnapshotKey] [int] NOT NULL,
    [SnapshotDt] [date] NOT NULL,
    [Measure] [nvarchar](30) NOT NULL,
    [MeasureBand] [nvarchar](30) NOT NULL
    -- and many more fields
) ON [PRIMARY]
No PK on table, as it is a data warehouse.
CREATE NONCLUSTERED INDEX [nci_SnapshotKeyMeasure] ON [t1]
(
    [SnapshotKey] ASC,
    [Measure] ASC
)
An index seek might not be the best choice if you return many rows and/or the rows are very wide. Lookups can be expensive when your index is not covering.
In your scenario, the query optimizer estimates that performing roughly 500,000 individual lookups (one per matching row) will be more expensive than a single scan. The optimizer's choice between a scan and a seek (with RID lookups to fetch the columns needed by the query but not present in the nonclustered index) is based on the estimated cost of each alternative.
The optimizer always chooses the lowest cost alternative it considers. If you look at the Estimated Subtree Cost property in the root node of the two execution plans, you will see that the scan plan has a lower estimated cost than the seek plan. As a result, the optimizer chose the scan. That is essentially the answer to your question.
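You can see the comparison the optimizer made by forcing the seek plan with an index hint and comparing the I/O of both variants. This is a diagnostic sketch only; the hint is for comparison, not for production use:

```sql
SET STATISTICS IO, TIME ON;

-- The plan the optimizer chose (table scan)
SELECT Measure, SnapshotKey, MeasureBand
FROM t1
WHERE Measure = N'FinanceFICOScore'
GROUP BY Measure, SnapshotKey, MeasureBand;

-- Force the nonclustered index to see the cost of the RID lookups
SELECT Measure, SnapshotKey, MeasureBand
FROM t1 WITH (INDEX (nci_SnapshotKeyMeasure))
WHERE Measure = N'FinanceFICOScore'
GROUP BY Measure, SnapshotKey, MeasureBand;
```

Comparing the "logical reads" figures (and the Estimated Subtree Cost in each plan's root node) shows the trade-off the optimizer priced.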
Now, the cost model used by the optimizer is based on assumptions and “magic numbers” that are quite unlikely to match your system’s performance characteristics. In particular, one assumption made in the model is that the query starts executing with none of the required data or index pages already in memory. Another is that sequential I/O (expected for a scan) is cheaper than the random I/O pattern assumed for RID Lookups. There are many other such assumptions and caveats, far too many to go into in detail here.
Nevertheless, the cost model as a whole has been shown to produce generally “good enough” plans for most queries, on most database schemas, on most hardware configurations, most of the time, everywhere. That is quite an achievement, if you think about it.
Model limitations and other factors will sometimes mean the optimizer chooses a plan that is not, in fact, “good enough” at all. You report that “performance is good”, so that does not seem to be the case here.
You actually have 595,947 matching rows, which is about 3% of the table, so the cost of the lookups adds up quickly. Suppose you have 100 rows per page: a table scan of 20 million rows reads about 200,000 pages sequentially. That is a lot cheaper than performing 595,947 random lookups, each of which reads at least one page.
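You can replace the 100-rows-per-page assumption with your table's real numbers by reading the allocation DMVs (a diagnostic sketch; the table name is taken from the question):

```sql
SELECT SUM(ps.in_row_data_page_count) AS data_pages,
       SUM(ps.row_count)              AS rows_total
FROM sys.dm_db_partition_stats AS ps
WHERE ps.object_id = OBJECT_ID(N't1')
  AND ps.index_id IN (0, 1);  -- 0 = heap, 1 = clustered index
```

Dividing `rows_total` by `data_pages` gives the actual rows-per-page figure for the back-of-the-envelope comparison above.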
Given the GROUP BY clause in the question, I think you'll be better off with a composite index on (Measure, SnapshotKey, MeasureBand).
Look at the "missing index" suggestion in the execution plan. It tells you which columns to include to avoid the lookups. More generally, if you reference other columns in your query, they will need to be in the keys or the INCLUDE clause of the new index; otherwise it will still need to do the 595,947 lookups to get those values.
For example, for the query:
select Measure, SnapshotKey, MeasureBand, SUM(NumLoans), SUM(PrinBal)
from t1
where Measure = 'FinanceFICOScore'
group by Measure, SnapshotKey, MeasureBand
…you would need:
CREATE INDEX ixWhatever ON t1 (Measure, SnapshotKey, MeasureBand)
    INCLUDE (NumLoans, PrinBal);
The field in your WHERE condition is not the leading field of the index. Also, Measure is defined as NVARCHAR, so prefix the string literal with an N:

where Measure = N'FinanceFICOScore'
Consider creating a Clustered Index on SnapshotKey. If it is unique, it can be a PK (and Clustered). If it is not unique, it cannot be a PK, but it can still be a non-unique Clustered Index. Then your NonClustered index would only need to be on Measure. And, considering that the first field in the GROUP BY is also Measure, the query would also benefit from having Measure be the leading field.
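A minimal sketch of that Clustered Index (the index name is hypothetical; use CREATE UNIQUE CLUSTERED INDEX instead if SnapshotKey turns out to be unique):

```sql
CREATE CLUSTERED INDEX [ci_SnapshotKey]
    ON [t1] ([SnapshotKey] ASC);
```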
In fact, for this operation, you might instead need to define the NonClustered Index on (Measure, SnapshotKey, MeasureBand), in that exact order, as it matches the GROUP BY clause. Size-wise that only really adds MeasureBand, since the NonClustered index is already based on Measure, and SnapshotKey is already included in the index as it is now the Clustered Index key (no, SnapshotKey won't be duplicated in the NonClustered index).
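A sketch of that NonClustered Index (the index name is hypothetical):

```sql
CREATE NONCLUSTERED INDEX [nci_Measure_SnapshotKey_MeasureBand]
    ON [t1] ([Measure] ASC, [SnapshotKey] ASC, [MeasureBand] ASC);
```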
@Rob mentioned in a now-deleted comment on his answer that solving this issue requires only that the NonClustered Index be defined with these three fields in this order, and that creating a (non-unique) Clustered Index on SnapshotKey isn't necessary. While he is probably correct (I was hoping that fewer fields would work), I would still contend that having the Clustered Index is beneficial, not just for this operation but probably for most others.