Question :
Is there a general Selectivity rule of when to apply Nonclustered index?
We know not to create an index on a bit column, 50/50. “Rows with 50/50 distribution, it might buy you very little performance gain ” Index Bit Field in SQL Server
So how selective should a query be in SQL Server, before an index should be applied? Is there a general rule in SQL Server guidelines? At 25% average selectivity distribution in a column? 10% Selectivity?
This article is stating around 31%? How Selective Should Index Be?
Answer :
Considering column selectivity only when deciding which columns to index ignores quite a bit of what indexes can do, and what they’re generally good for.
For instance, you may have an identity or guid column that’s incredibly selective — unique, even — but never gets used. In that case, who cares? Why index columns that queries don’t touch?
Much less selective indexes, even BIT
columns, can make useful, or useful parts of indexes. In some scenarios, very un-selective columns on large tables can benefit quite a bit from indexing when they need to be sorted on, or grouped by.
Joins
Take this query:
SELECT COUNT(*) AS records
FROM dbo.Users AS u
JOIN dbo.Posts AS p
ON u.Id = p.OwnerUserId;
Without a helpful index on OwnerUserId
, this is our plan with a Hash Join — which spills — but that’s secondary to the point.
With a helpful index — CREATE INDEX ix_yourmom ON dbo.Posts (OwnerUserId);
— our plan changes.
Aggregates
Likewise, grouping operations can benefit from indexing.
SELECT p.OwnerUserId, COUNT(*) AS records
FROM dbo.Posts AS p
GROUP BY p.OwnerUserId;
Without an index:
With an index:
Sorts
Sorting data can be another sticking point in queries that indexes can help.
Without an index:
With our index:
Blocking
Indexes can also help avoid blocking pile-ups.
If we try to run this update:
UPDATE p
SET p.Score += 100
FROM dbo.Posts AS p
WHERE p.OwnerUserId = 22656;
And concurrently run this select:
SELECT *
FROM dbo.Posts AS p
WHERE p.OwnerUserId = 8;
They’ll end up blocking:
With our index in place, the select finishes instantly without being blocked. SQL Server has a way to access the data it needs efficiently.
In case you’re wondering (using the equation Kumar provided) the OwnerUserId column’s selectivity is 0.0701539878296839478
Wrap it up
Don’t just blindly index columns based on how selective they are. Design indexes that help your workload run efficiently. Using more selective columns as leading key columns is generally a good idea when you’re searching for equality predicates, but can be less helpful when searching on ranges.
Formula for Calculating Selectivity:
Suppose you need to calculate selectivity of Col1 in table1
Declare @distinctcount int,@Totoalrows int
Select @distinctcount =count(distinct col1) from table1
select @Totoalrows =count(*) from table1
So Selectivity=@distinctcount /cast(@Totoalrows as decimal(5,2))
Ideal Selectivity =1
i.e that is unique key. So Selectivity should be very near to 1.
We know not to place index on bit column, 50/50. “Rows with 50/50
distribution, it might buy you very little performance gain “
Bit column alone is not very Selective but we can put bit column in composite index
to gain performance.
Or we can use bit column in Filtered index
to gain performance.
This article is stating around 31%?
Read the article carefully. This article is saying Number of data page read during query also matter beside Selectivity.
Two table both with identity column as index is considered to be Ideal Selective.
Table manyrows
with all int column occupy less data page.So less data page are read,so Selectivty of 0.16% is enough.
For table with fewrows
with one column as char(4000)
contain many data pages.So because lot of data page will be read,Selectivity is 31%..
Remember if optimizer has to read data across many data page then optimizer may decide to Scan than seek i.e. Selectivity of Index is ignored.
What you refer to is the “tipping point” at which the SQL Server optimizer decides to go with a table scan instead of an index seek and key lookup.
There are some things to note about that, the tipping point obviously only affects non-clustered indexes (as the key lookup doesn’t have to occur when using a clustered index) and the tipping point also doesn’t come into play when your non-clustered index is covering (either all selected columns are in the key columns or in the included columns of the index).
That being said, the number of rows is not 30% of the total rows. It’s not a fixed value.
The number of rows is somewhere between 25% and 33% percent of the number of pages, so unless you have 1 row per page the percentage of rows is much much smaller.
See the examples by Kimberly Tripp in The Tipping Point Query Answers
If a table has 500,000 pages then 25% = 125,000 and 33% = 166,000. So, somewhere between 125,000 and 166,000 ROWS the query will tip.
Turning that into a percentage 125,000/1million = 12.5% and
166,000/1million = 16.6%. So, if a table has 500,000 pages (and 1
million rows) then queries that return less than 12.5% of the data are
likely to USE the nonclustered index to lookup the data and queries
over 16.6% of the data are LIKELY to use a table scan.If a table has 10,000 pages then 25% = 2,500 and 33% = 3,333. So, somewhere between 2,500 and 3,333 ROWS the query will tip. Turning
that into a percentage 2,500/1million = .25% and 3,333/1million = .33%
(not even 1%). So, if a table has only 10,000 pages (and 1 million
rows) then queries that return less than a quarter of 1% of the data
are likely to USE the nonclustered index to lookup the data and
queries over one third of one percent are LIKELY to use a table scan.If a table has 50,000 pages then 25% = 12,500 and 33% = 16,666. So,
somewhere between 12,500 and 16,666 ROWS the query will tip. Turning
that into a percentage 12,500/1million = 1.25% and 16,666/1million =
1.66% (under 2%). So, if a table has 50,000 pages (and 1 million rows) then queries that return less than 1.25% of the data are likely to USE
the nonclustered index to lookup the data and queries over 1.66% are
LIKELY to use a table scan.
Now to answer your question, how selective should it be before an index is used?
Depending on your row size and number of pages it could be very selective. If you want to make sure your index is used you want your index to be covering.