Does it make sense to add indexes to a clustered columnstore index table?

Posted on

Question :

Question is similar to this one, but the answer doesn’t seem to answer this question.

My understanding of a clustered columnstore table (and please correct me if I am wrong) is that each column is stored in some physically ordered manner, implying that each column already has what amounts to a clustered index on it. If that were the case, it wouldn’t make a lot of sense to add any more indexes on the table… or would it? Perhaps a composite index?

Is my thinking correct on this?

Answer :

My understanding of a clustered columnstore table (and please correct me if I am wrong) is that each column is stored in some physically ordered manner…

All columns are sorted in the same order. I say this to distinguish from a possible reading of your statement that each column is sorted in the best way for that column, which isn’t true.

Each column is encoded and compressed individually in a segment within each row group. The order of operations is:

  1. Encode values in all columns (e.g. value offset, or using a dictionary).
  2. Determine the single ‘optimal’ row ordering.
  3. Compress each column (e.g. RLE, bit-pack).

The common sort order chosen may be great for some segments, but not others from a compression point of view (primarily run length encoding). SQL Server uses Vertipaq technology to choose a sort order that gives a good overall compression result.

…implying that each column already has what amounts to a clustered index on it

This isn’t the right mental picture because column store ‘indexes’ do not support seeks, only b-trees do. SQL Server can locate an individual row in a column store for lookup purposes (as in the linked Q & A) but that isn’t the same as supporting key seeks in general.

Demo of a lookup on a clustered columnstore index.

Clustered column store lookup

The distinction is between finding row #n in row group #m versus finding rows with a given value ‘x’ for a particular column. A column store index can do the former, but not the latter.

Additionally, b-tree indexes can return ordered data; column store indexes cannot (due to encoding) — except for the trivial case of a lookup where at most one row will be returned.

Does it make sense to add indexes to a clustered columnstore index table?

Additional b-tree indexes make sense when we need to locate a single value, or a small range of values. The alternative would be to scan the column store index (albeit with possible row group elimination if the stars align correctly – as well as partition elimination if the table is partitioned).

More generally, b-tree indexes support OLTP-type queries; column store is for analytics. Providing both on the same table enables a hybrid approach (HTAP – Hybrid Transactional/Analytic Processing).

From the Microsoft Research Paper Columnstore and B+ tree –
Are Hybrid Physical Designs Important?
(PDF):

B+ trees outshine columnstores when query
predicates are selective even when all data is memory resident; and
the trade-off shifts further in favor of B+ trees when data is not
memory resident. Likewise, B+ trees can be a better option for providing data in sorted order when server memory is constrained. On
the other hand, columnstores are often an order of magnitude faster
for large scans whether or not the data is memory resident. For
updates, B+ trees are significantly cheaper. Secondary columnstores
incur much lower update cost compared to primary columnstore
indexes, but are still much slower than B+ trees. This empirical
study indicates that for certain workloads, hybrid physical designs
can provide significant performance gains.

It can make sense to do so, yes. Sometimes a business requirement will need to be enforced via a primary key or a unique constraint. Both of those result in an underlying index on a table with a clustered columnstore index.

A clustered columnstore index isn’t a good data structure for some types of queries. Examples where extremely poor performance are possible include string aggregation on SQL Server 2016, recursive queries, and joins without an equality condition. More generally, any query that requires a quick lookup of a small amount of data from a table may not see the best performance from just a clustered columnstore index. It is true that rowgroup elimination can sometimes lead to similar performance relative to a nonclustered index, but that requires loading the data in a certain order and not all data types are supported.

There are downsides to creating nonclustered indexes on tables with a clustered columnstore index. Parallel insert is no longer available, the indexes take up additional storage space, and SQL Server will have fewer options for compression order in some scenarios.

In my experience, I’ve found that it’s best to set a higher bar than usual when considering adding a nonclustered index to a table with a clustered columnstore index. If you truly need it then try adding it but keep the downsides in mind.

Leave a Reply

Your email address will not be published.