I am using MS SQL Server 2016 Clustered Columnstore Indexing (let’s call it CCI) in my reporting database.
In initial designs I was thinking star schema but then I started playing with CCI. Now I have discarded many dimension tables in favor of flattening the strings directly into the “fact” table. The only place I retain the dimension table is when that dimension has attributes that change frequently AND the requirement is to have the changed attribute made applicable to all historical records. I have done this much to the dismay of a colleague who has more DW experience but no free time to explore CCI.
It appears that flat tables stored on disk as individual columns (and the massive compression that affords) need not be narrow at all. When does one still need dimension tables when using CCI?
I don’t think this question can be answered in a way that applies to every RDBMS that supports columnar storage. I’m writing my answer from the point of view of SQL Server, and most of the reasons below depend on implementation details specific to SQL Server.
When does one still need dimension tables when using CCI?
1. The volume of changes to the dimension table makes updating the CCI fact table impractical
With fact tables of 500 M rows you might need to update hundreds of millions of rows in a CCI if some of the dimension columns change in an unlucky way. The only practical ways that I know to do this are to rewrite the whole table or to do a delete + insert. For the delete + insert approach you’ll likely need to write the data for all columns to a staging area, wait for the serial delete queries to finish (unless you can delete by partition), read all of the columns for all of the rows for rowgroups which might contain a row that needs to be changed, and so on. It can be a hassle to code and pretty expensive to convert the data. The problem gets worse as your fact table gets wider.
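As a rough sketch of the delete + insert pattern described above (table and column names here are hypothetical, and switching out whole partitions is usually preferable when the changed rows line up with partition boundaries):

```sql
-- Hypothetical sketch: push a changed dimension attribute into a
-- flattened CCI fact table via delete + insert.

-- 1. Stage the affected rows with the corrected attribute value.
--    All columns must be written out, not just the changed one.
SELECT f.FactKey, f.OtherCol1, f.OtherCol2,
       d.NewAttributeValue AS AttributeValue
INTO dbo.FactStaging
FROM dbo.FactTable AS f
JOIN dbo.ChangedDimension AS d ON f.DimKey = d.DimKey;

-- 2. Delete the old versions (runs serially against a CCI, unless
--    you can switch out an entire partition instead).
DELETE f
FROM dbo.FactTable AS f
WHERE EXISTS (SELECT 1 FROM dbo.ChangedDimension AS d
              WHERE d.DimKey = f.DimKey);

-- 3. Re-insert the corrected rows. Batches of 102,400+ rows can be
--    compressed directly instead of landing in the delta store.
INSERT INTO dbo.FactTable WITH (TABLOCK)
SELECT FactKey, OtherCol1, OtherCol2, AttributeValue
FROM dbo.FactStaging;
```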
2. The length and number of string columns makes CCI compression impractical due to memory limits
Memory grant requests for string columns can get out of control depending on how you’re building the CCI. For example, a REBUILD of a CCI with a VARCHAR(8000) column requests 6.5 GB per DOP, and the grant scales down for shorter string columns. Memory grant requests for CCI inserts time out after 25 seconds (as far as I know there’s no way to change this). This means that some of your CCI insert queries could start writing directly to the delta store (along with deadlocking and other bad things) if you don’t have enough memory to perform the compression.
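One way to spot this happening is to watch for insert or rebuild queries stuck waiting on a memory grant (a sketch; the threshold symptoms will vary with your workload):

```sql
-- Hypothetical check: sessions still waiting on a memory grant.
-- A CCI insert that waits past the ~25 second timeout can fall
-- back to writing uncompressed rows into the delta store.
SELECT session_id, requested_memory_kb,
       granted_memory_kb, wait_time_ms
FROM sys.dm_exec_query_memory_grants
WHERE granted_memory_kb IS NULL;  -- grant not yet given
```

Declaring string columns no longer than the data actually requires is one lever for shrinking the requested grant.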
3. Your ETL or maintenance processes aren’t designed to prevent or clean up delta stores
You mention “massive compression” in your question, but data in delta stores isn’t compressed. If your ETL process creates a heap and later compresses that data into columnstore format then you could be using a lot more temporary space for staging than you’re used to. If you do a lot of parallel inserts into partitioned tables you could end up with thousands or more delta stores for which data won’t be compressed, and so on.
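A sketch of how a cleanup step might look (the index and table names are hypothetical):

```sql
-- Hypothetical check: find rowgroups still sitting in uncompressed
-- delta stores (state OPEN or CLOSED rather than COMPRESSED).
SELECT OBJECT_NAME(object_id) AS table_name, partition_number,
       row_group_id, state_desc, total_rows, size_in_bytes
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE state_desc IN ('OPEN', 'CLOSED');

-- Force delta stores into compressed columnstore rowgroups.
ALTER INDEX CCI_FactTable ON dbo.FactTable
REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);
```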
4. The dimension table has a lot of unique, long strings
SQL Server 2016 is limited to a 16 MB dictionary size per column. If a column has too many unique values then you can exceed that limit and the rowgroup will be cut short due to dictionary pressure. Adding string columns to an existing CCI fact table can therefore result in smaller compressed rowgroups, which reduces both the effectiveness of compression and query performance.
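You can check whether dictionary pressure is actually trimming your rowgroups with something like this (a sketch using the physical-stats DMV):

```sql
-- Hypothetical check: rowgroups cut short because a column's
-- dictionary hit the 16 MB limit. total_rows well under the
-- 1,048,576-row maximum indicates wasted compression potential.
SELECT OBJECT_NAME(object_id) AS table_name, partition_number,
       row_group_id, total_rows, trim_reason_desc
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE trim_reason_desc = 'DICTIONARY_SIZE';
```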