We are taking legacy flat txt files and inserting them into stage tables with SSIS.The question arose whether table should have primary clustered key index or not. This is direct flat file import with no transformation.
create table dbo.CustomerTransaction ( CustomerName varchar(255), PurchaseLocation varchar(255), Productid int, AmountSold float, CustomerAddress varchar(50) ) create table dbo.CustomerTransaction ( -- discussion for adding this column CustomerTransactionId int primary key clustered identity(1,1) CustomerName varchar(255), PurchaseLocation varchar(255), Productid int, AmountSold float, CustomerAddress varchar(50) ) -- both tables have nonclustered indexes create nonclustered index idx_ProductId on dbo.CustomerTransaction(ProductId) create nonclustered index idx_CustomerAddress on dbo.CustomerTransaction(CustomerAddress) -- Actually have more indexes, tables above are just for sample
1) Before ETL, the staging tables are truncated. There are No Deletes and No Updates. Only Inserts.
truncate table dbo.[CustomerTransaction]
2) Then disable all indexes before ETL.
alter index all on dbo.[CustomerTransaction] DISABLE
3) We conduct SSIS data flow with default fast load, which I read is equivalent to bulk insert. No transformations occur here.
4) Then reenable all indexes after import is done.
alter index all on dbo.[CustomerTransaction] REBUILD
5) The staging tables are then selected on join and where clauses, and placed into datawarehouse. This we why we have nonclustered indexes. After data warehouse is loaded, we truncate the staging tables.
We are hearing information that ETL Stage tables are good as heaps. However, also learning of fragmentation and performance issues with heaps. Reading all the articles below
I am reading conflicting opinions. One says Binary tree clustered are maintenance headaches for import ETL. Other says Heaps have performance issues with fragmentation. Our performance testing does not show much difference, but our data may change later. So we need to make a good design decision.
We know a good reason to have identity is for row labelling, however question is mostly about internals and performance.
We had a similar scenario and recently switched our staging tables from clustered indexes to heaps. The first big advantage for us was that we wanted to allow concurrent SSIS loads into the same staging table. You can do that with a clustered index, but you’ll likely run into a lot of blocking, especially with an identity column. The second big advantage was cutting down on the overhead of loading the staging tables. We found that our loads went much faster on heaps compared to clustered indexes.
Our performance testing does not show much difference, but our data
may change later. So we need to make a good design decision.
Are you sure that this is true? In the question you say that you truncate your staging tables before the load. If some part of your load process changes, it should be very straightforward to add or remove a clustered index while the tables are empty. There’s no data movement involved. It doesn’t sound like you would get any benefit from a clustered index, so I would try it out as a heap and monitor performance.
Having an identity column doesn’t force you to use it as a clustered index key.
You’re right that heaps work well here. I would consider Thomas Kejser to be the authority on the subject, and it’s good you’ve listed him as one of your resources.
As for fragmentation in heaps – doesn’t happen on insert-only.
Edit: Go through this article about parallel insert, and notice the comparisons between heaps and clustered indexes. https://blogs.msdn.microsoft.com/sqlcat/2016/07/21/real-world-parallel-insert-what-else-you-need-to-know/