I have some tables used in reporting, and some of them grow huge with the daily ETL, so I have implemented some jobs that delete rows more than x days old. E.g. every day 10% new data gets added, and another job deletes the oldest 10% of rows.
Do I need to do something for efficiency? Some tables are heaps with nonclustered indexes, and some also have clustered indexes. E.g. do I need to rebuild the tables in the case of heaps, or rebuild the index in the case of a clustered index on the table? If so, how often? Most tables are only used once per day; when all joins and calculations are done, they get extracted for visualization in a non-live manner.
I know a little about index fragmentation and heap fragmentation, and that one can query the fragmentation percentage.
How much do I need to worry about the above scenario?
You can identify and resolve heap / index fragmentation following this guide: How to identify and resolve SQL Server Index Fragmentation
You likely only need to reorganize, and not worry so much about rebuilding as often. There are recommendations for when to use one over the other, but rebuilding is a heavier operation that likely doesn’t yield much more benefit than reorganizing.
While 20 million rows is a nice little chunk of data, it isn’t anything enormous by any means. I would still recommend always using clustered indexes; you’ll see the biggest benefits during the JOINs and calculations on those tables.
Using clustered indexes in a smart way will also reorganize your data for you, in a sense. What I mean is: if every day you load 20 million records into an empty table and then add a clustered index after the data has been fully loaded, SQL Server will sort the data efficiently on that clustered index key as it builds the index. That can take the place of manual rebuild / reorganize operations, which are pretty heavy operations in their own right. Adding the clustered index after the load completes is usually the most efficient way to get the data reorganized.
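A rough sketch of that load-then-index pattern. The table and index names (dbo.DailyReport, CIX_DailyReport_OrderDate, OrderDate) are placeholders, not from the original question:

```sql
-- Drop the clustered index (if present) so the daily load goes into a heap.
IF EXISTS (SELECT 1 FROM sys.indexes
           WHERE object_id = OBJECT_ID('dbo.DailyReport')
             AND name = 'CIX_DailyReport_OrderDate')
    DROP INDEX CIX_DailyReport_OrderDate ON dbo.DailyReport;

TRUNCATE TABLE dbo.DailyReport;

-- ... bulk load the day's rows into the heap here ...

-- Creating the clustered index afterward sorts the data once,
-- leaving the table compact and ordered on the key.
CREATE CLUSTERED INDEX CIX_DailyReport_OrderDate
    ON dbo.DailyReport (OrderDate);
```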
Check out the online documentation for the DELETE statement, especially the statement on locking behavior since the underlying table is a heap.
Converting the heap to a clustered index (as discussed in that section of the documentation) can also help, BUT it comes with a big caveat: insert behavior differs between a heap and a clustered index, and that can have a significant impact on ETL performance, especially with concurrent inserts into the table.
If the underlying table has a clustered index, then for the CI and all non-clustered indexes both rebuild and reorganize are available. Either can maintain a high average percent full in the data pages; there are many factors in choosing between them that are out of scope for the original question.
That section of the DELETE documentation also discusses the TABLOCK hint, which is especially important to consider if the table is to remain a heap rather than be converted to a clustered index. If the TABLOCK hint can be added to the DELETE statement, the table-level lock allows pages to be deallocated from the heap when the delete empties them, keeping the heap more compact.
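For illustration, a minimal purge job using the hint (table name, column, and the 30-day retention window are all placeholders):

```sql
-- The TABLOCK hint lets SQL Server take a table lock, so pages
-- emptied by the delete can be deallocated from the heap.
DELETE FROM dbo.DailyReport WITH (TABLOCK)
WHERE OrderDate < DATEADD(DAY, -30, CAST(GETDATE() AS date));
```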
If the table remains a heap, its non-clustered indexes have both reorganize and rebuild options, but the heap itself does not have a reorganize option. There is, however, an ALTER TABLE … REBUILD option. This option should be used with great care: it results in all non-clustered indexes on the table being rebuilt as well. Paul Randal discusses this in the blog post below.
So, changing the heap to a clustered index is an important schema change that should be considered carefully.
If the tablock hint can be added to the DELETE, I recommend that.
If ALTER TABLE … REBUILD is needed, I recommend running it as seldom as possible – hopefully only once, and afterward leaning on the TABLOCK hint.
And use reorganize or rebuild to maintain the non-clustered indexes as necessary.
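The statements involved, sketched with placeholder table and index names:

```sql
-- Heap rebuild (heavy: this also rebuilds every nonclustered
-- index on the table) - use sparingly.
ALTER TABLE dbo.DailyReport REBUILD;

-- Routine maintenance of a single nonclustered index:
ALTER INDEX IX_DailyReport_LastName ON dbo.DailyReport REORGANIZE;

-- Or, when fragmentation is severe:
ALTER INDEX IX_DailyReport_LastName ON dbo.DailyReport REBUILD;
```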
I wouldn’t worry about fragmentation; instead, I would think about partitioning (if your version and edition allow it), so that instead of DELETEing rows you truncate (switch out) partitions.
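A sketch of the switch-out approach, assuming a hypothetical dbo.DailyReport partitioned by date and a staging table dbo.DailyReport_Stage with an identical structure on the same filegroup:

```sql
-- Move the oldest partition's rows out as a metadata-only operation,
-- then discard them instantly.
ALTER TABLE dbo.DailyReport
    SWITCH PARTITION 1 TO dbo.DailyReport_Stage;

TRUNCATE TABLE dbo.DailyReport_Stage;

-- On SQL Server 2016+ you can skip the staging table and
-- truncate the partition directly:
-- TRUNCATE TABLE dbo.DailyReport WITH (PARTITIONS (1));
```

Either way, old data disappears without generating the logging and fragmentation of a large DELETE.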
I’d like to focus specifically on the re-use-of-space aspect. (Lots of useful information has already been given in the other answers.)
In essence, you should monitor the size of each heap and b-tree over time. Or at least as a one-time thing before and after deleting data.
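One way to do that monitoring, as a sketch (the table name is a placeholder), is to snapshot page and row counts per heap/index from the partition-stats DMV:

```sql
-- Page and row counts for every heap and b-tree on the table;
-- log this over time, or before and after the delete job.
SELECT i.index_id,
       i.type_desc,
       SUM(ps.used_page_count) AS used_pages,
       SUM(ps.row_count)       AS row_cnt
FROM sys.dm_db_partition_stats AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id  = ps.index_id
WHERE ps.object_id = OBJECT_ID('dbo.DailyReport')
GROUP BY i.index_id, i.type_desc;
```

If used_pages stays high while row_cnt drops after the delete, space is not being reclaimed.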
Is space reclaimed? Well, it depends:
For a heap, you end up in the situation where SQL Server does not reclaim storage after deleting data. A table rebuild would be essential here (or converting the heap to a clustered table).
For an index, it depends on how the data is organized compared to what you delete. If you have an index on OrderDate and delete the oldest data, then pages will simply be deallocated from the beginning of the linked list of pages, and all is fine.
But if you also have an index on LastName on the same table (a bad example, I know, but use some imagination 🙂 ), then deleting data based on old OrderDate will spread the removed rows all over that LastName index, leaving unused space in it. Is that bad? Perhaps not, since the new data you add later is likely to have a similar spread of LastName values as the rows you removed, meaning you won’t get page splits.
So, look at the heap as one type of entity, b-trees whose order matches your remove/add pattern as another, and b-trees that don’t match as a third.
And, as has already been said, partitioning can be helpful!