Should Archive Tables have their own Surrogate Identity Key

Posted on

Question :

Should archive tables have their own surrogate identity id/key? For a customer sales table example:

  • We have original table, and then an additional archive table with surrogate archive id.
  • Rules for primary clustered indexes are:
    • Ever Increasing
    • Narrow
    • Unique
    • Static

Reference: Effective Clustered Indexes (Red-Gate Hub)

At the last company I worked for, there was a thought, that if we accidentally missed a day or period of importing, e.g. imported data for May 3, accidentally skipped May 4 (issue with system), and imported May 5, that without the new archive id, we would have to reinsert data between the pages, causing fragmentation, and slower inserts.

With the archive id, we can just add on in an ever-increasing manner.

Just inquiring if adding archiveid column is a standard industry practice.

Original Table:

create table dbo.CustomerSalesId
    CustomerSalesId bigint primary key identity(1,1),
    CustomerId bigint not null,
    PurchaseDate datetime not null,
    Amount decimal (10,2) not null,

Archive Table:

create table dbo.ArchiveCustomerSalesId
    ArchiveCustomerSalesId bigint primary key identity(1,1),
    CustomerSalesId bigint,
    CustomerId bigint not null,
    PurchaseDate datetime not null,
    Amount decimal (10,2) not null,
create unique index ux_CustomerSalesId on ArchiveCustomerSalesId(CustomerSalesId)

Answer :

Fragmentation is not as bad as you think it is. Page fullness (internal fragmentation) is much more important than the physical order of the pages (external fragementation).

Having rows inserted out of order might affect the performance of scans when the pages are read from disk. When reading pages from memory, it is not important whether the pages are in the correct order or not. Also, you are probably not working with a physical server and your disk is likely a LUN sitting on a SAN, so the advantage of page contiguity is negligible.

A good visual description of internal vs. external fragmentation can be found here:

A great article on why fragementation is not what you think can be found on Brent Ozar’s blog:

That said, using an additional surrogate key just for avoiding fragmentation is not a good idea. A good clustered index is, first of all, the key that you use the most to locate the data, so that you don’t have to perform additional lookups. If you alter the clustering key when you archive the data, to locate the rows you would have to use a unique nonclustered index on the column(s) of the previous clustering key. A nonclusterd index is a copy of the data that requires additional space and also requires lookups to extract columns that are not part of the index.

Wrapping up: no, you don’t need an additional “archiveId” surrogate key.

Recently I went through this exercise for the system on which I’m currently working. I chose to retain the current surrogate IDs.

Mostly this was to retain historical references. We have various secondary processes which use the ID as an external reference. This may not be best practice, but it’s the way it is. Internal logging captures the ID, too. For these reasons I’d want to retain the active table’s ID in the archive table’s values. It seemed redundant to include another column which performed the same purpose.

As for fragmentation, this is not an issue. These rows will be used in two circumstances – point lookup (what was row X’s value of column Y?) or broad aggregates (how many Zs did we have in January?). Either way, having a bit of fragmentation will not hurt us.

As and when our volumes grow significantly, and our analytics become more complicated, I may need to re-visit this. Likely I’d use partitions and online rebuilds (SQL Server 2014+).

Leave a Reply

Your email address will not be published. Required fields are marked *