Is there a way to specify an table/index is sorted by more than one column?

Posted on

Question :

Suppose I have a table:

CREATE TABLE [dbo].[TestTable]
(
    [Id] INT NOT NULL PRIMARY KEY CLUSTERED IDENTITY(1,1),
    [Timestamp] DATETIME2(7) NOT NULL DEFAULT (SYSUTCDATETIME()),
    [Data] INT NOT NULL
);

And suppose I need to look up rows by ID (e.g. for joins) and by Timestamp (e.g. for date-range search).

The usual way to do this would be to add an index on Timestamp, but this seems like a massive waste of resources (almost a complete copy of the table) when Timestamp should be in order in the table in the first place. So two questions:

  1. Is there a way to enforce Timestamp being strictly ascending along with Id? (i.e. to make sure pathological cases like changing the server’s system clock, leap seconds, Id‘s Identity-ness being overruled).
  2. Is there a way for SQL Server to make use of the above enforcement and treat the table as indexed on both Id and Timestamp?

I suspect the answer to (1) may be yes, but that (2) is probably a no, rendering (1) a bit useless, but I’m happy to be proven wrong.

Any version of SQL Server acceptable

Answer :

You can create a clustered index on both id and timestamp columns and even define different sorting order for each column if you want to:
create unique clustered index idx on TestTable(id asc, [timestamp] asc)
But it won’t make much sense in this case for these reasons:

  1. The engine does not know that timestamp increases along with id, so if you run a query with condition on timestamp, it’ll still perform a full scan.
  2. All additional columns that are included in clustered index will also be included in all non-clustered indexes.
  3. An older timestamp still could be inserted into the table with newest id.

So you are gaining nothing and only wasting space when you create a new index and slowing down inserts as timestamp order still will have to be maintained.

Creating an index only on timestamp column seems like a waste of resources (2 out of three columns of entire table will be included in an index), but actually it’s very common for size of all indexes be greater than the size of the table. Slow queries are also wasting a lot of resources, so just create it if it’ll boost select queries and if slower inserts will not be a problem.

To finally answer your question: as you can’t implement this logic using indexes and constraints, the only way I see is to create an INSTEAD OF INSERT Trigger. It’ll hurt insert performance, but prior to insert you can check if older timestamp already exists in the table and then modify the timestamp, not insert the data, raise an error – up to you.

You can’t apply two diffrent sort orders (by diffrent column) using only one index. An index is a structure which store a data on a disk in logical order (B-tree approach) based on columns from index declaration.

It is worth pointing out, that a storage isn’t usually problem nowdays and you should focus on CPU, network and engine (IO operations) efforts. And even in this case, you didn’t say that, you just said that it would be waste of resources, but it isn’t. In many cases, use of additional space for indexes gives your CPU, network and SQL Engine relief when quering data. Sometimes it is good solution for deal with concurrency or deadlock issues too.

In your case, if you want to ensure ability to identify unique row, you have to create Id column. You shouldn’t rely on Timestamp in this matter.

Let’s consider following approaches.

Solution 1

  • Create clustered index on Timestamp column.
  • Create Primary Key as unique nonclustered index only on Id column (Timestamp will be included as clustered index key).

You said about range scans with Timestamp column, so better to have Data content in the same index (range means lot of data which should be availabe without lookup). This requires for example clustered index on Timestamp column (which is WHERE predicate). This will be most efficient approach to query data this way.

For other queries which use Id column, engine will use second nonclustered index. But if Data column is required, it will cause Key Lookup (additional effort for CPU, engine, IO).

It’s quite good resolution if your queries mainly use Timestamp as WHERE predicate and require Data column. Still there is some effort when querying by Id column.

In terms of storage, one row have 16 bytes for clustered index and 12 bytes for nonclustered index (28 bytes), so let’s think about something similar.

Solution 2

  • Create Primary Key as unique clustered index only on Id column. Most narrow option.
  • Create nonclustered index on Timestamp column and include Data column (Id will be included as clustered index key).

Best way in terms of query performance would be to create two covering indexes. In this approach there will be no need to lookup when quering data by Id or Timestamp column, just smooth read from appropriate index (seek or scan).

Compared to Solution 1 there is additional space used (4 bytes), but we gain benefits when query data (no key lookups).

Remarks

Always create clustered index firstly, creating it as last one enforce rebouild all nonclustered indexes (replace RID with Clustered Key).

It’s also worth noting, that it will be some impact on insert/update/delete operations in both solutions presented above. But it would be acceptable looking at the structure and row size, and still benefits should compensate it.

And one more thing.

It looks like you want implement time series database in SQL Server.

There are complex solution to achive this like Azure Time Series Insights, InfluxDB database or others similar database engine. If you want to stick with SQL Server, you can look for articels about it.

Leave a Reply

Your email address will not be published. Required fields are marked *