A lot of applications store logging information in a log table.
These tables are special because you will only do inserts. You will never do updates. If you do deletes it will be a nightly job to get rid of old data.
The table has a datetime field which is typically strictly increasing. It is possible that the calling application will lose its connection to the db, and will stack up inserts, and at the point when it gets the connection it can’t guarantee that it will do inserts in strict order. But in general I expect that it will be cheap to sort the table on insert if the datetime field is the clustered index.
Most queries will query on datetime, but these will be inequality queries.
With these special properties it feels like there should be a way to optimize it.
What are the best practices?
An example of a log table could be:
CREATE TABLE logMessages ( logTime datetime2(6) NOT NULL, logSeverity varchar(10) NOT NULL, logStatus varchar(10) NOT NULL, logText varchar(255), processID bigint, processUser varchar(25) )
A typical query:
SELECT logTime, logSeverity, logText FROM logMessages WHERE logTime >= '2020-10-01' AND logTime < '2021-11-01' AND logSeverity IN ('WARNING','ERROR','FATAL')
The datetime field is not unique, do we need a unique identifier?
Option 1: add a
logId BIGINT IDENTITY column
If we add a unique logId, it will not be used as a foreign key in any other tables.
If we use this as a clustered index alone and SQL Server decides that our query will retrieve too many rows for it to use a non-clustered index, it will do a full table scan.
Would it make sense to set the clustered index to (logTime, logId)? The reason I want this is that when the query optimizer will not use a non-clustered index on logTime because it expects too many rows, falling back on a good clustered index will reduce the number of rows to scan.
Option 2: not adding a logId
Set the clustered index to (logTime). Since logTime is not unique SQL Server will have to uniquify it by adding an invisible column. Is this a problem? At least I get a good clustered index that will help most of my queries.
logSeverity is a small int with predefined values (DEBUG,INFO,WARNING,ERROR,FATAL), would it be better to move these to a different table and represent them with a tinyint?
Experts typically say that you can use
WITH (NOLOCK) if you don’t need accurate data. Are there any situation where you can use it and still get accurate data? In this case we will not do UPDATE or DELETE, only INSERT. The inserts will also typically be at the end of the table (the datetime field will be increasing). A query that will not look at the last few seconds should not be affected by an insert.
To sum it up:
- What should we do about unique key?
- How should our clustered index be?
- Should we use lookup tables?
- Can we use WITH (NOLOCK)?
- Any other best practices?
To directly answer each of your questions:
Q: What should we do about unique key?
Q: How should our clustered index be?
A: In most cases it’s important to have a
primary key so you can guarantee uniqueness (including this case). Only in a staging table it could possibly be an exception to not use one. It may even serve useful for adhoc debugging later on and is cheap to use an
BIGINT auto-increment field as one. But you can (and should) still make your logTime column the
clustered index on the table for performance reasons, since it’s what will be used as a predicate in a lot of your queries, and the uniqifier step SQL Server does is super negligible from a performance perspective. If you do add a LogId as the primary key, you can add a
nonclustered index to it. That way if you need to do any adhoc maintenance (e.g. deleting / updating specific records in the table, you can efficiently seek them by the primary key).
Q: Should we use lookup tables?
A: In this case I would recommend using a lookup / “enums” table. Not so much from a space saving or performance perspective, but mainly because if one of those values change you can very efficiently maintain it by only updating the lookup table as opposed to having to update every record referencing that value in the main table.
Q: Can we use WITH (NOLOCK)?
A: It is not possible for you to use the WITH (NO LOCK) query hint here and be guaranteed to always have 100% accurate data, even if you’re selecting from a different set of rows than the ones being
UPDATED. I’ve maintained queries in the past that do use it and very rarely have had an issue, but my personal choice (unless data accuracy is not a concern) is to avoid it. There are much better performance improvement alternatives for query tuning. By the way, I once thought the same way about the NOLOCK hint and you can see Microsoft’s David Browne’s answer to my question for a conclusive reason why the data accuracy isn’t guaranteed. In short, the rows you’re selecting against may be part of a changing data structure because of the other further away rows that are being
UPDATED also being part of that same data structure, e.g. B-Tree for a particular
index both rows are a part of.
Q: Any other best practices?
A: Maybe consider other possible ways logs might need to be queried, e.g. all logs by a specific user, specific process, or both (even irrespective of when they occurred)? If so, adding a couple extra
nonclustered indexes on those fields could help you later on too. E.g.
nonclustered index on processUser, processID and then a separate
nonclustered index just on processID (the first
index would be covering for predicates that only use processUser, or use processUser and processID which is why you don’t need a third
nonclustered index that’s just on processUser here).