Question :
In a project I’m working on, every change to the rows in some tables of the database must be tracked for further audit or rollback. It must be easy to find who modified the row, from which IP address and when, and be able to restore previous version.
The similar thing is used for example by Stack Exchange. When I change someone else’s question, it is possible to find that I changed it, and to rollback the changes.
What is the general technique used to store every change to an object in a database, given that my current schema has mostly the same properties (below) that an average business app?
- The objects have a relatively small size: there might be some
nvarchar(1000)
for example, but not huge blobs of binary data, this one being stored directly on disk, and accessed directly, and not through Microsoft SQLfilestream
, - The database load is pretty low and the whole database is handled by one virtual machine on a server,
- The access to the previous versions doesn’t have to be as fast as the access to the latest version, but still has to be up-to-date¹ and not too slow².
<tl-dr>
I thought about the following cases, but I have no real experience with those sorts of scenarios, so I would hear other’s opinions:
-
Store everything in the same table, distinguishing the rows by ID and version. IMO, it is seriously stupid, and will hurt soon or later on performance level. With this approach, it is also impossible to set a different security level to the latest items and to the versions trace. Finally, every query would be more complicated to write. Actually, to access the up-to-date data, I would be forced to group everything by ID and retrieve, in each group, the last version.
-
Store the latest version in one table, and, at every change, copy the obsolete version to another table in another schema. The flaw is that every time, we store every value, even if it didn’t change. Setting unchanged values to
null
is not a solution, since I must also track when the value is changed tonull
or fromnull
. -
Store the latest version in one table, and the list of changed properties with their previous values in another table. This seems having two flaws: the most important one is that the only way to sort heterogeneous types of previous values in the same column is to have a
binary(max)
. The second one is that it would be, I believe, more difficult to use such structure when displaying the previous versions to the user. -
Do the same thing as in two previous points, but store the versions in a separate database. Performance-wise, it might be interesting in order to avoid slowing down the access to the latest versions by having the previous versions in the same database; still, I believe that it’s a premature optimization and must be done only if there is a proof that having older and latest versions in the same database is a bottleneck.
</tl-dr>
¹ For example, it would be unacceptable to store the changes into a log file, as it is done for HTTP logs, and flush the data from the log to the database at night when the server load is lowest. The information about different versions must be available immediately or nearly immediately; a few seconds delay is acceptable.
² The information is not accessed very frequently and only by specific group of users, but still, it would be unacceptable to force them to wait for 30 seconds for the list of versions to display. Again, a few seconds delay is acceptable.
Answer :
The normal way do do audit logging of this sort is to have a shadow table and log changes with triggers on the base table you are auditing. The other tables can be placed on a different physical disk if you need to to that for performance, and you can put indexes on them if you need to support quick retrieval of the data.
The tables will have roughly the same structure as your original tables, but will have a datetime column for when the change took place and a marker for whether the row was inserted, changed or deleted. Sequencing the versions can be done by the time stamp.
The change date can be done by making the datetime column not null with a default of getdate(); an audit user column will capture the user with a not null column defaulted to Suser_Sname(). Assuming the actual user is being impersonated in the session this will capture the identity of the user making the change.
The database has no way to be aware of the IP address connecting to a web server. The application will have to explicitly capture and log the IP address with the transaction.
If you have a large number of tables you wish to audit, you can use the metadata from the system data dictionary to generate the triggers programmatically.
This solution is by far the best for several reasons:
-
It captures any changes to the table, not just those made by the application.
-
The audit tables can be put on a different set of disks to reduce the I/O load on your primary tables.
-
You can use a view based on a union of the table and audit log table to that will show the entire history including the current version.
-
You can index the audit log tables as necessary so that the audit users can query them responsively. As usual, index selection is a tradeoff between query performance and update overhead.
I know of many CMS systems (including WordPress) that use a single table to store all versions of the data. But then again, they only have to do this for the table that has the blog posts. See the WordPress database structure.
Also, the number of records and the number of revisions each row goes through will play a significant role in your decision.
About CMS versioning; for drupal it makes a special table for every field of the entity that stores the old value ; such a concept alllows you a fine manipulation of your data but i think it’s expensive, my own solution is to convert my object to xml format and store it as string with the other fields( changetime, id…)