Reasons for avoiding large ID values

Question:

We are working on a web application that is not yet accessible to users. My boss noticed that newly created records get an ID of over 10 000, even though we only have under 100 records in the table. She assumed that the web interface for some reason creates over 100 times more temporary records than actual ones (and deletes them), and that this could lead to us running out of range within a few months of release.

I don’t think she is correct about the cause of the ID inflation (the colleague who can answer this is on vacation, so we don’t know for sure), but let’s assume that she is. She said that she’d hate to use a bigint column, and that she’d like us to stop autoincrementing the ID column and write server-side code which chooses the first “unused” integer and uses it as an ID.

I am a computer science grad student with little practical experience, filling a junior developer role. She has years of experience of managing all of our organisation’s databases, and designing most of them. I think that she’s incorrect in this case, that a bigint ID is nothing to be afraid of, and that mimicking the DBMS functionality smells of an antipattern. But I don’t trust my judgment yet.

What are the arguments for and against each position? What bad things can happen if we use a bigint, and what are the dangers of reinventing the autoincrementing wheel ourselves? Is there a third solution which is better than either one? What could her reasons be for wanting to avoid an inflation of ID face values? I’m interested in hearing about pragmatic reasons too – maybe bigint IDs work in theory, but cause headaches in practice?

The application is not expected to handle very large amounts of data. I doubt that it will reach 10 000 actual records within the next few years.

If it makes any difference, we are using Microsoft SQL Server. The application is written in C# and uses LINQ to SQL.

Update

Thank you, I found the existing answers and comments interesting. But I’m afraid you misunderstood my question, so they don’t contain what I wanted to know.

I’m not really concerned about the real reason for the high IDs. If we can’t find it on our own, I could ask a different question. What I’m interested in is to understand the decision process in this case. For this, please assume that the application will be writing 1000 records per day, then deleting 9999 of them. I’m almost sure this is not the case, but this is what my boss believed when she made her request. So, under these hypothetical circumstances, what would be the pros and cons of either using bigint or writing our own code which will assign IDs (in a way which reuses the IDs of already deleted records, to ensure there are no gaps)?

As for the actual reason, I strongly suspect that this is because we once wrote code to import data from another database, as a proof of concept that a later migration can be done to a certain extent. I think my colleague actually created several thousand records during the import and later deleted them. I have to confirm whether this was actually the case, but if it is, there is not even a need for action.

Answer:

Without seeing code, it is pretty hard to say conclusively what is happening. Most likely, though, the IDENTITY value is being cached, causing gaps in the value after SQL Server is restarted. See https://stackoverflow.com/questions/17587094/identity-column-value-suddenly-jumps-to-1001-in-sql-server for some good answers and info about that.

A simple INT field can hold values up to 2,147,483,647. You can actually start the identity value at -2,147,483,648, giving a full 32 bits of values: roughly 4 billion distinct values. I doubt very much you’re going to run out of values to use. Assuming your application is consuming 1,000 values for each actual row added, you’d need to be creating nearly 12,000 rows per day, every day, to run out of IDs in 6 months, assuming you started the IDENTITY value at 0 and were using an INT. If you were using a BIGINT, you would have to wait 21 million centuries before you ran out of values if you wrote 12,000 rows per day, consuming 1,000 “values” per row.
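
As a quick back-of-the-envelope check of those figures (using the same assumptions as above: 1,000 identity values consumed per real row, roughly 180 days in six months), the following T-SQL reproduces the arithmetic:

SELECT
    2147483647 / 1000 / 180 AS IntRowsPerDayToExhaustInSixMonths,                       -- roughly 11,930
    9223372036854775807 / (12000 * 1000) / 365 / 100 AS BigIntCenturiesAt12kRowsPerDay; -- roughly 21 million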

Having said all that, if you wanted to use BIGINT as the identity field data type, there is certainly nothing wrong with that. That’ll give you, for all intents and purposes, a limitless supply of values to use. The performance difference between an INT and a BIGINT is practically non-existent on modern 64-bit hardware, and either is highly preferable to, for instance, using NEWID() to generate GUIDs.

If you wanted to manage your own values for the ID column, you could create a key table, and provide a pretty bulletproof way of doing that using one of the methods shown in the answers on this question: Handling concurrent access to a key table without deadlocks in SQL Server
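
As a rough sketch of the key-table idea (the table name dbo.IdKeys and the entry N'Widget' are made up for illustration; the linked question shows hardened variants), the pattern boils down to a single atomic UPDATE that increments and reads the counter in one statement:

CREATE TABLE dbo.IdKeys
(
    TableName SYSNAME NOT NULL PRIMARY KEY,
    LastId    INT     NOT NULL
);

INSERT INTO dbo.IdKeys (TableName, LastId) VALUES (N'Widget', 0);
GO

DECLARE @NextId INT;

-- One atomic statement: increments the counter and captures the new value under a
-- single lock, so two concurrent callers cannot be handed the same ID.
UPDATE dbo.IdKeys
SET @NextId = LastId = LastId + 1
WHERE TableName = N'Widget';

SELECT @NextId AS NextId;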

The other option, assuming you’re using SQL Server 2012+, would be to use a SEQUENCE object to get ID values for the column. However, you’d need to configure the sequence to not cache values. For example:

CREATE SEQUENCE dbo.MySequence AS INT START WITH -2147483648 INCREMENT BY 1 NO CACHE;
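
Values from the sequence can then be fetched explicitly with NEXT VALUE FOR, or bound as the ID column’s default. For example (the dbo.SequenceDemo table here is purely hypothetical):

CREATE TABLE dbo.SequenceDemo
(
    Id   INT NOT NULL
        CONSTRAINT DF_SequenceDemo_Id DEFAULT (NEXT VALUE FOR dbo.MySequence)
        CONSTRAINT PK_SequenceDemo PRIMARY KEY,
    Name NVARCHAR(50) NOT NULL
);

-- The ID is filled in by the sequence-backed default:
INSERT INTO dbo.SequenceDemo (Name) VALUES (N'first row');

-- Or fetch a value explicitly:
SELECT NEXT VALUE FOR dbo.MySequence AS NextId;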

In answer to your boss’ negative perception of “high” numbers, I would say: what difference does it make? Assuming you use an INT field with an IDENTITY, you could in fact start the IDENTITY at 2147483647 and “increment” the value by -1. This would make absolutely no difference to the memory consumption, performance, or disk space used, since a 32-bit number is 4 bytes no matter whether it is 0 or 2147483647. 0 in binary is 00000000000000000000000000000000 when stored in a 32-bit signed INT field. 2147483647 is 01111111111111111111111111111111 – both numbers take precisely the same amount of space, both in memory and on disk, and both require precisely the same amount of CPU operations to process. It is far more important to get your application code designed correctly than to obsess about the actual number stored in a key field.
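
You can see this directly with DATALENGTH: a tiny value and the largest possible INT value occupy exactly the same four bytes.

SELECT
    DATALENGTH(CAST(0 AS INT))          AS BytesForZero,    -- 4
    DATALENGTH(CAST(2147483647 AS INT)) AS BytesForMaxInt;  -- 4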

You asked about the pros and cons of either (a) using a larger-capacity ID column, such as a BIGINT, or (b) rolling your own solution to prevent ID gaps. To answer these concerns:

  1. BIGINT instead of INT as the data type for the column in question. Using a BIGINT requires double the amount of storage, both on-disk and in-memory, for the column itself. If the column is the primary key index for the table involved, each and every non-clustered index attached to the table will also store the BIGINT value, at twice the size of an INT, again both in-memory and on-disk. SQL Server stores data on disk in 8KB pages, where the number of “rows” per “page” depends on the “width” of each row. So, for instance, if you have a table with 10 columns, each one an INT, you’d be able to store approximately 160 rows per page. If those columns were instead BIGINT columns, you’d only be able to store 80 rows per page. For a table with a very large number of rows, this clearly means the I/O required to read and write the table will be doubled in this example for any given number of rows. Granted, this is a pretty extreme example – if you had a row consisting of a single INT or BIGINT column and a single NCHAR(4000) column, you’d be (simplistically) getting a single row per page, whether you used an INT or a BIGINT. In this scenario, it would not make much appreciable difference.

  2. Rolling your own scheme to prevent gaps in the ID column. You’d need to write your code in such a way that determining the “next” ID value to use does not conflict with other actions happening to the table. Something along the lines of SELECT TOP(1) [ID] FROM [schema].[table] ORDER BY [ID] DESC naively comes to mind. What if there are multiple actors attempting to write new rows to the table simultaneously? Two actors could easily obtain the same value, resulting in a write conflict. Getting around this problem requires serializing access to the table, reducing performance (a minimal sketch of what that looks like follows below). There have been many articles written about this problem; I’ll leave it to the reader to perform a search on that topic.
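
To make point 2 concrete, here is a deliberately simplified sketch against a hypothetical dbo.GapDemo table. The UPDLOCK and HOLDLOCK hints are what make the gap search safe, and they are also exactly what serializes writers and hurts concurrency:

CREATE TABLE dbo.GapDemo
(
    Id   INT          NOT NULL PRIMARY KEY,
    Name NVARCHAR(50) NOT NULL
);
GO

BEGIN TRANSACTION;

DECLARE @NextId INT;

-- Find the lowest unused ID. The range locks held until COMMIT stop two sessions
-- from picking the same gap, but they also block all other inserts in the meantime.
SELECT @NextId = MIN(c.CandidateId)
FROM (SELECT 1 AS CandidateId
      UNION ALL
      SELECT g.Id + 1
      FROM dbo.GapDemo AS g WITH (UPDLOCK, HOLDLOCK)) AS c
WHERE NOT EXISTS (SELECT 1 FROM dbo.GapDemo AS x WHERE x.Id = c.CandidateId);

INSERT INTO dbo.GapDemo (Id, Name) VALUES (@NextId, N'example row');

COMMIT TRANSACTION;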

The conclusion here is: you need to understand your requirements and properly estimate both the number of rows and the row width, along with the concurrency requirements of your application. As usual, It Depends™.

The main task is to find the root cause of why the current value is that high.

The most reasonable explanation for SQL Server versions prior to SQL2012, assuming you’re talking about a test database, would be that there was a load test followed by a cleanup.

Starting with SQL2012, the most probable reason is several restarts of the SQL Engine (as explained in the first link Max provided).

If the gap is caused by a test scenario, there’s no reason to worry from my point of view. But to be on the safe side I’d check the identity values during normal use of the application as well as before and after an engine restart.

It’s “funny” that MS states that both alternatives (either trace flag 272 or the new SEQUENCE object) might impact performance.

It might be the best solution to use BIGINT instead of INT, just to be on the safe side and cover MS’s next “improvements”…

Rumtscho,
If you’re only creating 1000 rows per day, there is little to decide: use the INT data type with an IDENTITY field and be done with it. Simple math says that if you give your app a 30-year lifecycle (unlikely), you could have nearly 200,000 rows per day and still be within the positive number range of an INT data type.
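
A quick check of that arithmetic (the positive INT range spread over 30 years of daily inserts):

SELECT 2147483647 / (365 * 30) AS MaxRowsPerDayOver30Years;  -- roughly 196,000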

Using BIGINT is overkill in your case. It can also cause issues if your app or data will be accessed via ODBC (such as being brought into Excel or MS Access, etc.); BIGINT doesn’t translate well over most ODBC drivers to desktop apps.

As for GUIDs, aside from the extra disk space and the extra I/O, there is the huge problem that they are by design not sequential, so if they are part of a sorted index, every insert lands at a random position and causes page splits and fragmentation.
–Jim

Is there a gap between the used values, or is the starting value 10 000 and everything from then on increments by 1?
Sometimes, if the number is going to be shown to customers, the initial number is set greater than zero, say 1500 for example, so the customer does not realize the system is “new”.

The drawback of using bigint instead of smallint is that, because bigint uses more disk space, each disk read covers fewer rows per block. If your rows are narrow this can be a drawback; if not, it does not matter much. It also does not matter much if you are not querying a lot of rows at once and you have the proper indexes.

And as said in another response, if you are worried about running out of IDs, you should not be: even smallint can handle it unless you have a business with millions of records. Inventing a mechanism to “recover” IDs is expensive and adds failure points and complexity to the software.

Regards

If I were your boss, I would be most interested in the reasons for the unexpectedly high ID values… the way I see it, for each of the two scenarios you outlined:

  1. IF prior testing has bumped up identity values – then your other comments about the expected number of records would also push me to suggest a smaller key type. Frankly, I’d also consider whether it was possible to reset the sequence and renumber existing records if the test was out of character for the current intended use of the table (most would consider this overkill – ‘it depends’).

  2. IF the majority of records written to the table are deleted soon after, I would be inclined to consider using two tables instead: a temporary table where records are not kept long-term, and another where only the records we keep permanently are stored (sketched just below). Again, your expectations for the number of long-term records suggest to me the use of a smaller type for your key column, and a few records per day will hardly cause you a performance issue when you ‘move’ a record from one table to another similar one. I suspect that this is not your scenario, but imagine that a shopping website may prefer to maintain Basket / BasketItem tables, and when an order is actually placed the data is moved into an Order / OrderItem set.
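
A minimal sketch of that two-table idea, using hypothetical dbo.BasketItem and dbo.OrderItem tables: only rows the user actually commits ever consume an ID in the permanent table.

CREATE TABLE dbo.BasketItem
(
    BasketItemId INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    ProductName  NVARCHAR(50)       NOT NULL
);

CREATE TABLE dbo.OrderItem
(
    OrderItemId INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    ProductName NVARCHAR(50)       NOT NULL
);
GO

-- When the order is actually placed, "move" the row: copy it into the permanent
-- table and delete the short-lived original in one transaction.
BEGIN TRANSACTION;

INSERT INTO dbo.OrderItem (ProductName)
SELECT b.ProductName
FROM dbo.BasketItem AS b
WHERE b.BasketItemId = 1;

DELETE FROM dbo.BasketItem
WHERE BasketItemId = 1;

COMMIT TRANSACTION;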

To summarise: in my opinion BIGINTs are not necessarily to be feared, but are frankly needlessly large for many scenarios. If the table never gets large, you will never realise that your choice of type was overkill… but when you have tables with millions of rows and many FK columns that are BIGINT when they could have been smaller – then you may wish the types had been selected more conservatively (consider not only the key columns, but all the foreign key columns, all the backups you keep, and so on!). Disk space is not always cheap (consider SAN disk at managed locations – i.e. the disk space is rented).

In essence I am arguing for careful consideration of your selection of data type always, rather than sometimes. You won’t always predict usage patterns correctly, but I think you will make better decisions as a rule than by always assuming that ‘bigger is better’. In general I select the smallest type that can contain the required and reasonable value range, and I will happily consider INT, SMALLINT and even TINYINT if I think the value is likely to fit in that type for the foreseeable future. The smaller types are unlikely to be used with IDENTITY columns, however, but may happily be used with lookup tables where key values are set manually.

Finally, the technologies people use can influence their expectations and answers considerably. Some tools are more likely to cause gaps in ranges e.g. by pre-booking ranges of identities per process. In contrast @DocSalvager suggests a thorough auditable sequence which seems to reflect your boss’ viewpoint; I personally have never required quite that level of authority – though the general rule that identities are sequential and generally without gaps has often been incredibly useful to me in support situations and problem analysis.

what would be the pros and cons of either using bigint or writing our own code which will assign IDs (in a way which reuses the IDs of already deleted records, to ensure there are no gaps)?

Using bigint as an identity and living with the gaps:

  • it’s all in-built functionality
  • you can be sure it will work out-of-the-box
  • it will waste some space, since an INT would still give you about 2 million days’ worth of data; more pages will have to be read and written; indexes may become deeper. (At these volumes this is not a significant concern, however.)
  • a surrogate key column is meant to be meaningless so gaps are OK. If it’s shown to the users and gaps are interpreted as significant then you’re doing it wrong.

Roll your own:

  • your development team will be doing all the development and bug fixing work forever.
  • do you just want to fill gaps at the tail or in the middle, too? Design decisions to argue over.
  • every write will have to issue strong locks to prevent concurrent processes acquiring the same new ID, or resolve conflicts post facto.
  • worst case you will have to update every row in the table to close the gaps if rowid=1 is deleted. This will hammer concurrency and performance, what with all the cascading foreign key updates etc.
  • lazy or eager gap-filling? What happens to concurrency while this is happening?
  • you will have to read for the new ID before any write = additional load.
  • an index will be needed on the id column for efficient gap finding.
