Why use both universal identifiers and separately incremented primary keys for strong entities?

Posted on

Question :

There is a good public example for this: The relational model of ChEMBL (A bioactive molecule database). Here all strong entities (such as molecules) have a database-wide unique identifier which also serves as an universal identifiers: ChEMBL ID (such as CHEMBL25 for ASPIRIN). These strong entities also have their own incremental bigint primary keys (Such as 1280 for the same Aspirin entity)

The model has the chembl_id_lookup which holds these two identifiers, and the additional high level metadata for these strong entities. The strong entity relations also stores both of these identifiers.

However isn’t this a bit redundant:

Why don’t they just use the numeric component of ChEMBL ID as strong entity primary keys. For example CHEMBL25 can be transformed into 25 as a strong entity primary key? It would be still unique. They already used bigint, which surely can hold all their entities. Do their method have any performance, integrity or clarity benefits over the proposed alternative?

Additional notes:

Their latest database dumps and schema diagram can be downloaded from here.

Update 1

I ask this because I develop relational data model from scratch for biomedical research which is bit wider scope than ChEMBL’s. Still ChEMBL is close enough to be an one of my inspirations. So I am not planning to change ChEMBL’s schema.

I plan to narrow db interactions to calling stored procedures. So universal identifier —-> surrogate bigint key transformation will be transparent for users and applications

Update 2

The proposed alternative does not change the fact that integers are used as surrogate primary keys, but instead of generating them for each tables as identities, a globally unique integer sequence would be generated for all strong entities. And if you concatenate this number with your prefix then it will be the universal identifier.

This kind of handling of identifiers is used for identify genes in the Enterez platform of the National Center of Biological Information. For example the gene CAN1 has the GeneID:856646 for its universal identifier ready to be referenced by external systems and its numeric part 856646 is it unique identifier (UID) inside their database. However I do not know what kind of database is used there.

Answer :

Why don’t they just use the numeric component of ChEMBL ID as strong entity primary keys. For example CHEMBL25 can be transformed into 25 as a strong entity primary key?

It could but then, any time that you wanted to show that value to a User, or send it to some other, ChEMBL-aware application, you’d have to add the prefix back on again and any time you read in such a value from “elsewhere”, you’d have to tear it apart and get just the numeric part. Yes, it’s do-able but each point of transformation runs the risk of somebody “missing” one and sending out short values or doubling-up the prefix.

And then there’s that dreadful day when they decide to change to some new, “standard” format (obligatory XKCD reference). Then you’d have to find a way of mapping between your truncated values and the new ones, instead of just updating the externally-visible values, old to new.

The big advantage of the bigint Surrogate key is its size – only 8[-ish] bytes in every table/index instead of the length of each CHEMBL ID.

I’d recommend using the full identifier for use “outside” the database, but tie records together using the internal (never seen “outside”), numeric, surrogate key.

Significant reason for separate integer ID is performance.

Index searches used by grouping, ordering an joining are based on comparisons. Integers are compared in a single operation while character-based values are compared in a cycle, char by char. Multibyte charsets need more than one iteration for each char. Say two strings in UTF8MB4 encoding each 8 chars long requires ~12 comparisons in average to determine which string lexigraphically precedes another.

Another reason is the disk footprint of the index. Index for CHAR(255) utf8mb4 is way bigger than for BIGINT.

All that doesn’t matter for relatively small databases but it’s a common pattern to have a dedicated integer ID instead of naturally unique non-integer candidate key.

Leave a Reply

Your email address will not be published. Required fields are marked *