Automatic deduplication (normalization) of strings in MySQL

Posted on

Question :

The question is whether MySQL can use separate physical storage for strings transparently, but with the same operations that are applicable to varchar, including %.

I have a large number of duplicated strings in my MySQL database: for example, many customers have the same first name. To save space, especially in indexes, and thus speedup the queries, I would prefer to physically store only an ID and store the actual string elsewhere: instead of varchar(30)s

| Mary | Smith  | New York |
| Mary | Cooper | Chicago  |
| John | Cooper | New York |

I want to only store ints

| 1 | 2 | 3 |
| 1 | 4 | 5 |
| 6 | 4 | 3 |

where the IDs refer to the strings:

| 1 | Mary     |
| 2 | Smith    |
| 3 | New York |

I can do it manually and use joins to map between strings and IDs in each query, but my question is whether there is any property of a field to tell MySQL to handle this transparently for me, so that in the queries I operate with the table as if it contained strings but the underlying physical storage for strings would be separate?

In my case the DB is rarely (if any) updated but frequently read, the queries being rather complex and slow, so I need to optimize for select, not for update.

Answer :

MySQL has no such feature.

It is left to the user to “normalize” the data either for avoiding having to update multiple spots and/or for saving space.

In your example, it is generally not practical to do such with first/last name. But it may be advisable for “locations”.

It is not practical to dedup names (for example) because the payoff is poor. With “Mary” sitting somewhere else, the code has to implicitly (as you hypothecate) or explicitly (via a JOIN) reach for the string. In large datasets, this is likely to cause an extra disk hit — costly. Also, “Mary” is 4 characters; the first_name_id might be a MEDIUMINT UNSIGNED, which is 3 bytes — not much savings. For bigger strings that repeat a lot (company names), the tradeoffs might be better.

The main purpose of “normalization” is to put things in a single location. A company name should be spelled out only once in a system that talks about companies. In its place an id would be used — possibly an integer, possibly a ticker symbol (as in a Stock database). When the company changes its name, only one spot needs to be changed. (If the ticker changes, as happened with AOL, then the code is messy.)

I have not used it myself, but it seems like MySQL offers compression which would act like deduplication:

Page-compression seems to be very easy use, but it might have some problems:

Table-compression seems to be more complicated to use:

Leave a Reply

Your email address will not be published. Required fields are marked *