MySQL – Server collation utf8_unicode_ci vs table collation utf8_bin: compatibility and performance

Posted on

Question :

I’m not sure whether this belongs to SE DBA, but it sounds more logical …

Our MySQL 5.5.32 on Ubuntu 12.04 64 uses utf8_unicode_ci for the Server collation and some tables, like phpBB3, uses utf8_bin for all of their tables. This has been since start.

I read utf8_bin is (at least in theory) faster since no conversion and/or normalization is done, but are these quite different UTF8 thingies fully compatible ? If they are, is there performance hit or improvement ? The tables I’m planning switching from utf8_unicode_ci to utf8_bin contain integers and dates, so not having to bother with much larger UTF8 should drop at least 1ms when dealing with these large ASCII-only tables.

Before I had all tables not already using UTF8 Unicode Ci or UTF8 Binary I was using ASCII Binary for these tables, but decided to switch to UTF8 Unicode Ci for full compatiblity with Server collation (and of course a bit of performance improvement).

Answer :

By using “UTF8 thingies” I think that the problem here would be concepts. Let’s review them:

  • Encoding (badly called character set in MySQL): utf-8 (utf8 and utf8mb4), utf-16 (utf16), 7-bit ASCII (ascii)… Those are the actual formats in which the characters are stored, transmitted or converted. They can use (in MySQL, from 1 byte to 4, and some are dynamic, while others are fixed in size)
  • Collation: these are the default algorithms that, when nothing is explicitly said, MySQL will use to compare 2 strings. the _bin one basically compare bit-to-bit, while the _ci and _cs will compare for a particular language, in a case-insensitive or case-sensitive way.

Now we have presented the concepts, let’s apply them: First, you should use the most appropriate format for your application. If your application needs to store international names, you definitely want to use a multi-byte encoding. And given that even the most local business are globalised something like utf8 is practically a must. However, there are some cases in which has a lot of sense to use different encodings. Storing something like a car plate, may only require ascii, which not only is cheaper in size, but would prevent from other strange characters to get in. Other examples include Wikipedia, a high-traffic site that has to support perfectly tons of different languages. They decided (I suppose partly to overcome MySQL limitation, partly for efficiency) to store all the pages without encoding, in pure blobs.

Regarding efficiency: having the same encoding in all the tiers of your application is usually the best policy: you avoid unnecessary conversions back and forth. Smaller and fixed size encodings are supposed to be faster but never compromise functionality. Sometimes the bottleneck on encodings can be a slow conversion, sometimes on storing many zeros. Compression can always help with the last one, specially with text.

And now about collation: basically use the one you will need by default- it will affect the string comparisons and order of the sorts. Different collations do not change the bits for each row, but it (may) change the order between them (is ‘a’ before ‘A’ or are they equivalent?). Although you can force a different collation for particular queries manually:

mysql> SELECT mycolumn FROM mytable WHERE mycolumn LIKE 'a%' COLLATE utf8_bin;
mysql> SELECT mycolumn FROM mytable ORDER BY mycolumn COLLATE utf8_bin;

Why you may have seem many people saying that the _bin version may be faster? Well, because it only has to compare bits, and not have into account the particularities of each language. If you are not going to ever do comparisons in which a == A and even a == A == á == ä, use the binary collation. But in those cases, you probably do faster by not storing text in char/varchar/text, but in the binary equivalents: varbinary/binary/blob. Of course, if you want to get the advantages of storing characters and not bytes, like getting those comparisons done automatically done for you, use utf8_general_ci or utf8_unicode_ci, which will work for most languages well.

Given that most of your data is ASCII, the size in utf8 shouldn’t have changed much. Check your performance- I have suffered some weird utf8 bugs, but usually this shouldn’t be a common problem unless having to deal with lots of 4-byte utf-8 characters. Given that most of your data is ascii-like, and that cpu is usually not the bottleneck (and IO will probably won’t change), you may not see any change, so use the combination that better fits your needs. But test it before and after to see the change.

Remember that there are several configurations on mysql: the character set for the connections, the one for the table storage and the one in which they are sent back to the client. You can check both the encoding and the collations of a particular session by doing:

mysql> show session variables like '%character_set_%';
mysql> show session variables like '%collation_%';

Some of them are not that important, like the internal format of the identifiers.

Finally, remember that the configuration of the charset and collation for databases and tables are defaults, and that changing those don’t modify the actual column encoding. Also, be careful when ALTERing the tables, as you block writes to that table.

Leave a Reply

Your email address will not be published. Required fields are marked *