The charset defines how the row data is encoded when it is saved into the file store.
Row 17: <CHARSET('utf8', 'hello'), 17, CHARSET('latin-1', 'sup')>
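To make the storage effect concrete, here is a small MySQL sketch (charset names are just examples) showing that the character set determines the actual bytes written, via `HEX()`:

```sql
-- The same character is stored as different bytes under different
-- character sets: 'é' is one byte in latin1, two bytes in utf8mb4.
SELECT HEX(CONVERT('é' USING latin1))  AS latin1_bytes,   -- 'E9'
       HEX(CONVERT('é' USING utf8mb4)) AS utf8mb4_bytes;  -- 'C3A9'
```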
But it seems that collation only affects sorting of these rows.
So why is the collation specified when you are defining the column?
Instead, I would expect that you would only specify the collation when defining things where sorting and comparison matter, i.e. indexes (because they are stored sorted), queries (because they test equality and inequality), and constraints (because they are effectively queries).
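For what it's worth, MySQL does allow exactly that kind of per-operation override with the `COLLATE` clause; the column's collation is only the default. A sketch (table and column names are hypothetical, assuming a utf8mb4 column):

```sql
-- Override the column's default collation for this one query only.
SELECT name
FROM   users
ORDER  BY name COLLATE utf8mb4_bin;          -- binary, case-sensitive order

-- Likewise for a comparison in a WHERE clause.
SELECT COUNT(*)
FROM   users
WHERE  name = 'alice' COLLATE utf8mb4_bin;   -- case-sensitive match
```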
Unless MySQL / MariaDB does something that I’m unaware of (I typically focus on SQL Server): because encoding (i.e. “character set”) is handled separately, column-level collation should simply be a default used for the operations you mentioned (creating indexes, queries, etc.) and for comparisons, so that you don’t need to specify the collation each and every time. Strings cannot be compared or sorted unless the specific rules for doing so are provided. Having to supply those rules in every query would not only greatly complicate queries, it would make it nearly impossible to keep queries consistent: without a declaration at the column level (or at least at the DB level), query developers would have to remember the intention of the data modeler, and that’s certainly not a recipe for success ;-).
If not specified at the column level, collation needs to be specified at least at the database level or even the instance level (and I believe there are RDBMSs that do specify it at one of those levels, and maybe don’t even allow specifying it at the column level).
Collation just needs to be specified somewhere, else you can’t even rebuild an index (let alone create one) without providing the collation. And allowing the granularity of specifying collation at the column level allows for maximum flexibility, as different string columns within a table, or between tables in the same database, can easily have different needs / requirements.
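That defaulting chain is visible in MySQL, where character set and collation can be declared at the server, database, table, or column level, with each level acting as the default for the next. A sketch (database, table, and column names are illustrative):

```sql
-- Database-level default: columns inherit this unless they override it.
CREATE DATABASE appdb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

USE appdb;

CREATE TABLE t (
  col_a VARCHAR(10),                       -- inherits utf8mb4_unicode_ci
  col_b VARCHAR(10) COLLATE utf8mb4_bin    -- column-level override
);
```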
And, just as a side note: for RDBMSs such as SQL Server that combine encoding along with collation in a singular “Collation” (with no way to separate the two), it is a necessity to specify it at the column level, as it indicates the encoding (at least for the non-Unicode types CHAR, VARCHAR, and TEXT; NCHAR, NVARCHAR, and NTEXT are always UTF-16 LE).
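In SQL Server terms, that looks roughly like this (a sketch; the `_UTF8` collations require SQL Server 2019 or later):

```sql
CREATE TABLE dbo.Demo
(
    -- For VARCHAR, the collation determines the encoding:
    -- a _UTF8 collation stores this column as UTF-8 ...
    Col8  VARCHAR(50)  COLLATE Latin1_General_100_CI_AS_SC_UTF8,
    -- ... while a non-UTF8 collation stores it in that collation's code page.
    ColCp VARCHAR(50)  COLLATE Latin1_General_100_CI_AS,
    -- NVARCHAR is always UTF-16 LE; its collation only affects comparison rules.
    ColN  NVARCHAR(50) COLLATE Latin1_General_100_CI_AS
);
```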
(OP) But it seems that collation only affects sorting of these rows.
That’s not true.
```sql
CREATE TABLE test1 (txt VARCHAR(4) COLLATE 'latin1_general_ci')
SELECT 'Test' txt UNION ALL SELECT 'TEST' UNION ALL SELECT 'tEsT';

CREATE TABLE test2 (txt VARCHAR(4) COLLATE 'latin1_general_cs')
SELECT 'Test' txt UNION ALL SELECT 'TEST' UNION ALL SELECT 'tEsT';

SELECT COUNT(DISTINCT txt) FROM test1;
```

| COUNT(DISTINCT txt) |
| ------------------: |
|                   1 |

```sql
SELECT COUNT(DISTINCT txt) FROM test2;
```

| COUNT(DISTINCT txt) |
| ------------------: |
|                   3 |
The same can be done by specifying the collation in the query itself:

```sql
SELECT COUNT(DISTINCT txt COLLATE latin1_general_cs) FROM test1;
```
(OP) why is the collation specified when you are defining the column?
Collation granularity goes down to a single value: a collation can be applied to any individual string expression in a query, and the column-level collation is merely the default for that column’s values.
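A quick illustration of that per-value granularity (MySQL syntax; string literals with `_latin1` introducers stand in for column values, assuming the latin1 character set is available):

```sql
-- The same pair of strings compares differently depending on the
-- collation applied to each individual expression.
SELECT _latin1 'Test' = _latin1 'TEST' COLLATE latin1_general_ci AS ci_match,  -- 1
       _latin1 'Test' = _latin1 'TEST' COLLATE latin1_general_cs AS cs_match;  -- 0
```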