Question :
Let’s say I have a table with 1 000 000 records.
Structure of table is:
create table individual (
id serial primary key,
surname varchar(128),
"name" varchar(128),
patronymic varchar(128),
birth_dt date
)
I create a composite index.
INDEX 1
create index on individual using btree (upper(surname) varchar_pattern_ops
, upper("name") varchar_pattern_ops, patronymic varchar_pattern_ops, birth_dt);
Docs state that varchar_pattern_ops
should be applied when using LIKE
or pattern match in query. Conclusion: this index will not be used in query below, even it gets only 10 row from 1 000 000.
QUERY 1
select * from individual
order by upper(surname), upper("name"), upper(patronymic), birth_dt limit 10;
and even more, docs recommends to create an index without varchar_pattern_ops
as well.
INDEX 2
create index on individual using btree
(upper(surname), upper("name"), upper(patronimyc), birth_dt);
Then a query using LIMIT
will use this index.
I found a cheat to force Postgres to use first index on Postgres users forum. It is operator ~<~
.
QUERY 2
select * from individual
order by upper(surname) using ~<~
, upper("name") using ~<~
, upper(patronymic) using ~<~
, birth_dt limit 100;
In this case INDEX 1 will be used even if INDEX 2 doesn’t exists. I tried to investigate to discover why it happens, but failed.
There are some system tables like pg_operator
, which (I think) link operator ~<~
to some functions that most probably uses LIKE
or regular expressions.
I ran QUERY 1 and QUERY 2 a few times and compared result manually. It looks like operator ~<~
gives correct result, but I didn’t risk anything and just create a normal index anyway.
I am still interested how the Postgres planner decides which index to use index where it meets the operator ~<~
in QUERY 1.
Answer :
This revolves around COLLATION
rules (which are part of your locale). Per documentation:
The
C
andPOSIX
collations both specify “traditional C” behavior,
in which only the ASCII letters “A
” through “Z
” are treated as
letters, and sorting is done strictly by character code byte values.
Bold emphasis mine. These internal operators:
~<~
, ~<=~
, ~>=~
, ~>~
are members of the operator classes text_pattern_ops
, varchar_pattern_ops
and bpchar_pattern_ops
which ignore current collation rules and default to “traditional C” behavior. Per documentation:
The difference from the default operator classes is that the values
are compared strictly character by character rather than according to
the locale-specific collation rules.
That’s why they can use your first index, while their locale-aware equivalents
<
, <=
, >=
, >
would need the second (using the default operator class).
ORDER BY
defaults to the current collation rules. Unless you happen to work with the “C” collation the second set of operators is used for sorting, which can only use indexes with a matching collation (default btree indexes).
Notably, pattern-matching generally relies on “traditional C” behavior, that’s why a left anchored LIKE
match can use a _patern_ops
index with a non-C locale, but not a default index. Compare (last chapter):
In Postgres 9.1+ you can force a selected sort order with COLLATE "locale_name"
(the locale has to be available in your system!). So if you run your query with COLLATE "C"
(which is a much better option than the evil hack using ~<~
) , it can use the _pattern_ops
index:
SELECT * FROM individual
ORDER BY upper(surname) COLLATE "C"
, upper("name") COLLATE "C"
, upper(patronymic) COLLATE "C"
, birth_dt
LIMIT 100;
Results are different. Try with non-ASCII letters like ö
, é
, etc. Those typically sort depending on your current collation rules, while letters a
–z
typically sort the same in all locales.