Let’s say I have a table with 1 000 000 records.
Structure of table is:
create table individual ( id serial primary key, surname varchar(128), "name" varchar(128), patronymic varchar(128), birth_dt date )
I create a composite index.
create index on individual using btree (upper(surname) varchar_pattern_ops , upper("name") varchar_pattern_ops, patronymic varchar_pattern_ops, birth_dt);
Docs state that
varchar_pattern_ops should be applied when using
LIKE or pattern match in query. Conclusion: this index will not be used in query below, even it gets only 10 row from 1 000 000.
select * from individual order by upper(surname), upper("name"), upper(patronymic), birth_dt limit 10;
and even more, docs recommends to create an index without
varchar_pattern_ops as well.
create index on individual using btree (upper(surname), upper("name"), upper(patronimyc), birth_dt);
Then a query using
LIMIT will use this index.
I found a cheat to force Postgres to use first index on Postgres users forum. It is operator
select * from individual order by upper(surname) using ~<~ , upper("name") using ~<~ , upper(patronymic) using ~<~ , birth_dt limit 100;
In this case INDEX 1 will be used even if INDEX 2 doesn’t exists. I tried to investigate to discover why it happens, but failed.
There are some system tables like
pg_operator, which (I think) link operator
~<~ to some functions that most probably uses
LIKE or regular expressions.
I ran QUERY 1 and QUERY 2 a few times and compared result manually. It looks like operator
~<~ gives correct result, but I didn’t risk anything and just create a normal index anyway.
I am still interested how the Postgres planner decides which index to use index where it meets the operator
~<~ in QUERY 1.
This revolves around
COLLATION rules (which are part of your locale). Per documentation:
POSIXcollations both specify “traditional C” behavior,
in which only the ASCII letters “
A” through “
Z” are treated as
letters, and sorting is done strictly by character code byte values.
Bold emphasis mine. These internal operators:
are members of the operator classes
bpchar_pattern_ops which ignore current collation rules and default to “traditional C” behavior. Per documentation:
The difference from the default operator classes is that the values
are compared strictly character by character rather than according to
the locale-specific collation rules.
That’s why they can use your first index, while their locale-aware equivalents
would need the second (using the default operator class).
ORDER BY defaults to the current collation rules. Unless you happen to work with the “C” collation the second set of operators is used for sorting, which can only use indexes with a matching collation (default btree indexes).
Notably, pattern-matching generally relies on “traditional C” behavior, that’s why a left anchored
LIKE match can use a
_patern_ops index with a non-C locale, but not a default index. Compare (last chapter):
In Postgres 9.1+ you can force a selected sort order with
COLLATE "locale_name" (the locale has to be available in your system!). So if you run your query with
COLLATE "C" (which is a much better option than the evil hack
using ~<~) , it can use the
SELECT * FROM individual ORDER BY upper(surname) COLLATE "C" , upper("name") COLLATE "C" , upper(patronymic) COLLATE "C" , birth_dt LIMIT 100;
Results are different. Try with non-ASCII letters like
é, etc. Those typically sort depending on your current collation rules, while letters
z typically sort the same in all locales.