I want to be able to search unaccented phrases in an inflected language (Polish) in Postgres. Say a document contains robiłem; its lexeme should be robić (the infinitive), and its other forms are robiła and so on. I want to be able to find it with, for example, the phrase robie, which is unaccented.
What I did is I started out with a perfectly well-working Polish text search config:

```sql
CREATE TEXT SEARCH DICTIONARY polish_ispell (
    TEMPLATE  = pg_catalog.ispell,
    dictfile  = 'polish',
    afffile   = 'polish',
    stopwords = 'polish'
);
```
Then I tried to extend it to include the unaccent dictionary:

```sql
CREATE EXTENSION unaccent;

CREATE TEXT SEARCH CONFIGURATION polish_unaccented (COPY = polish);

ALTER TEXT SEARCH CONFIGURATION polish_unaccented
    ALTER MAPPING FOR hword, hword_part, word
    WITH unaccent, polish_ispell, simple;
```
Sadly, lexemes are not created correctly with this config:

```sql
SELECT to_tsvector('polish_unaccented', 'robił');
-- 'robil':1
```
The lexeme should of course be 'robić':1.
So the query below can't return true (and that's what I need, I think):

```sql
SELECT to_tsvector('polish_unaccented', 'robić')
    @@ to_tsquery('polish_unaccented', 'robie');
```
I’ve googled but did not find any documents showing how to really configure Postgres for my case. The docs only show the lame ‘Hôtels’ example, which is not a ‘lexemed’ word.
AFAIK, you cannot do what you want with current PostgreSQL full text configurations (dictionaries and parsers/lexers), although some workarounds might do something close to the trick.
I don't know Polish, but I've had similar problems with Spanish (which also has conjugations, etc.), and with the fact that people have become completely accustomed to Google ignoring accents, so they just frequently ignore them as well.
You can have several dictionaries in PostgreSQL, which do different things but which all basically simplify your texts in some fashion. If you want to convert words to their lexemes, you use an ISpell dictionary:
[…] which can normalize many different linguistic forms of a word into the same lexeme. For example, an English Ispell dictionary can match all declensions and conjugations of the search term bank, e.g., banking, banked, banks, banks’, and bank’s.
However, this dictionary (at least in Spanish) will only recognize properly accented words; given an unaccented word, it won't be able to find out which lexeme it corresponds to. This is because the entries in ISpell dictionaries, such as the Polish ISpell dictionary, are written with all the proper accents (or diacriticals) [as they should be].
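You can see this behaviour directly with `ts_lexize()`, assuming the `polish_ispell` dictionary from the question is installed (the exact output depends on your dictionary files, so the comments below are illustrative, not guaranteed):

```sql
-- The properly accented form is recognized and mapped to its lexeme
-- (something like {robić}):
SELECT ts_lexize('polish_ispell', 'robił');

-- The unaccented form is not an entry in the dictionary,
-- so no lexeme comes back:
SELECT ts_lexize('polish_ispell', 'robil');
```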
The Polish ISpell dictionary is composed of two files, one called pl_PL.dic (encoded as ISO-8859-2, as far as I've been able to guess) and one called pl_PL.aff. The first one contains lexemes plus the rules that apply to them; the second one contains the meanings of those rules. The ISpell software interprets those files to figure out how to transform words into their lexemes [and also how to check whether a spelling is correct or not].
The entries in the .dic file look like:

```
abecadło/UV
abecadłowy/bxXyY
[...]
Abisyńczyk/NOqsT
abisyński/XxYbyc
```
The .aff file gives the rules for the meanings of the "U", the "V", and all the other letters that follow the / sign. Some of these rules (which I am far from understanding) tell the software how suffixes or prefixes work for the word abecadło. For instance:

```
SFX U ło le [^astz]ło
```
As there is no word like abisynski in this dictionary, if you enter that text into your search, the dictionary will not return any lexeme.
Possible workaround: manipulate the dictionary file and duplicate every line containing accented characters with an unaccented equivalent. You would probably need to do something similar with the .aff part of the dictionary.
Probably, you would also need to use a Synonym dictionary to make the accented and unaccented versions of all words have the same meaning.
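A sketch of what such a synonym dictionary could look like, using PostgreSQL's built-in synonym template. The file name polish_accents.syn and its sample contents are my own illustration, not part of the original setup; the file must live in your installation's $SHAREDIR/tsearch_data directory:

```sql
-- Hypothetical polish_accents.syn, mapping unaccented spellings to the
-- accented forms the ISpell dictionary knows, one pair per line:
--   robic     robić
--   abecadlo  abecadło
CREATE TEXT SEARCH DICTIONARY polish_accents_syn (
    TEMPLATE = synonym,
    SYNONYMS = 'polish_accents'
);
```

You would then list this dictionary before polish_ispell in the configuration's mapping, so unaccented input gets rewritten to the accented form before the ISpell lookup.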
This is a brute-force approach, and, in practice, you'd be inventing a "new version of the Polish language, where accented letters are equivalent to their unaccented counterparts". [Don't tell the dictionary makers, who made them to make sure people would spell correctly ;-)].
I think this approach carries very many risks. I know I wouldn't do it in either Spanish or Catalan, because the presence or absence of a diacritical mark can radically change the meaning of a word ("año" doesn't have much to do with "ano" in Spanish, and considering them synonyms is extremely delicate).
You’ll have to evaluate whether this applies to Polish or not.
Alternative: you can just use a combination of a simple dictionary and the "filtering" unaccent module. You won't get lexemes, and the transformations this combination can perform are not as sophisticated... but you'll get the same result whether you search for abecadło or abecadlo. In my case, I ended up settling for this solution.
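A minimal sketch of that configuration (the name polish_simple_unaccented is my own choice, not from the original post):

```sql
CREATE EXTENSION IF NOT EXISTS unaccent;

-- Start from the built-in 'simple' configuration, then put the filtering
-- unaccent dictionary in front of the simple dictionary, so tokens are
-- stripped of diacritics before being lowercased and indexed.
CREATE TEXT SEARCH CONFIGURATION polish_simple_unaccented (COPY = simple);

ALTER TEXT SEARCH CONFIGURATION polish_simple_unaccented
    ALTER MAPPING FOR hword, hword_part, word
    WITH unaccent, simple;

-- Both spellings now normalize to the same token, so this returns true:
SELECT to_tsvector('polish_simple_unaccented', 'abecadło')
    @@ to_tsquery('polish_simple_unaccented', 'abecadlo');
```

No stemming happens here, so robił and robić remain distinct tokens; only the accent-insensitivity part of the problem is solved.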
Second alternative: if you need a text search that can ignore accents, tolerate small misspellings, and offer a lot of sophisticated possibilities, consider a solution outside the database, such as Apache Solr. It's obviously a very different approach, and you need some process to keep it synchronized with the database.
Wow, this was fun. So I wrote a program to do this for you, called pg_hunspell:

```
pg_hunspell pl PL polish
```

```sql
SELECT to_tsvector('polish', 'robił');

 to_tsvector
-------------
 'robić':1
(1 row)
```