Question :
I parse XML documents with the to_tsvector()
function, and sometimes it produces tokens shorter than 3 characters:
'1':89,91 '2019':14 '25':4
I know that
to_tsvector([ config regconfig, ] document text) returns tsvector
accepts a config as its first parameter, but I can't find a way to set a minimal token length there. Is there any way to do it?
Answer :
Tokens are initially produced by the full text search parser bound to the text search configuration. The default parser that ships with PostgreSQL is not configurable (it can be replaced by a custom parser, though).
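To see what the parser emits before any dictionary runs, and which token type each fragment gets, you can use the built-in ts_debug function. A sketch (the sample text is made up, and the exact token types you see will depend on your input):

```sql
-- Shows each raw token, its type alias (asciiword, uint, ...),
-- and the dictionaries it is routed to in the given configuration.
SELECT token, alias, dictionaries
FROM ts_debug('english', 'room 25, 1 june 2019');
```

The alias column matters later: short numeric tokens like '25' typically come out as uint, not asciiword, so any filtering has to target the right token types.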
After being output by the parser, tokens can be filtered out with dictionaries. It's relatively easy to create a dictionary that filters out short words, but it takes the form of a couple of functions written in the C language.
As an example, here’s a blog post explaining how to write a custom dictionary to filter out long words: Text search: a custom dictionary to avoid long words.
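For orientation, here is roughly how such a C-language dictionary gets wired into PostgreSQL once the shared library is built. The function names, library path, and template/dictionary names below are all hypothetical placeholders; only the statement shapes and the init/lexize signatures follow the documented text search template interface:

```sql
-- Hypothetical: both functions are implemented in C and compiled
-- into a shared library (here called 'shortword_dict').
CREATE FUNCTION shortword_init(internal)
    RETURNS internal
    AS 'shortword_dict' LANGUAGE C STRICT;

CREATE FUNCTION shortword_lexize(internal, internal, internal, internal)
    RETURNS internal
    AS 'shortword_dict' LANGUAGE C STRICT;

-- Register a template backed by those functions, then a dictionary
-- instance that text search configurations can reference.
CREATE TEXT SEARCH TEMPLATE shortword_template (
    INIT = shortword_init,
    LEXIZE = shortword_lexize
);

CREATE TEXT SEARCH DICTIONARY shortword (
    TEMPLATE = shortword_template
);
```

The actual length check lives in the C lexize function, which returns an empty result for tokens it wants discarded.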
If the short words you want to filter out are reasonably few, another option is to enumerate them all in a stop-words file and use the built-in simple dictionary. From the documentation:
The simple dictionary template operates by converting the input token
to lower case and checking it against a file of stop words. If it is
found in the file then an empty array is returned, causing the token
to be discarded. If not, the lower-cased form of the word is returned
as the normalized lexeme.
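That approach might look like the following sketch. It assumes a file shortwords.stop (one short word per line) placed in $SHAREDIR/tsearch_data/; the dictionary and configuration names are made up. Accept = false is a documented option of the simple template that makes it act as a pure filter, passing unrecognized tokens on to the next dictionary instead of accepting them:

```sql
-- Dictionary that discards anything listed in shortwords.stop
-- and lets everything else fall through to the next dictionary.
CREATE TEXT SEARCH DICTIONARY shortwords (
    TEMPLATE = simple,
    STOPWORDS = shortwords,   -- refers to shortwords.stop
    Accept = false
);

-- Clone the english configuration and put the filter in front of
-- the stemmer for the relevant token types (check yours with
-- ts_debug; numbers such as '25' usually arrive as uint).
CREATE TEXT SEARCH CONFIGURATION english_noshort (COPY = english);
ALTER TEXT SEARCH CONFIGURATION english_noshort
    ALTER MAPPING FOR asciiword, word
    WITH shortwords, english_stem;
```

You would then call to_tsvector('english_noshort', document) instead of relying on the default configuration.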