How to set minimal token length for to_tsvector() function?

Question:

I parse XML documents with the to_tsvector() function, and sometimes it produces tokens shorter than 3 characters:

'1':89,91 '2019':14 '25':4 

I know that

to_tsvector([ config regconfig, ] document text) returns tsvector

accepts a configuration as its first parameter, but I can't find any option there to set a minimal token length. Is there any way to do it?
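For example, a call like this (the input text is made up for illustration) still emits the short numeric tokens:

```sql
-- Hypothetical input: single-digit tokens survive the default pipeline
SELECT to_tsvector('english', '25 June 2019, sections 1 and 2');
-- the result still contains the one-character tokens '1' and '2'
```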

Answer:

Tokens are initially produced by the full text search parser bound to the text search configuration. The default text search parser that ships with PostgreSQL is not configurable (it can be replaced by a custom parser, though).
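You can inspect what the parser emits, and which dictionary handles each token, with ts_debug (the sample text here is just an illustration):

```sql
-- For each token: its type alias, the raw token, the dictionary list
-- consulted for that token type, and the resulting lexemes
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('english', '25 June 2019: 1 and 2');
```

Short numeric tokens show up with the uint type alias, which the stock english configuration maps to the simple dictionary.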

After being output by the parser, tokens can be filtered out with dictionaries. It's relatively easy to create a dictionary that filters out short words, but it takes the form of a couple of functions written in C.
As an example, here’s a blog post explaining how to write a custom dictionary to filter out long words: Text search: a custom dictionary to avoid long words.
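Once such C functions are compiled into a shared library, wiring the dictionary in looks roughly like this (all names below are hypothetical, modeled on the contrib dict_int module; the C code itself is not shown):

```sql
-- Hypothetical: assumes a C lexize function dictshort_lexize
-- compiled into a shared library named 'dict_short'
CREATE FUNCTION dictshort_lexize(internal, internal, internal, internal)
    RETURNS internal
    AS 'dict_short' LANGUAGE C STRICT;

CREATE TEXT SEARCH TEMPLATE short_template (
    LEXIZE = dictshort_lexize
);

CREATE TEXT SEARCH DICTIONARY short_words_filter (
    TEMPLATE = short_template
);
```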

If there is only a manageable set of short words that you want to filter out, another option is to enumerate them all in a stop-words file and use the built-in simple dictionary template. From the documentation:

The simple dictionary template operates by converting the input token
to lower case and checking it against a file of stop words. If it is
found in the file then an empty array is returned, causing the token
to be discarded. If not, the lower-cased form of the word is returned
as the normalized lexeme.
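A sketch of that approach (the stop-words file name short.stop and the configuration name english_noshort are assumptions; the file must list one word per line and live in $SHAREDIR/tsearch_data/):

```sql
-- Assumes $SHAREDIR/tsearch_data/short.stop lists the short tokens
-- to drop, one per line (e.g. 1, 2, 25).
-- Accept = false makes the dictionary pass non-stopword tokens on
-- to the next dictionary instead of accepting them itself.
CREATE TEXT SEARCH DICTIONARY drop_short (
    TEMPLATE = simple,
    STOPWORDS = short,
    Accept = false
);

-- Copy the english configuration and put the filter in front of
-- the usual dictionaries for the relevant token types
CREATE TEXT SEARCH CONFIGURATION english_noshort (COPY = english);

ALTER TEXT SEARCH CONFIGURATION english_noshort
    ALTER MAPPING FOR asciiword, word
    WITH drop_short, english_stem;

ALTER TEXT SEARCH CONFIGURATION english_noshort
    ALTER MAPPING FOR uint
    WITH drop_short, simple;

-- Then index with the new configuration
SELECT to_tsvector('english_noshort', 'your document text here');
```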
