Query for similar syllables on a field

Posted on

Question :

I have a table called toponim. This table contain field called namspe. I would like to find words in this field that contain similar syllables such as Toli-toli, Bagan Siapi-api, Fakfak, Oro-oro Ombo.

At first I use this query, but it won’t work if the word that contains similar syllables is in the middle or in the end of the sentence. Any ideas?

select namlok, namspe 
from toponim
where left(namspe,4) = substring (namspe,5,4);

enter image description here

Answer :

What version of Postgres are you using?

If I understand correctly, you would like to match records with similar words. There are several ways to approach this. I’d suggest taking a look at a few features and seeing if one sounds like a good match for you.

The pg_trgrm extension included with standard distributions is worth considering:
https://www.postgresql.org/docs/current/pgtrgm.html

This tool breaks words into a chunks of three letters, and then (with the help of an index) quickly compares two strings to see how similar they are, based on the three letter chunks.

Trigrams are a good general-purpose approach, but you sound like you are matching on real words, not just random strings. What you describe sounds like the process used to prepare words for a “lexical” Full Text Search in Postgres:

https://www.postgresql.org/docs/current/textsearch.html

The idea with a “lexical” analysis is that strings are converted into a common set of base words. For example, in English, singular, plural, and different tenses are converted into a common word to simplify certain kinds of matches. Sometimes the “word” isn’t even a real word, it’s just a common root-like word used for this purpose.

In your case, I believe that ‘Botol-botol’ and ‘Botol’ would both be converted to ‘botol’. Likewise, “merdeka-merdeka” and “merdeka” could be translated to “merdeka” for comparison. It’s even possible to have a thesaurus, so perhaps “mandiri” could also be translated to “merdeka”. For all of this to work automaticall requires an Indonesian dictionary for Postgres. I just checked my installation, and do not find Malay or Indonesian:

SELECT dictname FROM pg_catalog.pg_ts_dict;

Perhaps someone knows how to add Indonesian to Postgres?

If there is no suitable dictionary for your language available, you may be able to make progress with one of the installed dictionaries. There you will get a list of words parsed. There’s only one text parser in Postgres, and I do not believe that it will break words on the – character, which would be useful in your case. To work around this, I’ve used the replace function to change hyphens into spaces so that the Postgres parser will see each part as a distinct word. Perhaps this is a terrible idea, but it should help you see how the system can work.

/* Toli-toli, Bagan Siapi-api, Fakfak, Oro-oro Ombo */
select to_tsvector('simple', replace('Toli','-',' ')) union all
select to_tsvector('simple', replace('Toli-toli','-',' ')) union all
select to_tsvector('simple', replace('Bagan Siapi-api','-',' ')) union all
select to_tsvector('simple', replace('Fakfak','-',' ')) union all
select to_tsvector('simple', replace('Oro-oro Ombo','-',' '))

Below are the results:

to_tsvector
'toli':1
'toli':1,2
'api':3 'bagan':1 'siapi':2
'fakfak':1
'ombo':3 'oro':1,2

Notice that “Toli” and “Toli-toli” both become “toli”, which is what I think you are trying to accomplish.

Best would be a language adapted to your language, of course.

If you wish to literally work with syllables, you should find a natural language processing tool which does such things. If it is in Python or Perl or Tcl, you should be able to import the functionality into PostgreSQL using the respective PL language.

If you are happy to work with reduplicated sequences of letters without regard to their phonology, you can use regular expressions with backreferences. For example, to find any sequence of 3 letters repeated later in the string, you can use:

select regexp_match('Bagan Siapi-api','(.{3}).*1','i');

Leave a Reply

Your email address will not be published. Required fields are marked *