Is there some function in PostgreSQL to simply check if the input string is valid UTF-8 according to PG?

Posted on

Question :

For several years now, I’ve been hunting this extremely quick-footed little bug which knows every hiding place in my metaphorical coding house. He lurks in the walls, sometimes takes months to show himself, and BAM! He’s back, without forewarning. And the hunt continues. And then he runs into a hole in the wall and I can’t get him out again no matter what. And so the wait begins…

To make a long story short: PHP and PostgreSQL simply do not have a fully consistent idea of what constitutes a valid UTF-8 string. I can now say this for 100% sure.

To avoid the errors that inevitably are logged when I try to INSERT data which PHP claims is valid UTF-8 but PostgreSQL thinks is not (regardless of who is actually right), I must make some kind of pre-query whose only job is to check if the string is valid UTF-8 according to PostgreSQL. Is this possible somehow?

I’m thinking of something like this:

SELECT is_string_valid_utf8_according_to_postgresql($1);

This would then return a true/false so that I can know for sure if I can go ahead and execute the real, INSERT query, or skip it.

Answer :

PostgreSQL is very strict about encodings, so you normally won’t be able to get an incorrectly encoded string into the database. That is, your proposed function cannot operate on text, because you won’t get an incorrectly encoded string into the database as text.

But if you load the string as binary data (type bytea), you can check the string with a function like this:

CREATE FUNCTION is_valid_utf8(bytea) RETURNS boolean
   LANGUAGE plpgsql AS
$$BEGIN
   PERFORM convert_from($1, 'UTF8');
   RETURN TRUE;
EXCEPTION
   WHEN character_not_in_repertoire THEN
      RAISE WARNING '%', SQLERRM;
      RETURN FALSE;
END;$$;

Here it is in action:

SELECT is_valid_utf8('x736368c3b66e');

 is_valid_utf8 
---------------
 t
(1 row)

SELECT is_valid_utf8('x736368c3f66e');

WARNING:  invalid byte sequence for encoding "UTF8": 0xc3 0xf6

 is_valid_utf8 
---------------
 f
(1 row)

Leave a Reply

Your email address will not be published. Required fields are marked *