Postgres: trying to fix some unicode encoding issues

Posted on

Question :

I’ve got a postgres 8.4 database with a few mangled characters in one of the columns on a table. For example, the word “don’t” gets displayed within psql as:

donâu0080u0099t

I want to be able to dig out these odd characters and replace them with simple ' or " alternatives. However, I’m having a bit of a job getting either ~ or LIKE to be able to pull the correct rows in, let alone be able to replace them.

The database is UTF8 and the client_encoding is also set to UTF8. I’ve tried selecting u string and x strings, but can’t seem to get anything reliable. For example, all of the following queries return 0 rows or ERROR: invalid byte sequence for encoding "UTF8": 0x80

select col from tab where col like '%u0080u0099%';
select col from tab where col ~'u0080u0099';
select col from tab where col like '%x8099%';
select col from tab where col ~'x8099';

Any ideas?

Answer :

If what you see is donâu0080u0099t then those are literal backslashes in your table, not escape sequences.

Try to double the backslashes to get literal ones:

select col from tab where col ~~ E'%\u0080%';

And use the proper E'' syntax for PostgreSQL 8.4 with that.

Leave a Reply

Your email address will not be published. Required fields are marked *