Faulty oracle11g regexp_replace on removing duplicates

Posted on

Question :

I have a comma separated list of values (as one single string), which contains duplicates, and I want to remove them. I tried with REGEXP_REPLACE but for certain values it gives me a wrong result:

/* CORRECT */
select '308RONBADB046782001,308RONBADB046782001,308RONCRT0046782001,308RONCRT0046782001,308RONMFRT046782001,308RONMFRT046782001' as str,
       regexp_replace('308RONBADB046782001,308RONBADB046782001,308RONCRT0046782001,308RONCRT0046782001,308RONMFRT046782001,308RONMFRT046782001',
                  '([^,]+)(,*1)+',
                  '1') as str_uq
from dual
union all 
/* FAULTY */
select '106RONBADB0W2562701,106RONCRT00W2562701,106RPB0130350001,106RONBADB0W2562701' as str,
       regexp_replace('106RONBADB0W2562701,106RONCRT00W2562701,106RPB0130350001,106RONBADB0W2562701',
                  '([^,]+)(,*1)+',
                  '1') as str_uq
from dual

Running the above example, will provide the expected result for the first query, but for the second one, it just concatenates all the values without removing duplicates. I guess the fact that the values are starting with 1 has something to do with it.

Can you please help me out with the regex pattern? I cannot find the mistake…

Thank you!

Answer :

In your 2nd example the values are unique, but you’re searching for any repeated match and there’s multiple matches:

‘106RONBADB0W2562701,106RONCRT00W2562701,106RPB0130350001,106RONBADB0W2562701′

It’s not concatenating, it’s removing some characters 🙂

Usually adding a word boundary b should work, but not in Oracle.

I found this is working '(.+?,)1+', but you need to add a final comma to the string (fiddle)

/* CORRECT */
select '308RONBADB046782001,308RONBADB046782001,308RONCRT0046782001,308RONCRT0046782001,308RONMFRT046782001,308RONMFRT046782001' as str,
       regexp_replace('308RONBADB046782001,308RONBADB046782001,308RONCRT0046782001,308RONCRT0046782001,308RONCRT0046782001,308RONMFRT046782001,308RONMFRT046782001' || ',',
                  --'(b[^,]+)(,*1)+', --should work, doesn't in Oracle
                  '(.+?,)1+',
                  '1') as str_uq
from dual
union all 
/* FAULTY */
select '106RONBADB0W2562701,106RONCRT00W2562701,106RPB0130350001,106RONBADB0W2562701' as str,
       regexp_replace('106RONBADB0W2562701,106RONCRT00W2562701,106RPB0130350001,106RONBADB0W2562701' || ',',
                  --'(b[^,]+)(,*1)+', --should work, doesn't in Oracle
                  '(.+?,)1+',
                  '1') as str_uq
from dual

Indeed, as dnoeth said, it seems that I cannot achieve the expected behavior using only REGEXP, so I built a function that will do this. I’ll post it here, maybe it is useful to somebody else too.

create or replace function str_remove_duplicates (str in varchar2, separator in varchar2) 
    return varchar2
is
    regex_pattern varchar2(5);
    result varchar2(255);
begin

    regex_pattern := '[^' || nvl(separator, ',') || ']+'; -- '[^,]+'

    select listagg(acc, separator) within group (order by acc) into result
    from (
            select distinct acc from (
                                        select regexp_substr(str, regex_pattern, 1, level) as acc from dual
                                        connect by regexp_substr(str, regex_pattern, 1, level) is not null
                                     )
    );

    return result;
end str_remove_duplicates;

All the best!

Daniel

Leave a Reply

Your email address will not be published. Required fields are marked *