Finding duplicates

Posted on

Question :

The scenario is as follows, a while ago a colleague accidentally ran a query similar to

 DELETE FROM app_i18n WHERE disabled = 0 and translation is null or translation = '';

This is the table structure of app_i18n

CREATE TABLE app_i18n (
     `id` INTEGER(11) NOT NULL AUTO_INCREMENT,
     `category` INTEGER(11) unsigned NOT NULL,
     `keyword`  varchar(50) DEFAULT NULL,
     `language` varchar(2) NOT NULL DEFAULT 'en',
     `translation` text NOT NULL,
     `disabled` tinyint(1) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`),
  KEY (`category`, `keyword`, `language`, `disabled`)
) ENGINE=InnoDB;

The resulting problem is that some of those “translations” while empty were there for a reason. So a restore was in order. From an old copy of the db he restored all the rows that were empty or null and were not disabled.

The problem has been compounded by an unknown number of other developers quickly throwing in “translations” to fix the issues they were experiencing. So I thought the solution would be

SELECT id 
  FROM app_i18n 
 WHERE disabled = 0 
   AND (translation is null or translation = '') 
 GROUP BY category, keyword, language 
HAVING COUNT(id) > 1;

Would that query yield the answers that I want? i.e. all records that having duplicate category, keywords in the same language which have duplicates?

Answer :

On most RDBMSs your query causes an error. I believe MySQL returns a result, but not necessarily what you are looking for.

When you group by a set of columns, only columns that are mentioned in the group by clause can also be mentioned in the directly in the select clause. All other columns have to be wrapped in some kind of aggregate like MAX() or AVG().

You are looking for all category, keyword and language combinations that have duplicate entries. So you want your query to return those columns:

SELECT category, keyword, language 
  FROM app_i18n 
 WHERE disabled = 0 
   AND (translation is null or translation = '') 
 GROUP BY category, keyword, language 
HAVING COUNT(id) > 1;

If you want all columns of the rows that have a non-unique category, keyword and language combination, you need to add a join:

SELECT A.*
  FROM app_i18n A
  JOIN (
       SELECT category, keyword, language 
         FROM app_i18n A2
        WHERE disabled = 0 
          AND (translation is null or translation = '') 
 GROUP BY category, keyword, language 
HAVING COUNT(id) > 1)C
    ON A.category = C.category
   AND A.keyword = C.keyword
   AND A.language = C.language;

MySQL has an GROUP_CONCAT function so you can get an comma separated list of id’s that matches your duplicated records.

SELECT
    GROUP_CONCAT(id ORDER BY id ASC) as "duplicated_row_ids"
  , category
  , keyword
  , language 
  FROM app_i18n 
 WHERE disabled = 0 
   AND (translation is null or translation = '') 
 GROUP BY category, keyword, language 
HAVING COUNT(id) > 1;

or

SELECT
    CONCAT('{"id":'
      , GROUP_CONCAT(CONCAT('"', id, '"') ORDER BY id ASC SEPARATOR ',"id":')
      , '}' 
    ) as "duplicated_row_ids" 
  , category
  , keyword
  , language 
  FROM app_i18n 
 WHERE disabled = 0 
   AND (translation is null or translation = '') 
 GROUP BY category, keyword, language 
HAVING COUNT(id) > 1;

for nice JSON id’s

see http://sqlfiddle.com/#!2/3b51a/6 for demo

Leave a Reply

Your email address will not be published.