Archive for May, 2009

Locating and deleting duplicates

Wednesday, May 6th, 2009

One common task is looking for duplicate records and deleting them. I locate them by looking for records where the field in question has the same value across multiple records:

select field1 from
(
  select cnt,field1 from
  (
    select count(1) cnt,field1 from table1 group by field1
  ) x where cnt > 1
) x where cnt > 1

But I can’t just delete all of those, because I want to keep 1 record and delete only the duplicate. So, I find the lowest id for the duplicate records (this is the primary key, hopefully it is always an integer), and that is the record that I want to delete:

select min(id) id,field1
from table1
where field1 in
(
  select field1 from
  (
    select count(1) cnt,field1 from table1 group by field1
  ) x where cnt > 1
)
group by field1

Here’s how I “delete” them. First, I change the parentid (e.g. I change the foreign key to a non-existent record.). Then, I test my application. If the application breaks, I update the parentid to make it what it used to be. It is a quick way of testing and restoring.

update table1
set parentfield = -999
where field1 in
(
  select id from
  (
    select min(id) id,field1
    from table1
    where field1 in
    (
      select field1 from
      (
        select count(1) cnt,field1 from table1 group by field1
      ) x where cnt > 1
    )
    group by field1
    ) x
)

When I’m convinced that my changes didn’t cause a problem, I then delete them:

  delete from table1 where parentid = -999