Remove Duplicate Records from Table in SQL Server

Views: 2457
Comments: 0
Like/Unlike: 0
Posted On: 16-Dec-2015 14:44

Share:

Rahul M...

4960 Points

31 Posts

Introduction

SQL Server tables should never have duplicate rows, nor non-unique primary keys.Duplicate rows are a violation of entity integrity, and should be disallowed in a relational system. SQL Server has various techniques for enforcing entity integrity, including indexes, UNIQUE constraints, PRIMARY KEY constraints, and triggers.
Despite this, under unusual circumstances duplicate rows may occur, and if so they must be eliminated. One way they can occur is if duplicate row exist in non-relational data outside SQL Server, and the data is imported while row uniqueness is not being enforced. Another way they can occur is through a wrong database design, such as not enforcing entity integrity on each table.
This article shows how to locate and remove duplicate rows from a table.

Problem

Suppose we have a tblCustomer table have some duplicate rows as:

Customer_id	Customer_name	Customer_gender	Customer_age
c0023	abc	Male	30
c0023	abc	Male	30
c0023	abc	Male	30
c0050	xyz	Male	45
c0050	xyz	Male	45
c0050	xyz	Male	45
c0022	sdf	FeMale	20
c0022	sdf	FeMale	20

In the above table we see that there are many duplicate rows.

Solution (I)

You can use ROW_NUMBER() with PARTITION BY to delete duplicate.

Use following CTE Query to delete duplicate record from the table tblCustomer:

WITH tmp AS(SELECT Customer_id,Customer_name,Customer_gender,Customer_age,ROW_NUMBER() OVER(PARTITION BY Customer_id,Customer_name,Customer_gender,Customer_age ORDER BY name) AS duplicateCount FROM tblCustomer)
DELETE FROM tmp WHERE tmp.duplicateCount>1

Solution (II)

The first step is to identify which rows have duplicate values:

SELECT Customer_id,Customer_name,Customer_gender,Customer_age, count(*) as duplicatecount

FROM tblCustomer

GROUP BY Customer_id,Customer_name,Customer_gender,Customer_age

HAVING count(*) > 1

This will return three row for each set of duplicate rows in the table. The last column in this result is the number of duplicates for the particular row.

Customer_id	Customer_name	Customer_gender	Customer_age	duplicatecount
c0023	abc	Male	30	3
c0050	xyz	Male	45	3
c0022	sdf	FeMale	20	2

If there are only a few sets of duplicate rows, the best procedure is to delete these manually on an individual basis. For example:

set rowcount 2

delete from tblCustomer

where Customer_id='c0023' AND Customer_name='abc' AND Customer_gender='Male' AND Customer_age=30

The rowcount value should be n-1 the number of duplicates for a given row. In this example, there are 3 duplicates so rowcount is set to 3-1=2

If there are many distinct sets of duplicate rows in the table, it may be too time-consuming to remove them individually. In this case the following procedure can be used:
First, run the above GROUP BY query to determine how many sets of duplicate rows exist, and the count of duplicates for each set.
Select the distinct rows into a tmpCustomer table. For example:

SELECT DISTINCT * INTO tmpCustomer FROM tblCustomer

At this point, the tmpCustomer table should have unique rows. Verify that each row in tmpCustomer is unique, and that you do not have duplicate rows. For example, the query:

SELECT Customer_id,Customer_name,Customer_gender,Customer_age, count(*) as duplicatecount

FROM tmpCustomer

GROUP BY Customer_id,Customer_name,Customer_gender,Customer_age

HAVING count(*) > 1

should return a count of 1 for each row.
Delete the duplicate rows from the original table (tblCustomer). For example:

DELETE FROM tblCustomer

Put the unique rows back in the original table. For example:

INSERT tblCustomer SELECT * FROM tmpCustomer

Result

Now the tblCustomer table looks as:

Customer_id	Customer_name	Customer_gender	Customer_age
c0023	abc	Male	30
c0050	xyz	Male	45
c0022	sdf	FeMale	20

Conclusion

In this article we uses simple T-SQL to delete duplicate rows from a table by Solution (I) and Solution (II). Hopefully, this article will help you do that.

0 Comments