Delete Duplicate Records using CTE in TSQL

We have already discussed in a previous tutorial on how to delete duplicate records from a table using IDENTITY column approach (add,delete and drop identity column) – Check this out. The only problem with this approach is that the underlying schema needs to be modified and reverted back after deletion.

In this tutorial, a much more cleaner approach without touching the schema will be demonstrated. We use Common Table Expression (CTE) and we group duplicate rows using OVER PARTITION BY and then version each row of group using ROW_NUMBER(). Later we delete all the duplicates in the groups except the first row i.e., with ROW_NUMBER() = 1. This way we can delete all duplicates. This approach is much cleaner and readable and on top of it, there is no need to touch schema of the table at all.

Code is self explanatory and comments are provided for ease of understanding.

/**************************** Create TestDB Database *****************************/ 
 
IF  NOT EXISTS (SELECT [name] FROM master.[dbo].[sysdatabases] WHERE [name] = 'TestDB')
BEGIN
       CREATE DATABASE [TestDB];
END
GO
 
USE [TestDB]
GO

/**************************** Create Employee Table ******************************/

IF EXISTS(SELECT * FROM sys.sysobjects WHERE type = 'U' AND name = 'Employee')
BEGIN
    DROP TABLE [dbo].[Employee];
END
GO

CREATE TABLE [dbo].[Employee](
    [EmpId] [int] NOT NULL,
    [EmpName] [nvarchar](32) NOT NULL
)
GO

/*********************** Enter duplicate Employee Details ************************/

INSERT INTO [dbo].[Employee] ([EmpId],[EmpName]) VALUES (1,'Rami');
INSERT INTO [dbo].[Employee] ([EmpId],[EmpName]) VALUES (2,'Venu');
INSERT INTO [dbo].[Employee] ([EmpId],[EmpName]) VALUES (3,'Todd');
INSERT INTO [dbo].[Employee] ([EmpId],[EmpName]) VALUES (4,'Young');

INSERT INTO [dbo].[Employee] ([EmpId],[EmpName]) VALUES (1,'Rami');
INSERT INTO [dbo].[Employee] ([EmpId],[EmpName]) VALUES (2,'Venu');
INSERT INTO [dbo].[Employee] ([EmpId],[EmpName]) VALUES (3,'Todd');
INSERT INTO [dbo].[Employee] ([EmpId],[EmpName]) VALUES (4,'Young');

INSERT INTO [dbo].[Employee] ([EmpId],[EmpName]) VALUES (1,'Rami');
INSERT INTO [dbo].[Employee] ([EmpId],[EmpName]) VALUES (2,'Venu');
INSERT INTO [dbo].[Employee] ([EmpId],[EmpName]) VALUES (3,'Todd');
INSERT INTO [dbo].[Employee] ([EmpId],[EmpName]) VALUES (4,'Young');
GO 

/************************* Delete Duplicates using CTE ***************************
We use OVER PARTITION BY to group duplicate records, where we suplly EmpId and 
EmpName and grouping the records based on their values. ROW_NUMBER() is used as to 
version each row of the group. Then using a DELETE query on CTE, the underlying 
Employee Table duplicates are removed.
*********************************************************************************/

 WITH DeleteRecords AS 
 (
    SELECT
          [EmpId],
          [EmpName],
          ROW_NUMBER() OVER (PARTITION BY [EmpId],[EmpName] ORDER BY [EmpId]) AS DeleteNumber
    FROM
          [dbo].[Employee]
 )
 DELETE 
 FROM
        DeleteRecords
 WHERE
        DeleteNumber > 1;


/************************** Select from Employee Table ***************************/
SELECT
       *
FROM
       [dbo].[Employee]
ORDER BY
       [EmpId];

 
/***************************** Drop TestDB Database ******************************/
USE master
GO
 
IF EXISTS (SELECT [name] FROM master.[dbo].[sysdatabases] WHERE [name] = 'TestDB')
BEGIN
       DROP DATABASE [TestDB];
END
GO

 

Output :

DuplicateRows1

You may also like...