Delete duplicates using SQL

Use the DISTINCT keyword in conjunction with the DELETE command in SQL to eliminate duplicates in one or more columns of a table.

Here's an example:

Consider a table called test_table with two columns col1 and col2.

Now delete duplicate rows based on both columns:

DELETE FROM test_table WHERE (col1, col2) NOT IN ( SELECT DISTINCT col1, col2 FROM test_table );

In this illustration, the subquery enclosed in brackets chooses all unique sets of values for col1 and col2. Only the rows that are not a part of the distinct set are eliminated thanks to the NOT IN operator in the WHERE clause. Be aware that you must change the query by deleting the second column from the subquery and the comparison in the WHERE clause if you want to eliminate duplicate records based on just one field.

For example, to delete duplicates based on col1 only, you would use:

DELETE FROM test_table WHERE col1 NOT IN ( SELECT DISTINCT col1 FROM test_table );

Always create a backup of the data before executing a delete query since deleting rows from a table might be a dangerous activity.

Deleting duplicates based on a single column:

Consider a table called students with a column name.

Now delete any duplicate rows based on the name column:

DELETE FROM students WHERE id NOT IN ( SELECT MIN(id) FROM students GROUP BY name );

In this example, the subquery selects the minimum id for each distinct name value, and the NOT IN operator in the WHERE clause ensures that only rows that are not part of this set are deleted.

Deleting duplicates based on multiple columns:

Consider a table called orders with two columns customer_id and product_id.

Now delete any duplicate rows based on both columns:

DELETE FROM orders WHERE (customer_id, product_id) NOT IN ( SELECT MIN(customer_id), MIN(product_id) FROM orders GROUP BY customer_id, product_id );

In this example, the subquery selects the minimum customer_id and product_id for each distinct combination of values and the NOT IN operator in the WHERE clause ensures that only rows that are not part of this set are deleted.

Deleting all duplicates:

To delete all duplicates in a table regardless of the columns they are based on, use the ROW_NUMBER() function in combination with a Common Table Expression (CTE):

WITH cte AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY * ORDER BY id) AS rn FROM test_table ) DELETE FROM cte WHERE rn > 1;

In this illustration, the CTE takes every row from the table and gives each row a distinct row number based on every column (PARTITION BY *). Only the first instance of each row is given a row number of 1, thanks to the ROW_NUMBER() function. The DELETE statement then eliminates any duplicate entries from the table by deleting all rows with a row number greater than 1.

Vipin Kumar Tiwari

Search This Blog

Delete duplicates using SQL

Comments

Post a Comment