Delete duplicates using SQL
Use the DISTINCT keyword in conjunction with the DELETE command in SQL to eliminate duplicates in one or more columns of a table.
Here's an example:
Consider a table called test_table
with two columns col1
and col2.
Now delete duplicate rows based on both columns:
DELETE FROM test_table
WHERE (col1, col2) NOT IN (
SELECT DISTINCT col1, col2
FROM test_table
);
In this illustration, the subquery enclosed in brackets chooses all unique sets of values for col1 and col2. Only the rows that are not a part of the distinct set are eliminated thanks to the NOT IN operator in the WHERE clause. Be aware that you must change the query by deleting the second column from the subquery and the comparison in the WHERE clause if you want to eliminate duplicate records based on just one field.
For example, to delete duplicates based on col1
only, you would use:
DELETE FROM test_table
WHERE col1 NOT IN (
SELECT DISTINCT col1
FROM test_table
);
Always create a backup of the data before executing a delete query since deleting rows from a table might be a dangerous activity.
- Deleting duplicates based on a single column:
Consider a table called students
with a column name.
Now
delete any duplicate rows based on the name
column:
DELETE FROM students WHERE id NOT IN ( SELECT MIN(id) FROM students GROUP BY name );
In this example, the subquery selects the minimum id
for each distinct name
value, and the NOT IN
operator in the WHERE
clause ensures that only rows that are not part of this set are deleted.
- Deleting duplicates based on multiple columns:
Consider a table called orders
with two columns customer_id
and product_id.
Now
delete any duplicate rows based on both columns:
DELETE FROM orders WHERE (customer_id, product_id) NOT IN ( SELECT MIN(customer_id), MIN(product_id) FROM orders GROUP BY customer_id, product_id );
In this example, the subquery selects the minimum customer_id
and product_id
for each distinct combination of values and the NOT IN
operator in the WHERE
clause ensures that only rows that are not part of this set are deleted.
- Deleting all duplicates:
To delete all duplicates in a table regardless of the columns they are based on, use the ROW_NUMBER()
function in combination with a Common Table Expression (CTE):
WITH cte AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY * ORDER BY id) AS rn FROM test_table ) DELETE FROM cte WHERE rn > 1;
In this illustration, the CTE takes every row from the table and gives each row a distinct row number based on every column (PARTITION BY *). Only the first instance of each row is given a row number of 1, thanks to the ROW_NUMBER() function. The DELETE statement then eliminates any duplicate entries from the table by deleting all rows with a row number greater than 1.
Comments
Post a Comment