Adventures in Machine Learning

Simplifying Duplicate Row Removal in SQL Server with Common Table Expressions

Deleting Duplicate Rows from a Table in SQL Server

Working with large datasets can be challenging, especially when dealing with duplicate entries. Duplicate rows in a table can cause a variety of problems, including the occupation of unnecessary space and making it harder to retrieve accurate information.

In this article, we’ll explore how to delete duplicate rows from a table in SQL Server.

Setting up a sample table

Before we dive into the process of deleting duplicate rows, let’s set up a sample table to work with. We’ll create a table named “students” with three columns: “id”, “name”, and “score”.

To insert some sample data, we’ll execute the following SQL statement:

CREATE TABLE students (
    id int,
    name varchar(255),
    score int
);
INSERT INTO students (id, name, score)
VALUES
    (1, 'John', 85),
    (2, 'Jane', 90),
    (3, 'Bob', 75),
    (4, 'John', 85),
    (5, 'Alice', 80),
    (6, 'Bob', 75);

Our “students” table has six rows, but there are two duplicate rows with the same “id”, “name”, and “score” values.

Finding and deleting duplicate rows

To find and delete the duplicate rows, we’ll use a combination of the GROUP BY clause, the ROW_NUMBER() function, a DELETE statement, and a common table expression (CTE). We’ll explain each of these components in detail below.

1) Using GROUP BY and HAVING Clause to Find Duplicate Rows

The GROUP BY clause allows us to group rows based on common values in one or more columns. To find duplicate rows in our “students” table, we’ll group the rows by the “id”, “name”, and “score” columns and count the number of rows in each group.

The resulting query will return only the rows that have more than one occurrence in the table. To achieve this, we’ll execute the following SQL query:

SELECT id, name, score, COUNT(*) as c
FROM students
GROUP BY id, name, score
HAVING COUNT(*) > 1;

The output from this query will be:

id name score c
1 John 85 2
3 Bob 75 2

This result shows us that there are two sets of duplicate rows in our “students” table: one with “id” 1 and “name” “John”, and another with “id” 3 and “name” “Bob”.

2) Using ROW_NUMBER() to Create a Ranking

The ROW_NUMBER() function assigns a unique number to each row in a result set.

We’ll use this function to create a ranking of the rows in the “students” table based on the “id”, “name”, and “score” columns. This ranking will be used in the next step to delete all but one of the duplicate rows.

To create a ranking, we’ll execute the following SQL query:

WITH duplicates AS (
    SELECT ROW_NUMBER() OVER (
        PARTITION BY id, name, score
        ORDER BY id
    ) AS Ranking,
    *
FROM students
)
SELECT *
FROM duplicates
WHERE Ranking > 1;

This query defines a common table expression (CTE) named “duplicates” that contains all the rows in our “students” table, along with a ranking number. The “PARTITION BY” clause specifies the columns to group by, and the “ORDER BY” clause specifies the order in which the rankings will be assigned.

Finally, the query selects all rows from the “duplicates” CTE where the ranking is greater than 1. The result from this query will be:

Ranking id name score
2 1 John 85
2 3 Bob 75
3 4 John 85
3 6 Bob 75

As expected, the query has assigned ranking numbers to the two sets of duplicate rows in the “students” table.

The rows with “Ranking” of 2 or higher are the ones we want to delete.

3) Using DELETE to Remove Duplicates

Now that we have identified the duplicate rows using the GROUP BY and ROW_NUMBER() functions, we can delete them using a DELETE statement.

To do this, we’ll combine both queries into a single query using a common table expression. The following SQL query will delete all but one of the duplicate rows in the “students” table:

WITH duplicates AS (
    SELECT ROW_NUMBER() OVER (
        PARTITION BY id, name, score
        ORDER BY id
    ) AS Ranking,
    *
FROM students
),
duplicates_to_delete AS (
    SELECT id, name, score
    FROM duplicates
    WHERE Ranking > 1
)
DELETE
FROM students
WHERE (id, name, score) IN (
    SELECT id, name, score
    FROM duplicates_to_delete
);

This query defines two CTEs: “duplicates”, which we explained in Step 2, and “duplicates_to_delete”, which selects the duplicate rows to delete based on the ranking. The DELETE statement then deletes all rows from the “students” table where the “id”, “name”, and “score” values match those from the “duplicates_to_delete” CTE.

After executing this query, the “students” table will have only the following three rows:

id name score
1 John 85
2 Jane 90
5 Alice 80

Using GROUP BY Clause to Find Duplicate Rows

The GROUP BY clause in SQL Server is used to group rows with identical values in one or more columns into a single result row. It allows you to perform aggregate operations such as calculating sums, averages, and counts for each group.

Additionally, the GROUP BY clause can be used to find duplicate rows in a table.

Syntax and explanation of GROUP BY clause

The syntax for using the GROUP BY clause is as follows:

SELECT column1, column2, ..., aggregate_function(column_name)
FROM table_name
WHERE conditions
GROUP BY column1, column2, ...;

In this syntax, the “SELECT” statement selects the columns to be included in the result set, along with any aggregate functions to be performed. The “FROM” statement specifies the table to be queried, and the “WHERE” statement specifies any conditions to filter the results.

Finally, the “GROUP BY” clause groups the rows by one or more columns. The aggregate functions that can be used with the GROUP BY clause include COUNT(), SUM(), AVG(), MAX(), and MIN().

For example, the following query groups the rows in the “orders” table by the “product_id” column and calculates the total number of units sold for each product:

SELECT product_id, SUM(quantity) AS total_sold
FROM orders
GROUP BY product_id;

Example of using GROUP BY clause

To demonstrate how to use the GROUP BY clause to find duplicate rows, let’s use the same sample table we used in the previous section:

CREATE TABLE students (
    id int,
    name varchar(255),
    score int
);
INSERT INTO students (id, name, score)
VALUES
    (1, 'John', 85),
    (2, 'Jane', 90),
    (3, 'Bob', 75),
    (4, 'John', 85),
    (5, 'Alice', 80),
    (6, 'Bob', 75);

Suppose we want to find all students with identical “name” and “score” values. We can do this by using the GROUP BY clause to group the rows by the “name” and “score” columns, and then count the number of rows in each group.

The following SQL query achieves this:

SELECT name, score, COUNT(*) as c
FROM students
GROUP BY name, score
HAVING COUNT(*) > 1;

The output from this query will be:

name score c
John 85 2
Bob 75 2

This result shows that there are two sets of duplicate rows in the “students” table, one for students named “John” with a score of 85, and another for students named “Bob” with a score of 75.

Conclusion

In conclusion, the GROUP BY clause in SQL Server is a powerful tool for grouping rows by one or more columns and performing aggregate functions on the resulting groups. It can also be used to find duplicate rows in a table by grouping rows with identical values and counting the number of rows in each group.

By combining the GROUP BY clause with other SQL statements such as the ROW_NUMBER() function and DELETE statement, we can effectively remove duplicate rows from a table.

3) Using ROW_NUMBER() Function to Find Duplicate Rows

The ROW_NUMBER() function in SQL Server is used to assign a unique sequential number to each row in a result set. It’s commonly used to create a ranking of rows based on one or more columns.

The function can also be used to find duplicate rows in a table. In this section, we’ll explore the syntax and usage of the ROW_NUMBER() function for identifying duplicate rows.

Syntax and explanation of ROW_NUMBER() function

The syntax for using the ROW_NUMBER() function is as follows:

SELECT ROW_NUMBER() OVER (
    [PARTITION BY partition_expression, ... ]
    ORDER BY sort_expression [ASC | DESC], ...
) AS row_number,
column1, column2, ...
FROM table_name
WHERE conditions;

In this syntax, the “SELECT” statement selects the columns to be included in the result set, along with the ROW_NUMBER() function. The “FROM” statement specifies the table to be queried, and the “WHERE” statement specifies any conditions to filter the results.

Finally, the “OVER” clause is used to specify the order in which the rows should be numbered. The PARTITION BY clause is optional and can be used to group the rows into partitions based on one or more columns.

The ORDER BY clause is also optional and can be used to specify the order in which the rows should be grouped before the row numbers are assigned.

Example of using ROW_NUMBER() function

To demonstrate how to use the ROW_NUMBER() function to find duplicate rows, let’s use the same sample table we used in the previous section:

CREATE TABLE students (
    id int,
    name varchar(255),
    score int
);
INSERT INTO students (id, name, score)
VALUES
    (1, 'John', 85),
    (2, 'Jane', 90),
    (3, 'Bob', 75),
    (4, 'John', 85),
    (5, 'Alice', 80),
    (6, 'Bob', 75);

Suppose we want to create a ranking of students based on their “name” and “score” values, and then identify the rows that have duplicate rankings. We can use the ROW_NUMBER() function to create the ranking and then group the rows by the ranking number to find the duplicates.

The following SQL query achieves this:

SELECT name, score, ROW_NUMBER() OVER (ORDER BY name, score) as r
FROM students
GROUP BY name, score
HAVING COUNT(*) > 1;

The output from this query will be:

name score r
Bob 75 2
John 85 2
Bob 75 4
John 85 4

This result shows that there are two sets of rows that have duplicate rankings: one for students named “Bob” with a score of 75, and another for students named “John” with a score of 85.

4) Using DELETE Statement to Remove Duplicate Rows

Once duplicate rows have been identified using the ROW_NUMBER() function or the GROUP BY clause, you can use the DELETE statement to remove them from a table. The DELETE statement removes rows that meet a specified condition from a table.

Syntax and explanation of DELETE statement

The syntax for using the DELETE statement is as follows:

DELETE 
FROM table_name
WHERE condition;

In this syntax, the “DELETE FROM” statement specifies the table to be modified, and the “WHERE” statement specifies the condition that must be met for a row to be deleted.

Example of using DELETE statement

To remove the duplicate rows from the “students” table that we identified in the previous sections, we can combine the GROUP BY and ROW_NUMBER() functions to create a temporary table with the duplicate row ids. We can then use the DELETE statement to delete the duplicate rows.

The following SQL script deletes the duplicate rows from the “students” table:

WITH duplicate_ids AS (
    SELECT ROW_NUMBER() OVER (
        PARTITION BY name, score
        ORDER BY id
    ) AS RowNumber,
    id
FROM students
)
DELETE
FROM students
WHERE id IN (
    SELECT id
    FROM duplicate_ids
    WHERE RowNumber > 1
);

This query defines a common table expression (CTE) named “duplicate_ids” that creates a ranking of the rows in the “students” table based on the “name” and “score” columns. The “PARTITION BY” clause specifies the columns to group by, and the “ORDER BY” clause specifies the order in which the row numbers will be assigned.

Finally, the query deletes all rows from the “students” table where the “id” values match those from the “duplicate_ids” CTE. After executing this query, the “students” table will have only the following three rows remaining:

id name score
1 John 85
2 Jane 90
5 Alice 80

This result shows that the duplicate rows have been successfully removed from the “students” table.

Conclusion

In conclusion, identifying and removing duplicate rows from a table in SQL Server can be done using a combination of the GROUP BY clause, the ROW_NUMBER() function, and the DELETE statement. The GROUP BY clause allows you to group rows by one or more columns and count the number of rows in each group, while the ROW_NUMBER() function allows you to create a ranking of rows based on one or more columns.

Finally, the DELETE statement allows you to remove rows from a table based on a specified condition. By using these three SQL statements, you can effectively find and remove duplicate rows from your tables.

5) Using Common Table Expression (CTE) to Remove Duplicate Rows

A common table expression (CTE) in SQL Server is a temporary named result set that you can reference within a SELECT, INSERT, UPDATE, or DELETE statement. CTEs can be used to simplify complex queries and make them more readable, especially when used with recursive queries or queries that involve multiple levels of aggregation.

In this section, we’ll explore how to use a CTE to remove duplicate rows from a table in SQL Server.

Syntax and explanation of CTE

The syntax for using a CTE is as follows:

WITH cte_name (column1, column2, ...) AS (
    SELECT column1, column2, ...
FROM table_name
    WHERE conditions
)
SELECT ...
FROM cte_name
WHERE conditions;

In this syntax, “cte_name” is the name of the CTE, and “(column1, column2, …)” is a list of column names that define the column names and data types for the result set generated by the SELECT statement that follows. The SELECT statement can be any valid SELECT statement in SQL Server that has a SELECT clause and a FROM clause.

The “WHERE” clause is optional and can be used to filter the results based on one or more conditions. The final SELECT statement can reference the CTE and use it as a temporary table to generate a result set.

Example of using CTE to delete duplicate rows

To demonstrate how to use a CTE to remove duplicate rows from a table, let’s use the same sample table we used in the previous sections:

CREATE TABLE students (
    id int,
    name varchar(255),
    score int
);
INSERT INTO students (id, name, score)
VALUES
    (1, 'John', 85),
    (2, 'Jane', 90),
    (3, 'Bob', 75),
    (4, 'John', 85),
    (5, 'Alice', 80),
    (6, 'Bob', 75);

Suppose we want to delete all rows from the “students” table where there are duplicate rows based on the “name” and “score” columns. We can use a CTE to create a temporary table that contains the duplicate row ids, and then use the DELETE statement to remove those rows from the “students” table.

The following SQL query achieves this:

WITH duplicate_ids AS (
    SELECT id, ROW_NUMBER() OVER (
        PARTITION BY name, score
        ORDER BY id
    ) AS RowNumber
FROM students
)
DELETE 
FROM students
WHERE id IN (
    SELECT id
    FROM duplicate_ids
    WHERE RowNumber > 1
);

Popular Posts