Remove duplicate rows
Description
The Remove Duplicate Rows activity filters out duplicate entries from a dataset by evaluating values in a specified column. It preserves the first occurrence of each unique value and removes all subsequent duplicates, helping to clean and normalize data for further processing.
This is particularly useful when data merges or imports have introduced duplicate records and only the first instance of each record, in input order, should be retained.
Use this activity to:
- Clean data before analysis or export
- Remove redundant records during ETL processing
- Prepare datasets for machine learning or reporting by ensuring uniqueness
Use case: In a CRM export where customers may appear multiple times due to recent activity, use this activity to deduplicate by Email or Customer ID, retaining only the first row for each customer as it appears in the input.
Input
| Type | Status |
|---|---|
| Data | Required |
Output
| Output Type | Format | Description |
|---|---|---|
| Data | Table | The cleaned dataset with only the first instance of each duplicate retained. |
Configuration Fields
| Field Name | Description |
|---|---|
| Column Name | The column used to detect duplicates. If two or more rows share the same value in this column, only the first row is retained and the rest are removed. |
Rows whose value in this column is empty or null are treated as duplicates of each other only when those values are exactly identical.
Sample Input
| ID | Name | Age | City |
|---|---|---|---|
| 101 | John | 25 | New York |
| 102 | Alice | 30 | Chicago |
| 103 | John | 25 | New York |
| 104 | Bob | 40 | Boston |
| 105 | Alice | 30 | Chicago |
In this example, the rows with IDs 103 and 105 are duplicates of the rows with IDs 101 and 102, respectively, based on the Name column.
Sample Configuration
| Field | Value |
|---|---|
| Column Name | Name |
Sample Output
| ID | Name | Age | City |
|---|---|---|---|
| 101 | John | 25 | New York |
| 102 | Alice | 30 | Chicago |
| 104 | Bob | 40 | Boston |
The duplicate rows for John and Alice were removed, keeping only the first occurrence based on their appearance in the input data.
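The keep-first behavior shown in this sample can be sketched in plain Python. This is only an illustration of the logic described above, not the activity's actual implementation: the function name `remove_duplicate_rows` and the list-of-dictionaries row format are assumptions made for the example, and values are compared by exact equality.

```python
def remove_duplicate_rows(rows, column_name):
    """Keep the first row for each distinct value in column_name.

    Hypothetical helper for illustration; rows is a list of dicts.
    """
    seen = set()      # values of column_name already encountered
    result = []
    for row in rows:  # iterate in input order so the first occurrence wins
        key = row[column_name]
        if key in seen:
            continue  # a row with this value was already kept; drop this one
        seen.add(key)
        result.append(row)
    return result


# Sample Input from this page
sample_input = [
    {"ID": 101, "Name": "John", "Age": 25, "City": "New York"},
    {"ID": 102, "Name": "Alice", "Age": 30, "City": "Chicago"},
    {"ID": 103, "Name": "John", "Age": 25, "City": "New York"},
    {"ID": 104, "Name": "Bob", "Age": 40, "City": "Boston"},
    {"ID": 105, "Name": "Alice", "Age": 30, "City": "Chicago"},
]

# Sample Configuration: Column Name = Name
deduplicated = remove_duplicate_rows(sample_input, "Name")
print([row["ID"] for row in deduplicated])  # [101, 102, 104]
```

Because only the configured column is compared, rows that differ in other columns (for example, two different people named John) would still be collapsed into one; choosing a column that uniquely identifies a record, such as Email or Customer ID, avoids that.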