Remove Duplicate Rows

Description

The Remove Duplicate Rows activity filters out duplicate entries from a dataset by evaluating values in a specified column. It preserves the first occurrence of each unique value and removes all subsequent duplicates, helping to clean and normalize data for further processing.

This is particularly useful when data merges or imports have introduced duplicate records and only the first occurrence of each should be retained.

Use this activity to:

Clean data before analysis or export
Remove redundant records during ETL processing
Prepare datasets for machine learning or reporting by ensuring uniqueness

Use case: In a CRM export where a customer may appear multiple times because of recent activity, use this activity to deduplicate by Email or Customer ID, retaining only the first row for each customer in input order.

Input

Type	Status
Data	Required

Output

Output Type	Format	Description
Data	Table	The cleaned dataset, containing only the first row for each distinct value in the configured column.

Configuration Fields

Field Name	Description
Column Name	The column used to detect duplicates. If two or more rows share the same value in this column, only the first row is retained.

Rows with an empty or null value in this column are treated as duplicates of one another only when those values are exactly identical.
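
The keep-first rule described above can be sketched in plain Python. This is a minimal illustration, not the activity's actual implementation: the function name remove_duplicate_rows and the list-of-dictionaries row format are assumptions made for the example. It also assumes that a null (None) and an empty string count as two distinct values, which is one possible reading of "exactly identical".

```python
def remove_duplicate_rows(rows, column_name):
    """Keep the first row for each distinct value in column_name.

    Hypothetical helper for illustration; not the activity's real code.
    """
    seen = set()      # values already encountered in the chosen column
    result = []
    for row in rows:
        key = row[column_name]
        if key in seen:
            continue  # a row with this value was already kept, so skip it
        seen.add(key)
        result.append(row)
    return result


rows = [
    {"ID": 1, "Email": "a@example.com"},
    {"ID": 2, "Email": None},             # null value
    {"ID": 3, "Email": ""},               # empty string, assumed distinct from null
    {"ID": 4, "Email": "a@example.com"},  # repeats the value in row 1
    {"ID": 5, "Email": None},             # repeats the null in row 2
]

deduplicated = remove_duplicate_rows(rows, "Email")
kept_ids = [row["ID"] for row in deduplicated]
print(kept_ids)  # [1, 2, 3]
```

Under these assumptions, rows 4 and 5 are dropped because their Email values exactly match rows already kept, while the null and the empty string each survive once.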

Sample Input

ID	Name	Age	City
101	John	25	New York
102	Alice	30	Chicago
103	John	25	New York
104	Bob	40	Boston
105	Alice	30	Chicago

In this example, the rows with ID 103 and ID 105 duplicate the rows with ID 101 and ID 102, respectively, based on the Name column.

Sample Configuration

Field	Value
Column Name	Name

Sample Output

ID	Name	Age	City
101	John	25	New York
102	Alice	30	Chicago
104	Bob	40	Boston

The duplicate rows for John and Alice were removed, keeping only the first occurrence based on their appearance in the input data.
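
As a cross-check, the sample above can be reproduced outside the product with a short Python script. This is only a sketch using the standard library: it assumes the input is available as tab-separated text, and the variable names (SAMPLE_INPUT, seen_names, output_rows) are invented for the example.

```python
import csv
import io

# The sample input from this page, as tab-separated text (assumed format).
SAMPLE_INPUT = (
    "ID\tName\tAge\tCity\n"
    "101\tJohn\t25\tNew York\n"
    "102\tAlice\t30\tChicago\n"
    "103\tJohn\t25\tNew York\n"
    "104\tBob\t40\tBoston\n"
    "105\tAlice\t30\tChicago\n"
)

reader = csv.DictReader(io.StringIO(SAMPLE_INPUT), delimiter="\t")

seen_names = set()   # Name values already kept
output_rows = []
for row in reader:
    if row["Name"] not in seen_names:
        seen_names.add(row["Name"])
        output_rows.append(row)  # first occurrence of this Name

output_ids = [row["ID"] for row in output_rows]
print(output_ids)  # ['101', '102', '104']
```

The surviving IDs match the Sample Output table: the input order is preserved, and each Name appears once.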