Contacts Us

Concepts

Connectors

Connectors listing

Audiences

Organize your Audiences

Create a Dataset

Create a SQL Audience

Add Source to a Dataset

Transform an Audience

Syncs

Create new sync

Resources

FAQ

Scoring Library

đź““ API Docs

What is deduplication and how does it work?

What is deduplication?


Deduplication is the process of identifying duplicate records and merging them together into a single, well-defined version (sometimes called “golden record”, or the "single version of the truth”).

Where can I use deduplication in Octolis?


In Octolis, deduplication is used to unify data coming from different Sources and build Datasets that are free of duplicate records.

How does deduplication work in Octolis?


How to identify duplicate records?

Octolis enables you to identify duplicate records based on a set of columns (aka a deduplication key).

Let’s take a few examples:

Each time a new record enters the Dataset, we will compare its deduplication key to the ones of all records that already exist in the Dataset, and identify if it is a duplicate.

If several records enter the Dataset all at once, we also make sure to identify duplicates amongst them.

How does the merge work?

Each time Octolis identifies duplicate records (based on the deduplication key you set), they are merged together, resulting in a single record in the Dataset.

We take the values of the most recent duplicate record to build the final record, except for Source key columns (‣ ).

We also automatically add several system columns to the Dataset deduplicated records:

How to use the output of deduplication in my systems?


Thanks to the work of Octolis, you are ensured that only deduplicated records will be synced to your systems.

What now? You might want to do some data stewarding to clean your systems from duplicate records.

For this, we advise you to map in a Sync towards your system the __master-Id__ and the list of the Source key values of all duplicates the deduplicated record is resulting from.

We will later offer some native capability of tagging duplicates as Duplicate of Id XXX in a dedicated field.