What is deduplication and how does it work?

What is deduplication?

Deduplication is the process of identifying duplicate records and merging them into a single, well-defined version (sometimes called the “golden record” or the “single version of the truth”).

Where can I use deduplication in Octolis?

In Octolis, deduplication is used to unify data coming from different Sources and build Datasets that are free of duplicate records.

How does deduplication work in Octolis?

How to identify duplicate records?

Octolis enables you to identify duplicate records based on a set of columns (aka a deduplication key).

Let’s take a few examples:

You want to treat contacts that have the same email as duplicates.
You want to treat contacts that have the same email and the same phone as duplicates.

Each time a new record enters the Dataset, we compare its deduplication key with those of all records already in the Dataset to determine whether it is a duplicate.

If several records enter the Dataset at once, we also identify duplicates among them.
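To make the matching rule concrete, here is a minimal Python sketch that groups records sharing the same deduplication key. The function name `group_by_dedup_key` and the record layout are illustrative assumptions, not Octolis internals, and the sketch does not cover how empty key values are treated.

```python
from collections import defaultdict

def group_by_dedup_key(records, key_columns):
    """Group records whose values match on every column of the deduplication key."""
    groups = defaultdict(list)
    for record in records:
        # The key is the tuple of values for the chosen columns.
        key = tuple(record.get(col) for col in key_columns)
        groups[key].append(record)
    return dict(groups)

# Hypothetical contacts, used only for illustration.
contacts = [
    {"id": 1, "email": "ann@example.com", "phone": "111"},
    {"id": 2, "email": "ann@example.com", "phone": "222"},
    {"id": 3, "email": "bob@example.com", "phone": "333"},
]

# Key = email only: contacts 1 and 2 fall into the same group.
by_email = group_by_dedup_key(contacts, ["email"])

# Key = email + phone: no two contacts match on both columns.
by_email_and_phone = group_by_dedup_key(contacts, ["email", "phone"])
```

Any group holding more than one record is a set of duplicates. Adding columns to the key makes matching stricter, which is why the second call finds no duplicates.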

How does the merge work?

Each time Octolis identifies duplicate records (based on the deduplication key you set), they are merged together, resulting in a single record in the Dataset.

We take the values of the most recent duplicate record to build the final record, except for Source key columns.

We will soon enable you to use more custom rules to build each column of the final record.
We always preserve a stable association between a deduplicated record and each Source key it was first associated with.
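The "most recent record wins" rule can be sketched as follows. This is only an illustration under stated assumptions: the function `merge_duplicates`, the `updated_at` timestamp column, and the choice to collect Source key values into a `<column>_list` field are hypothetical and do not describe the actual Octolis implementation.

```python
def merge_duplicates(duplicates, source_key_columns, timestamp_column="updated_at"):
    """Build one record from a group of duplicates (illustrative only)."""
    # Oldest first, so the last element is the most recent duplicate.
    ordered = sorted(duplicates, key=lambda r: r[timestamp_column])

    # Regular columns: copy the values of the most recent duplicate.
    merged = dict(ordered[-1])

    # Source key columns are excluded from that rule (assumption: we keep
    # every Source key value seen in the group, oldest first).
    for col in source_key_columns:
        merged.pop(col, None)
        merged[f"{col}_list"] = [r[col] for r in ordered if r.get(col) is not None]
    return merged

# Hypothetical duplicates; ISO date strings sort chronologically.
duplicates = [
    {"crm_id": "c-2", "email": "ann@example.com", "name": "Ann B.", "updated_at": "2024-03-01"},
    {"crm_id": "c-1", "email": "ann@example.com", "name": "Ann", "updated_at": "2024-01-15"},
]

golden = merge_duplicates(duplicates, ["crm_id"])
```

Here `golden["name"]` comes from the record updated in March, while both `crm_id` values remain traceable from the merged record.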

We also automatically add several system columns to the deduplicated records of the Dataset:

__master-Id__ to uniquely identify each deduplicated record (stable over time).
__modified-At__ to state when each record was last updated in the Sources.
__<SourceName>_<SourcekeyColumn>_list__ (for each Source key column) to list the Source key values of all the duplicates that were merged into the deduplicated record.
__created-At__ to state when each record was created in the DB table (stable over time).
__updated-At__ to state when each record was updated in the DB table.

How to use the output of deduplication in my systems?

Thanks to Octolis, you can be sure that only deduplicated records are synced to your systems.

What now? You might want to do some data stewarding to clean duplicate records out of your systems.

For this, we advise you to map, in a Sync to your system, the __master-Id__ and the list of Source key values of all the duplicates that were merged into the deduplicated record.
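Once both fields are available in your system, a cleanup plan could be derived along these lines. This is a hedged sketch: the field names `master_id` and `crm_contact_id_list`, the function `plan_cleanup`, and the rule of keeping the first Source key in each list are all hypothetical choices made for illustration.

```python
def plan_cleanup(synced_rows, master_id_field, key_list_field):
    """List, for each deduplicated record, which system records are duplicates."""
    plan = []
    for row in synced_rows:
        keys = row[key_list_field]
        if len(keys) < 2:
            # Only one Source key: nothing to clean for this record.
            continue
        # Assumption: keep the first key, flag the others as duplicates.
        survivor, *duplicates = keys
        plan.append({
            "master_id": row[master_id_field],
            "keep": survivor,
            "duplicates": duplicates,
        })
    return plan

# Hypothetical rows as they might look after a Sync.
synced_rows = [
    {"master_id": "m-1", "crm_contact_id_list": ["c-1", "c-2", "c-3"]},
    {"master_id": "m-2", "crm_contact_id_list": ["c-9"]},
]

cleanup = plan_cleanup(synced_rows, "master_id", "crm_contact_id_list")
```

Which record to keep is a business decision; the sketch only shows that the two mapped fields carry enough information to make it.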

We will later offer a native capability to tag duplicates as “Duplicate of Id XXX” in a dedicated field.