Deduplication is the process of identifying duplicate records and merging them into a single, well-defined version (sometimes called the “golden record” or the “single version of the truth”).
In Octolis, deduplication is used to unify data coming from different Sources and build Datasets that are free of duplicate records.
Octolis lets you identify duplicate records based on a set of columns, known as a deduplication key.
Let’s take a few examples:
Each time a new record enters the Dataset, Octolis compares its deduplication key with those of all records already in the Dataset to determine whether it is a duplicate.
If several records enter the Dataset all at once, we also make sure to identify duplicates amongst them.
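The matching described above can be sketched in a few lines of Python. This is only an illustration, not Octolis's actual implementation: the function names `dedup_key` and `find_duplicates`, the record layout (plain dictionaries), and the use of exact value matching are all assumptions made for the example.

```python
from collections import defaultdict


def dedup_key(record, key_columns):
    """Build the deduplication key of a record from the chosen columns.

    Assumption: values are compared exactly, with no normalization.
    """
    return tuple(record.get(column) for column in key_columns)


def find_duplicates(existing, incoming, key_columns):
    """Group records sharing the same deduplication key.

    Existing and incoming records go through the same grouping, so duplicates
    are found both against the Dataset and within the incoming batch itself.
    """
    groups = defaultdict(list)
    for record in list(existing) + list(incoming):
        groups[dedup_key(record, key_columns)].append(record)
    # Keep only the keys shared by more than one record.
    return {key: records for key, records in groups.items() if len(records) > 1}


# Hypothetical data: one record already in the Dataset, three arriving together.
existing = [{"id": 1, "email": "a@x.com"}]
incoming = [
    {"id": 2, "email": "a@x.com"},
    {"id": 3, "email": "b@x.com"},
    {"id": 4, "email": "b@x.com"},
]
duplicates = find_duplicates(existing, incoming, ["email"])
```

Here record 2 duplicates an existing record, while records 3 and 4 duplicate each other inside the same batch.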
Each time Octolis identifies duplicate records (based on the deduplication key you set), they are merged together, resulting in a single record in the Dataset.
We take the values of the most recent duplicate record to build the final record, except for Source key columns.
We also automatically add several system columns to the deduplicated records in the Dataset:
__master-Id__ to uniquely identify each deduplicated record (stable over time).
__modified-At__ to state when each record was last updated in the Sources.
__<SourceName>_<SourcekeyColumn>_list__ (for each Source key column) to list the Source key values of all duplicates that were merged into the deduplicated record.
__created-At__ to state when each record was created in the DB table (stable over time).
__updated-At__ to state when each record was last updated in the DB table.
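As a rough illustration of the merge rule and the system columns, here is a minimal Python sketch. It is not Octolis's code: the function `merge_duplicates`, the input field `modified_at`, and the way the Source key column is dropped from the merged record in favor of the list column are assumptions. The system column names are copied as written above; whether the double underscores are part of the real column names is not confirmed.

```python
import uuid
from datetime import datetime, timezone


def merge_duplicates(duplicates, source_name, source_key_column,
                     master_id=None, created_at=None, now=None):
    """Merge duplicate records into one deduplicated record.

    Assumptions: each input record carries a `modified_at` datetime, and all
    records come from a single Source with one Source key column.
    """
    now = now or datetime.now(timezone.utc)
    # The most recent duplicate provides the values of the final record.
    latest = max(duplicates, key=lambda record: record["modified_at"])
    merged = {
        column: value
        for column, value in latest.items()
        if column not in (source_key_column, "modified_at")
    }
    # Stable identifier: reuse the existing one when the record already exists.
    merged["__master-Id__"] = master_id or str(uuid.uuid4())
    merged["__modified-At__"] = latest["modified_at"]
    # Source keys are not overwritten; every duplicate's key is kept in a list.
    merged[f"__{source_name}_{source_key_column}_list__"] = [
        record[source_key_column] for record in duplicates
    ]
    merged["__created-At__"] = created_at or now
    merged["__updated-At__"] = now
    return merged


# Hypothetical duplicates coming from a Source named "crm".
older = {"crm_id": "A1", "email": "a@x.com", "city": "Paris",
         "modified_at": datetime(2023, 1, 1, tzinfo=timezone.utc)}
newer = {"crm_id": "A2", "email": "a@x.com", "city": "Lyon",
         "modified_at": datetime(2023, 6, 1, tzinfo=timezone.utc)}
merged = merge_duplicates([older, newer], "crm", "crm_id", master_id="m-1")
```

In this example the merged record takes `city` from the newer record, while both Source keys survive in the list column.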
Octolis ensures that only deduplicated records are synced to your systems.
What now? You might want to do some data stewarding to clean duplicate records out of your systems.
For this, we advise you to map, in a Sync towards your system, the __master-Id__ and the list of Source key values of all duplicates that were merged into the deduplicated record.
We will later offer a native capability to tag duplicates as “Duplicate of Id XXX” in a dedicated field.
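Until that capability exists, the master ID and the list of Source key values are enough to derive such tags yourself. The sketch below is purely hypothetical: the function `stewarding_tags`, the output layout, and the choice of the first Source key in the list as the record to keep are assumptions, not Octolis behavior.

```python
def stewarding_tags(master_id, source_keys, survivor=None):
    """Build one tag per duplicate that a data steward could clean up.

    Assumption: unless stated otherwise, the first Source key in the list
    is the record to keep, and every other key is tagged as its duplicate.
    """
    if not source_keys:
        return []
    keep = survivor if survivor is not None else source_keys[0]
    return [
        {
            "source_key": key,
            "master_id": master_id,
            "tag": f"Duplicate of Id {keep}",
        }
        for key in source_keys
        if key != keep
    ]


# Hypothetical deduplicated record built from three Source records.
tags = stewarding_tags("m-1", ["A1", "A2", "A3"])
```

Each returned entry points at one Source record to review, so a record with a single Source key produces no tags at all.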