Data Download:

https://indiana-my.sharepoint.com/:f:/g/personal/zhu11_iu_edu/EkFJ17EHX59LsO1Ekfc0TPkBi7CipeIHccd4wjb0CRhjzQ?e=qlt6aH

Description:

Task 4 contains 500,000 hashed records in dataset A and 500,000 hashed records in dataset B. In our published training dataset, there are 11 columns:

First Name, Last Name, Gender, SSN, Birth Date, Email, Phone, Address, Share, error, place

Record features:

In the CSV file, the first 9 columns are the features of the record. Each feature has been hashed with SHA-256.
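Because SHA-256 maps even near-identical inputs to unrelated digests, hashed features can only be compared for exact equality: a one-character error produces a completely different hash. The sketch below illustrates this with Python's hashlib. How values were encoded or normalized before hashing is not stated here, so the UTF-8 encoding, the hex output, and the helper name sha256_hex are assumptions for illustration only.

```python
import hashlib


def sha256_hex(value: str) -> str:
    """Return the SHA-256 digest of a string as lowercase hex.

    Assumption: UTF-8 encoding, no normalization, hex output.
    The actual preprocessing used for the dataset is not documented here.
    """
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


clean = sha256_hex("Smith")
same = sha256_hex("Smith")
typo = sha256_hex("Smyth")  # one-character error

print(clean == same)  # True: identical inputs give identical hashes
print(clean == typo)  # False: any error breaks the exact match
print(len(clean))     # 64 hex characters
```

In practice this means similarity measures such as edit distance cannot be applied to a hashed field; an erroneous field simply fails to match.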

10% of the records in datasets A and B are common to both. Please note that every feature except “gender” and “SSN” has its own error rate (from 2% to 35%), and each feature likewise has its own missing-value rate. The details are shown below:

Feature          % Missing value   % Error rate
First name       0                 2
Last name        0                 2
SSN              70                0
Birth Date       15                5
Email address    40                20
Tel phone #      25                35
Address          5                 15
State            10                10
Gender           1                 0

Validation features:

The last two columns will not appear in the final competition dataset; they are provided only so that participants can validate their algorithms.

Column “Share”:

This column contains either True or False.

True means the record is common to both datasets A and B. False means it is not.

Column “error”:

Each entry in this column lists the feature or features of the record that contain an error.
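As a rough illustration of how the “Share” column could be used for validation, the sketch below scores a list of predicted flags against the true flags. It only checks whether each record was correctly flagged as common; it does not check which record in the other dataset it was paired with. The function name score_against_share and the case-insensitive parsing of True/False are assumptions, not part of the dataset specification.

```python
def score_against_share(share_values, predicted_shared):
    """Compare predicted 'common record' flags with the Share column.

    share_values: strings read from the Share column (e.g. "True"/"False").
    predicted_shared: booleans produced by your own linkage algorithm.
    Assumption: Share values are parsed case-insensitively.
    """
    truth = [str(v).strip().lower() == "true" for v in share_values]
    tp = sum(1 for t, p in zip(truth, predicted_shared) if t and p)
    fp = sum(1 for t, p in zip(truth, predicted_shared) if not t and p)
    fn = sum(1 for t, p in zip(truth, predicted_shared) if t and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}


# Toy example with four records (made-up values, not real data).
share = ["True", "False", "true", "False"]
predicted = [True, True, False, False]
print(score_against_share(share, predicted))
```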

About Missing value:

Missing values have also been hashed! You can identify the hash value of “empty” as the most common element in the .csv file.
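One possible way to apply this tip is to count every feature cell in the file and take the most frequent value as the “empty” hash, then measure how often it appears in each column. The sketch below does this on a tiny in-memory sample. The column names, the placeholder values such as h1 and EMPTY, and the helper names are hypothetical; real cells would be 64-character hashes, and the input is assumed to be non-empty.

```python
import csv
import io
from collections import Counter


def find_missing_hash(rows, feature_columns):
    """Return the most common value across all feature cells."""
    counts = Counter(row[col] for row in rows for col in feature_columns)
    value, _count = counts.most_common(1)[0]
    return value


def missing_rate(rows, feature_columns, missing_hash):
    """Fraction of rows per column whose value equals the missing hash."""
    total = len(rows)
    return {
        col: sum(1 for row in rows if row[col] == missing_hash) / total
        for col in feature_columns
    }


# Hypothetical miniature file; short tokens stand in for real hashes.
sample = io.StringIO(
    "first,ssn,email\n"
    "h1,EMPTY,h5\n"
    "h2,EMPTY,EMPTY\n"
    "h3,h4,h6\n"
)
rows = list(csv.DictReader(sample))
columns = ["first", "ssn", "email"]

missing = find_missing_hash(rows, columns)
rates = missing_rate(rows, columns, missing)
print(missing)
print(rates)
```

The per-column rates obtained this way can be compared with the missing-value percentages listed in the table as a sanity check.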