Datasynthesis

1. Overview

A library for generating synthetic tabular data and evaluating the quality of the generated data.

The library was developed as part of the final qualification work at the ITIS Institute, Kazan Federal University.

The main workflow of the library

[Figure: diagram of the main workflow]

Library components

[Figure: diagram of the library components]

Synthesizer

A class for generating synthetic tabular data. It is built on the open-source Synthetic Data Vault (SDV) Python library, which is widely regarded as a leading tool for this purpose. SDV produces realistic data with high statistical fidelity and scales to large datasets. It also supports table metadata configuration, which specifies the type and format of the generated data. Its user-friendly APIs and comprehensive documentation make it easy to integrate and use.

We compared four SDV synthesizers on different types of datasets (see the Model comparison section).

Based on that comparison, we developed an algorithm that chooses the best generation method according to the characteristics of the input dataset. This algorithm is the core of the synthesizer.

Algorithm to choose best generation method depending on dataset
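
The general idea of choosing a generation method from dataset characteristics can be illustrated with a small sketch. This is not the library's actual algorithm: the `DatasetProfile` class, the `profile_rows` and `choose_synthesizer` helpers, the chosen characteristics, and every threshold below are hypothetical and exist only for illustration. The returned strings are names of SDV single-table synthesizer classes, used here on the assumption that they are among the candidates being compared.

```python
from dataclasses import dataclass


@dataclass
class DatasetProfile:
    """Hypothetical summary of an input dataset (illustration only)."""
    n_rows: int
    n_columns: int
    categorical_fraction: float  # share of categorical columns, from 0.0 to 1.0


def profile_rows(rows):
    """Build a profile from a list of dict rows (column name -> value)."""
    columns = list(rows[0].keys()) if rows else []

    def is_categorical(col):
        # Assumption: a column is categorical if it holds any str or bool value.
        return any(
            isinstance(r[col], (str, bool)) for r in rows if r.get(col) is not None
        )

    n_categorical = sum(is_categorical(c) for c in columns)
    fraction = n_categorical / len(columns) if columns else 0.0
    return DatasetProfile(len(rows), len(columns), fraction)


def choose_synthesizer(profile):
    """Return the name of a synthesizer class for the given profile.

    The rules and thresholds are made up for this sketch; the real
    selection logic comes from the model comparison.
    """
    if profile.n_rows < 1000:
        # Assumed rule: prefer a fast statistical model for small datasets.
        return "GaussianCopulaSynthesizer"
    if profile.categorical_fraction >= 0.5:
        # Assumed rule: prefer a GAN-based model for mostly categorical data.
        return "CTGANSynthesizer"
    return "TVAESynthesizer"


rows = [{"age": 30, "city": "Kazan"}, {"age": 41, "city": "Moscow"}]
print(choose_synthesizer(profile_rows(rows)))
```

A rule-based selector of this shape keeps the decision easy to inspect: each branch maps one measurable property of the dataset to one method, so the rules can be revised as the comparison results change.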

QualityEvaluator