https://github.com/DeepChainBio/bio-datasets


Brainstorm on dataset formats and workflow, June 16, 2021

To discuss

Ideas

Apache Parquet = columnar storage format designed for efficient, performant storage and retrieval of flat data, compared to row-based files such as CSV or TSV
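
To make the row-based vs columnar distinction concrete, here is a minimal standard-library sketch. It is only a conceptual illustration, not Parquet's actual on-disk format (which also adds row groups, encodings and compression). The toy records and the helper names `lengths_from_rows` and `lengths_from_columns` are invented for this example.

```python
import csv
import io

# Hypothetical toy records, invented for this sketch.
rows = [
    {"id": "P1", "sequence": "MKT", "length": 3},
    {"id": "P2", "sequence": "MKTAY", "length": 5},
    {"id": "P3", "sequence": "MA", "length": 2},
]

# Row-based layout (CSV): each line holds every field of one record.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "sequence", "length"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()


def lengths_from_rows(text):
    """Parse every field of every row just to extract one column."""
    reader = csv.DictReader(io.StringIO(text))
    return [int(record["length"]) for record in reader]


# Columnar layout: all values of one column are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def lengths_from_columns(cols):
    """Return one column directly, without touching the others."""
    return cols["length"]
```

Both helpers return the same values. The difference is that the row-based version has to parse the `id` and `sequence` fields too, while the columnar version never touches them, which is the property that matters when a dataset has many columns and a job needs only a few.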

Apache Arrow = in-memory columnar format for analytics; works with different file formats such as CSV or Parquet

Petastorm = enables single-machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. Petastorm supports popular Python-based machine learning (ML) frameworks such as TensorFlow, PyTorch, and PySpark.