Base format
Do we want an easy way to load data into memory? —> replace .to_npy_arrays with a generator-style structure? Apache Arrow?
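A minimal sketch of what a generator-style replacement could look like with pyarrow, streaming record batches from a Parquet file instead of materializing everything at once (the function name, file path, and column names below are placeholders, not an existing API):

```python
import pyarrow.parquet as pq

def iter_numpy_batches(path, columns=None, batch_size=1024):
    """Yield one dict of numpy arrays per record batch, instead of
    materializing the whole file the way .to_npy_arrays would."""
    parquet_file = pq.ParquetFile(path)
    for batch in parquet_file.iter_batches(batch_size=batch_size, columns=columns):
        yield {name: col.to_numpy(zero_copy_only=False)
               for name, col in zip(batch.schema.names, batch.columns)}

# Consume the dataset one batch at a time.
for arrays in iter_numpy_batches("dataset.parquet", columns=["id", "text"]):
    ...
```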
Disk storage: should we find a way to keep all data (columns and embeddings) in a single format? —> Apache Parquet; I'm not sure it supports having array data (embeddings) in it
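For what it's worth, Parquet does support nested list columns, so storing embeddings as list<float32> next to regular columns looks feasible; a quick sketch with pyarrow (file and column names are made up):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Embeddings stored as a list<float32> column alongside regular columns.
table = pa.table({
    "id": pa.array([0, 1], type=pa.int64()),
    "text": pa.array(["hello", "world"]),
    "embedding": pa.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
                          type=pa.list_(pa.float32())),
})
pq.write_table(table, "with_embeddings.parquet")

# Round-trip check: the embedding column comes back as nested lists.
print(pq.read_table("with_embeddings.parquet").schema)
```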
Do we want an easy way to load data into tf.data.Dataset or torch.utils.data.DataLoader?
—> petastorm library?
—> could be one way to do it, i.e. have a framework-agnostic Parquet storage into which we convert our dataset, and then load it easily into TF or torch datasets.
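A rough illustration of that path on the torch side: petastorm's make_batch_reader can read a plain ("vanilla") Parquet store, and petastorm.pytorch.DataLoader wraps the reader for torch (the dataset URL and batch size are placeholders):

```python
from petastorm import make_batch_reader
from petastorm.pytorch import DataLoader

# make_batch_reader works on plain Parquet stores, no petastorm-specific schema needed.
with DataLoader(make_batch_reader("file:///tmp/our_dataset"), batch_size=32) as loader:
    for batch in loader:
        ...  # batch is a dict of torch tensors keyed by column name
```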
Arrow format / Parquet, especially to avoid having too large files in .cache/; an explanation of Parquet/Arrow can be found here.
Apache Parquet = designed as an efficient, performant flat columnar storage format for data, compared to row-based files like CSV or TSV
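One practical consequence of the columnar layout, sketched below: we can read only the columns we need without parsing every row, which row-based CSV/TSV can't do (file and column names are placeholders):

```python
import pyarrow.parquet as pq

# Load just the embedding column; the rest of the file is never read.
embeddings_only = pq.read_table("dataset.parquet", columns=["embedding"])
```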
Apache Arrow = in-memory analytics; works with different file formats such as CSV or Parquet
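e.g., a minimal sketch of Arrow as the in-memory layer bridging formats (file names are placeholders):

```python
import pyarrow.csv as csv
import pyarrow.parquet as pq

# Read a CSV into an in-memory Arrow table, then persist it as Parquet.
table = csv.read_csv("data.csv")
pq.write_table(table, "data.parquet")
```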
Petastorm = enables single-machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. Petastorm supports popular Python-based machine learning (ML) frameworks such as TensorFlow, PyTorch, and PySpark.
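To complete the picture on the TF side (the torch side is sketched above), petastorm ships a tf_utils helper that turns a reader into a tf.data.Dataset; again a rough sketch with a placeholder URL:

```python
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

with make_batch_reader("file:///tmp/our_dataset") as reader:
    dataset = make_petastorm_dataset(reader)  # a tf.data.Dataset of named tuples
    for batch in dataset:
        ...  # each field maps to a column of the Parquet store
```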