Base format
Do we want an easy way to load data into memory? —> replace .to_npy_arrays with a generator-style structure? Apache Arrow?
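A minimal sketch of what a generator-style replacement could look like with pyarrow, streaming record batches from a Parquet file instead of materializing everything at once (the function name, file path, and column names below are placeholders, not an existing API):

```python
import pyarrow.parquet as pq

def iter_numpy_batches(path, columns=None, batch_size=1024):
    """Yield one dict of numpy arrays per record batch, instead of
    materializing the whole file the way .to_npy_arrays would."""
    parquet_file = pq.ParquetFile(path)
    for batch in parquet_file.iter_batches(batch_size=batch_size, columns=columns):
        yield {name: col.to_numpy(zero_copy_only=False)
               for name, col in zip(batch.schema.names, batch.columns)}

# Consume the dataset one batch at a time.
for arrays in iter_numpy_batches("dataset.parquet", columns=["id", "text"]):
    ...
```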
Disk storage: should we find a way to keep all data (columns and embeddings) in a single format? —> Apache Parquet; I'm not sure it supports having array data (embeddings) in it
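For what it's worth, Parquet does support nested list columns, so storing embeddings as list<float32> next to regular columns looks feasible; a quick sketch with pyarrow (file and column names are made up):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Embeddings stored as a list<float32> column alongside regular columns.
table = pa.table({
    "id": pa.array([0, 1], type=pa.int64()),
    "text": pa.array(["hello", "world"]),
    "embedding": pa.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
                          type=pa.list_(pa.float32())),
})
pq.write_table(table, "with_embeddings.parquet")

# Round-trip check: the embedding column comes back as nested lists.
print(pq.read_table("with_embeddings.parquet").schema)
```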
Do we want an easy way to load data into tf.data.Dataset or torch.utils.data.DataLoader?
—> petastorm library?
—> could be one way to do it, i.e. have a framework-agnostic Parquet storage into which we convert our dataset, and then load it easily into TF or torch datasets.
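A rough illustration of that path on the torch side: petastorm's make_batch_reader can read a plain ("vanilla") Parquet store, and petastorm.pytorch.DataLoader wraps the reader for torch (the dataset URL and batch size are placeholders):

```python
from petastorm import make_batch_reader
from petastorm.pytorch import DataLoader

# make_batch_reader works on plain Parquet stores, no petastorm-specific schema needed.
with DataLoader(make_batch_reader("file:///tmp/our_dataset"), batch_size=32) as loader:
    for batch in loader:
        ...  # batch is a dict of torch tensors keyed by column name
```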
Arrow format / Parquet, especially to avoid having too large files in .cache/; an explanation of Parquet/Arrow can be found here.
Apache Parquet = designed as an efficient, performant flat columnar storage format for data, compared to row-based files like CSV or TSV
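One practical consequence of the columnar layout, sketched below: we can read only the columns we need without parsing every row, which row-based CSV/TSV can't do (file and column names are placeholders):

```python
import pyarrow.parquet as pq

# Load just the embedding column; the rest of the file is never read.
embeddings_only = pq.read_table("dataset.parquet", columns=["embedding"])
```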
Apache Arrow = in-memory analytics; works with different file formats such as CSV or Parquet
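e.g., a minimal sketch of Arrow as the in-memory layer bridging formats (file names are placeholders):

```python
import pyarrow.csv as csv
import pyarrow.parquet as pq

# Read a CSV into an in-memory Arrow table, then persist it as Parquet.
table = csv.read_csv("data.csv")
pq.write_table(table, "data.parquet")
```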
Petastorm = enables single-machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. Petastorm supports popular Python-based machine learning (ML) frameworks such as TensorFlow, PyTorch, and PySpark.
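To complete the picture on the TF side (the torch side is sketched above), petastorm ships a tf_utils helper that turns a reader into a tf.data.Dataset; again a rough sketch with a placeholder URL:

```python
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

with make_batch_reader("file:///tmp/our_dataset") as reader:
    dataset = make_petastorm_dataset(reader)  # a tf.data.Dataset of named tuples
    for batch in dataset:
        ...  # each field maps to a column of the Parquet store
```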