Blog by Sabri Eyuboglu, Arjun Desai, Karan Goel
This blog post introduces Meerkat, a new Python library for wrangling complex datasets across stages of the ML lifecycle. You can find the project on GitHub.
Data is the lifeblood of machine learning. From training and validation data to predictions, embeddings, metadata, and more, data drives every part of the machine learning development process. Organizing and managing all of this data is challenging, but it's critical for making machine learning work in practice.
This blog post introduces Meerkat, a new data library that we're building to help practitioners and researchers wrangle their data. Fair warning: Meerkat is pretty young (the project is less than 3 months old), and it certainly doesn't address all of the data problems that arise in ML. In this post, we'll highlight the problems that we find particularly exciting and that Meerkat is designed to address. We'll also talk about where Meerkat stands today and where we hope it's headed.
Meerkat is motivated by a few trends in ML, many of which have directly impacted our own research:
So what can Meerkat do? Meerkat provides the `DataPanel` abstraction, a data structure that takes inspiration from Pandas and the `DataFrame`. The `DataPanel` facilitates interactive dataset manipulation, can house diverse data modalities, and lets you evaluate models carefully with Robustness Gym. We built `DataPanel`s like `DataFrame`s because they're naturally interactive and work seamlessly across development contexts: Jupyter Notebooks, Python scripts, and Streamlit. Like them, we hope the Meerkat `DataPanel` can be an interactive data substrate for modern machine learning across all stages of the ML lifecycle.
Of course, there are other data structures out there that we could use to manage our machine learning data. Why don't they suffice?
Below we outline a set of desiderata informed by use cases encountered throughout the ML development pipeline. Popular data structures typically fall into two camps: (i) those that support complex data types and multiple modalities (e.g. PyTorch Dataset, TensorFlow Dataset – desiderata 1-3) and (ii) those that support manipulation and interaction (e.g. Pandas DataFrame – desiderata 4-6). With the Meerkat `DataPanel`, we support all of these desiderata in one data structure.
`__getitem__` and `__len__` implementations.
`DataFrame` does not support datasets that are larger-than-RAM, with extensions such as Dask and Modin explicitly designed to ease this restriction.
`__getitem__`, they implicitly support multi-modal data. However, an interactive, user-friendly data structure would benefit from making the support for multi-modality more explicit (e.g. by storing each modality in a separate column, each with an assigned type).
[`apply`](<https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html>) and [`map`](<https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html>) allow users to create new columns from existing ones and operations like [`concat`](<https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html?highlight=concat#pandas.concat>) and [`merge`](<https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html>) (i.e. database-style join) facilitate organizing all these columns with a relational data model. We'd like our data structure to bring these data wrangling features of Pandas to large datasets and complex data types.
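
To make the two camps concrete, here is a minimal sketch. It uses plain Python and Pandas only, not the Meerkat API. The `ToyDataset` class, the file names, and the column names are all hypothetical, invented for illustration. The class mixes modalities implicitly inside `__getitem__`, while the `DataFrame` version gives each modality its own column and uses `map`, `apply`, `concat`, and `merge` to wrangle them.

```python
import pandas as pd


class ToyDataset:
    """Hypothetical map-style dataset: anything with __getitem__ and __len__."""

    def __init__(self, image_paths, notes, labels):
        self.image_paths = image_paths
        self.notes = notes
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Modalities are mixed implicitly inside one returned example.
        return {
            "image_path": self.image_paths[idx],
            "note": self.notes[idx],
            "label": self.labels[idx],
        }


dataset = ToyDataset(
    image_paths=["a.png", "b.png", "c.png"],
    notes=["clear scan", "blurry scan", "clear scan"],
    labels=[0, 1, 0],
)

# The same data with one column per modality, which makes each one explicit.
df = pd.DataFrame([dataset[i] for i in range(len(dataset))])

# Series.map: derive a new column from a single existing column.
df["label_name"] = df["label"].map({0: "negative", 1: "positive"})

# DataFrame.apply with axis=1: derive a new column from several columns per row.
df["summary"] = df.apply(lambda row: f"{row['image_path']}: {row['note']}", axis=1)

# pd.concat: append more rows that share the same columns.
extra = pd.DataFrame({"image_path": ["d.png"], "note": ["clear scan"], "label": [1]})
combined = pd.concat([df[["image_path", "note", "label"]], extra], ignore_index=True)

# DataFrame.merge: database-style join against a separate table of predictions.
preds = pd.DataFrame({"image_path": ["a.png", "b.png", "d.png"], "pred": [0, 0, 1]})
joined = combined.merge(preds, on="image_path", how="left")
```

Note that the left join keeps every row of `combined`, so an example without a prediction (`c.png` here) ends up with a missing value in `pred` rather than being dropped. Everything in this sketch lives in memory and holds only simple values, which is exactly the limitation the desiderata above are concerned with.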