Interactive Data Frames and Meerkat: A Path to Foundation Models as a Reliable Software Abstraction

GitHub: https://github.com/hazyresearch/meerkat

The Rise of Unstructured Data

Recent progress in machine learning shows that *foundation models* (large machine learning models trained on massive amounts of data) can perform a remarkably wide range of tasks with reasonable proficiency. These models can even be taught to perform entirely new tasks through in-context learning with a small number of examples, e.g. using text prompts with a large language model. Foundation models range from text-only models like GPT-3 to multi-modal models trained on image, text, audio, and video data, e.g. vision-language models like CLIP.

Over the past year, we’ve been thinking about how foundation models will impact the workflow of technical teams spanning **software engineering, data science, and machine learning**. The lines are blurring between these roles: software engineers and data scientists must now contend day-to-day with how to instruct and evaluate model APIs, and how to integrate these APIs into their workflows.

All of these teams routinely interact with unstructured data types (e.g. videos, images, and free text). However, deriving insights from unstructured data requires significant time and human effort for gathering annotations and performing quality control. These investments are out of reach for most teams.

Our bet is this: foundation models (FMs) will lower the barrier to entry for working with unstructured data, and technical teams will increasingly interact with these models to build new tools, surface insights, and deploy software.

Building Interactive Data Frames for Unstructured Data

As unstructured data permeates the work of technical teams, it’s critical that they have the right toolbox for wrangling it. For structured data, teams swear by data frames, like those provided by Pandas and R. Back in 2021, we started wondering: why doesn’t something similar exist for unstructured data?

One reason is that the reliable software abstractions (e.g. NumPy) that power traditional data frame operations fall flat when applied to unstructured data. A filter over a structured column (e.g. df[df["age"] > 18]) can be implemented with one line of NumPy code, but there is no simple abstraction that implements a semantic filter over unstructured data (e.g. df[df["image"].contains("person")]).
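To make the contrast concrete, here is a minimal sketch in Python. The structured filter is a single NumPy boolean mask. The semantic filter needs a predicate that understands content; `contains_person` below is a hypothetical stand-in (a keyword check over made-up captions that play the role of images), not a real model and not a Meerkat API.

```python
import numpy as np

# Structured column: the filter is one vectorized comparison.
ages = np.array([12, 25, 17, 40])
adult_mask = ages > 18            # boolean mask, one line of NumPy
adults = ages[adult_mask]

# Unstructured column: toy captions standing in for images (made-up data).
images = np.array([
    "a person riding a bike",
    "an empty street at night",
    "two people at a cafe",
    "a bowl of fruit",
])

def contains_person(item: str) -> bool:
    """Hypothetical semantic predicate.

    In practice this is where a foundation model would be called;
    a keyword check is used here only so the sketch runs on its own.
    """
    return any(word in item for word in ("person", "people"))

# There is no vectorized primitive for this, so the predicate runs item by item.
person_mask = np.array([contains_person(item) for item in images])
with_people = images[person_mask]
```

The indexing step is identical in both cases. What differs is where the mask comes from: a cheap, predictable comparison in the first case, and a content-aware predicate in the second.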

What if we viewed foundation models as a software abstraction that processes unstructured data? Much like NumPy is to Pandas, this software abstraction would power a data frame for unstructured data.

The problem? FMs are a terrible software abstraction, and we’re not the first to notice this [Bommasani et al., Narayan et al.]. FMs are hard to control (e.g. brittle to prompt wording), often produce undesired outputs (e.g. hallucinated knowledge), and require careful evaluation. The process of using a traditional software abstraction (i.e. reading the documentation and writing code) is very different from the process of using FMs, which lack the predictability of a good software abstraction.
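As a rough illustration of what "brittle to prompt wording" means in practice, the sketch below measures how often rephrasings of the same question produce the same answer. Everything here is an assumption made for illustration: `toy_model` is a hypothetical, deliberately wording-sensitive stand-in for an FM call, and `agreement` is just one simple way to score consistency.

```python
from collections import Counter
from typing import Callable, Sequence

def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an FM call; deliberately wording-sensitive."""
    return "yes" if "person" in prompt.lower() else "no"

def agreement(model: Callable[[str], str], prompts: Sequence[str]) -> float:
    """Fraction of prompt variants that return the most common answer."""
    answers = [model(p) for p in prompts]
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)

# Three phrasings with the same intent.
variants = [
    "Is there a person in this photo?",
    "Does this photo show a person?",
    "Is anyone visible in this photo?",
]
score = agreement(toy_model, variants)
```

A conventional function would score 1.0 on a check like this by construction. With an FM behind the call, the score has to be measured.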

People have found success using FMs by carefully instructing and testing them. But this process is miserable when done in code alone: