Interactive Data Frames and Meerkat: A Path to Foundation Models as a Reliable Software Abstraction

GitHub: https://github.com/hazyresearch/meerkat

The Rise of Unstructured Data

Recent progress in machine learning shows that *foundation models*, large machine learning models trained on massive amounts of data, can perform a remarkably wide range of tasks with reasonable proficiency. These models can even be taught to perform entirely new tasks through in-context learning with a small number of examples, e.g. using text prompts with a large language model. Foundation models range from text-only models like GPT-3 to multi-modal models trained on images, text, audio, and video data, e.g. vision-language models like CLIP.

Over the past year, we’ve been thinking about how foundation models will impact the workflow of technical teams spanning software engineering, data science, and machine learning. The lines between these roles are blurring: software engineers and data scientists must now contend day-to-day with how to instruct and evaluate model APIs, and how to integrate these APIs into their workflows.

All of these teams routinely interact with unstructured data types (e.g. videos, images, free text). However, deriving insights from unstructured data requires significant time and human effort for gathering annotations and performing quality control. These investments are out of reach for most teams.

Our bet is this: foundation models (FMs) will lower the barrier to entry for working with unstructured data, and technical teams will increasingly interact with these models to build new tools, surface insights, and deploy software.

Data Science Teams. Organizations invest people-hours to extract insights from unstructured data. For example, one large hospital system calculated the fraction of preventable adverse events by hiring medical providers to scour clinical notes (Bates et al., 2023). FMs have the potential to bring this resource-intensive process within reach of more hospitals (Agrawal et al., 2022).
Software Engineering Teams. Until this year, only highly resourced teams could build autocomplete features over unstructured text (e.g. Google Docs Smart Compose, Microsoft Word AutoCorrect). Now, we see much smaller teams use FMs to produce autocomplete features that are much more expressive (e.g. Notion AI, Lex).
Machine Learning Teams. Well-resourced organizations use large labeling teams to identify groups of unstructured data points where the model is making mistakes (e.g. Tesla’s data engine). But last year we showed that foundation models can help machine learning teams identify systematic errors made by models, potentially reducing the labeling burden (Eyuboglu et al., 2022).

Building Interactive Data Frames for Unstructured Data

As unstructured data permeates the work of technical teams, it’s critical that they have the right toolbox for wrangling it. For structured data, teams swear by data frames, like those provided by Pandas and R. Back in 2021, we started wondering: why doesn’t something similar exist for unstructured data?

One reason is that the reliable software abstractions (e.g. NumPy) that power traditional DataFrame operations fall flat when applied to unstructured data. A filter over a structured column (e.g. df[df["age"] > 18]) can be implemented with one line of NumPy code, but there is no simple abstraction that implements a semantic filter over unstructured data (e.g. df[df["image"].contains("person")]).
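To make the structured half of that comparison concrete, here is a minimal sketch of the age filter, first as a Pandas expression and then as the single vectorized NumPy comparison underneath it. The toy table (names, ages) is made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: the column names and values are assumptions for illustration only.
df = pd.DataFrame({"name": ["Ana", "Ben", "Caz"], "age": [17, 18, 42]})

# The filter from the text: Pandas builds a boolean mask and keeps matching rows.
adults = df[df["age"] > 18]

# The same predicate as one line of NumPy: an elementwise comparison
# over the underlying array produces the boolean mask.
ages = df["age"].to_numpy()
mask = np.greater(ages, 18)  # equivalent to `ages > 18`

# Indexing the data frame with the NumPy mask selects the same rows.
adults_np = df[mask]
```

The predicate here is a cheap, deterministic comparison on numbers. A semantic predicate such as "this image contains a person" has no comparable array primitive, which is the gap the rest of this section discusses.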

What if we viewed foundation models as a software abstraction that processes unstructured data? Much like NumPy is to Pandas, this software abstraction would power a data frame for unstructured data.

The problem? FMs are a terrible software abstraction, and we’re not the first to notice this [Bommasani et al., Narayan et al.]. FMs are hard to control (e.g. brittle to prompt wording), often produce undesired outputs (e.g. hallucinated knowledge), and require careful evaluation. The process of using a traditional software abstraction (i.e. reading the documentation and writing code) is very different from the process of using FMs, which lack the predictability of a good software abstraction.

People have found success using FMs by carefully instructing and testing them. But this process is miserable when done in code alone:

Say we’re trying to filter a dataset of paintings based on artistic style. A vision-language FM can produce scores to filter on, but how do we verify that the FM is correct if we can’t see and label the images?
Imagine we’re trying to extract structured data from PDFs using a foundation model, but it’s not working as expected. How do we provide feedback to the model if we can’t annotate or highlight the PDFs?
FMs can be used to identify slices of data where a different machine learning model is systematically making mistakes. But how can we interpret the discovered slices if we can’t inspect the data inside them?