Historical Context: Data Labeling from the Scale AI Days

In the mid-2010s, Scale AI pioneered the third-party data labeling market at a time when most enterprises were unsophisticated buyers of AI data. Back then, few companies had in-house machine learning (ML) expertise or applied ML teams, so they relied on external services to curate and annotate training datasets. Scale AI’s initial focus was on labeling “hard tech” data (like images for computer vision in self-driving cars) and other straightforward annotation tasks. Cutting-edge ML was largely academic and not yet powerful enough for most commercial use cases, which meant early enterprise demand for data was limited and naive.

This began to change with the transformer revolution (circa 2018–2020) and the rise of foundation models. As models like GPT-3 demonstrated astonishing capabilities from large-scale training, the need for ever-increasing “frontier” data surged. The most sophisticated early buyers of such data were frontier labs like OpenAI, DeepMind, and Anthropic, which suddenly needed massive, high-quality datasets and human feedback to push their models’ performance. Scale AI’s success foreshadowed a new ecosystem of AI data providers serving hungry labs whose main bottleneck for improving frontier performance was the quality of data itself, expressed along many dimensions: task complexity, time horizons, tool calls, and modalities.

This set the stage for an evolution from simple one-off annotation jobs to more complex, long-horizon training tasks, culminating today in rich reinforcement learning (RL) environments that can mimic sophisticated workflows. We’ve gone from outsourcing basic labeling (e.g. drawing boxes around stop signs) to outsourcing the creation of entire simulated work environments where AI agents can learn by doing.
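To make “simulated work environments where AI agents learn by doing” concrete, here is a minimal sketch of what such an RL environment can look like: a Gym-style reset/step loop wrapped around a toy email-triage workflow. Everything here (the class name, the task, the keyword policy) is illustrative and hypothetical, not any vendor’s actual product or API.

```python
import random

class EmailTriageEnv:
    """Toy environment: the agent labels each email 'urgent' or 'later'."""

    ACTIONS = ("urgent", "later")

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.inbox = []
        self.pos = 0

    def reset(self):
        # Each email is a (text, ground-truth label) pair; in a real RL env
        # these would be long-horizon tasks with tool calls, not one-shot labels.
        self.inbox = [
            ("server down, prod outage", "urgent"),
            ("lunch menu for friday", "later"),
            ("invoice overdue 30 days", "urgent"),
            ("newsletter: weekly digest", "later"),
        ]
        self.rng.shuffle(self.inbox)
        self.pos = 0
        return self.inbox[self.pos][0]  # first observation

    def step(self, action):
        # Reward +1 for matching the ground-truth label, -1 otherwise.
        _, truth = self.inbox[self.pos]
        reward = 1 if action == truth else -1
        self.pos += 1
        done = self.pos >= len(self.inbox)
        obs = None if done else self.inbox[self.pos][0]
        return obs, reward, done

# A trivial keyword heuristic standing in for the model being trained.
def policy(obs):
    return "urgent" if ("outage" in obs or "overdue" in obs) else "later"

env = EmailTriageEnv(seed=42)
obs, total, done = env.reset(), 0, False
while not done:
    obs, reward, done = env.step(policy(obs))
    total += reward
print(total)  # this keyword policy scores +4 on the toy inbox
```

The commercial version of this is the same interface scaled up: richer observations (documents, browsers, codebases), tool-use actions, and reward functions built from rubrics and evals rather than a single ground-truth label.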

The trend over the last three years: models need less data, but that data has become harder to get, since it must simulate enterprise workflows whose documentation is both rare and locked up in the heads of domain experts.

The Setup

Pretraining as we know it is ending and becoming less differentiating; this has been understood for over a year now.

Moreover, pretraining alone is insufficient for tasks like planning, tool use, and long reasoning chains.

Implicitly, pretraining base models is already mostly done; mid-training, RL, and environment work will drive model improvements over the next two years. Post-training will serve to improve deployment, safety, alignment, and so on.

RL envs are the hottest new buzzword (for about two months now). Maybe it will be world models and VLMs next (soon™), along with continual learning as RL’s shortcomings become evident. Important to remember: RL and RL envs will be just one component of a rich ecosystem for model learning, especially in the context of the broader mid/post-training stack (human data, rubrics, evals, and more).

A first wave of AI startups saw pretraining as a way to create best-in-class AI products for verticals. Their products’ outputs are now strikingly similar to those of base foundation models, perhaps with a bit more engineering on top: basic RAG and a document vault with embeddings. As labs’ models generalize better across almost anything, these startups lack the talent to catch up.

At a higher level, the key questions being answered today include: what infrastructure can we build in the mid/post-training layer to better simulate and improve model performance on complex real-world tasks? What can we improve in model architecture itself to expand into more modalities? How can we build around interleaved data specifically, to swap between modalities effectively?

I’m not a researcher, so I won’t go into depth on the above, but I recommend reading more of Doria’s work, which introduces some key research questions. I’ll focus here on RL env companies as they relate to early-stage investing.

Staging

The recipe for good RL + reasoning agents has been democratized, and those agents have passed the critical threshold of being good enough for many white-collar tasks. Sequoia calls it the bottom 5% rule. A Pleias researcher calls it the democratization of the recipe for good RL + reasoning. Simply put, labs aren’t going to be the only hyperspenders on data infrastructure. The meteoric rise of frontier data and eval companies like Mercor is only a prototype of what is quietly being unleashed.

In the days when Scale was the only data provider, everybody was an unsophisticated buyer, and cutting-edge ML was both academic and not good enough for commercial use cases. Scale found its first markets in vision and hard-tech data. With the explosion of investment in AI talent after the transformer, and the need for ever-increasing frontier data to train ever more frontier capabilities into foundation models, the first enterprise customers sophisticated enough to buy data at venture scale emerged: the labs.

Today, I’m hearing whispers of how the diffusion of ML knowledge for good RL + reasoning agents is being arbitraged in startup land. The hipster early teams here have been very quiet about this, like the forward-deployed teams at Forge. Soon (or maybe already!), a winning strategy will be to pick a large market to build in, gather a bunch of cracked MLEs, partner with decacorn hyperspenders eager to shore up their own lack of ML expertise (who doesn’t have a shortage of great engineers nowadays!), and build high-value white-collar automations.

Incubation-first and roll-up-first platforms with the ability to hire substantial AI talent have seized on this opportunity. General Catalyst’s pivot to this strategy was first evident when it acquired a hospital chain in late 2023. Joshua Kushner’s Thrive and Elad Gil’s new company brain.co is a flagrant reiteration of this in incubation form. It seems megacaps have concluded that building AI-native products is an endeavor best undertaken by brand-new organizations with draconian top-down control, rather than by trying to introduce AI into existing Fortune 500 orgs.