By Sean Cai
More Private Data Markets Comps/Analysis on Substack or on seancai.com

The “Mechanical Turk,” the 18th century version of a black box machine for a multi-turn reasoning task (chess)
Industrial historians should be salivating at what is going on in data markets today. We’ve built the bona fide combustion engine and are now prospecting all around the world for data (oil).
Production of white collar knowledge work is undergoing a radical, Victorian-era-like industrial shift. The TAM is all of human labor, but it’s spread out across several subcategories developing within human data where newer players can outcompete generalized incumbents. And even though Mercor offers acqui-hires to many of them, private capital markets, though frothy, are nothing like those of the early 20th century. Even if data titans wanted to employ Standard Oil-like acquisition plays to vertically monopolize, macro trends are driving the supply chain to be split.
Data contracts are much easier to eat away at than they were 2 years ago. The market is much more mature, and knowledge asymmetries fade as more miners enter the data markets. Throughout this entire year, I’ll explore the notion that, in the absence of being able to innovate at the continual learning paradigm level, we need to dramatically redefine how we collect data and realistically transform it into evals to match the new SOTA models and deployment practices.
The Industrial Age and the Information Age
In the late 18th century, as the most optimistic and foolhardy industrialists of the early waves of the industrial revolution imagined the limits of industrial innovation, they landed on machines that, through no obvious mechanism, could produce miraculous outcomes. Such were the misguided attempts at machines that “spoke” to emulate speech in Viennese courts, the machines by some French inventors that sought to emulate human reasoning at some scale, and the machines that outcompeted humans on reasoning tasks, such as the infamous Mechanical Turk made to impress Empress Maria Theresa in the image above.
Evidently, a few technological breakthroughs needed to happen before we could even get to the point of delegating high level reasoning to machines. Today, we have the best chance ever at realizing what the inventors of the Mechanical Turk imagined in their wildest dreams. And increasingly, we deploy systems that abstract away low level reasoning. These systems spawn modern age white collar Luddites and over-investment bubbles, but ultimately contribute to massive improvements in white collar economies of scale.
Coal, iron, sulfur, lead, and a mishmash of other physical materials powered the physical labor revolution of the Victorian industrial age. Data alone, in its many modalities and white collar representations, will power the white collar labor revolution of the Information Age. Soon, as even general purpose robotics mature, it might power a blue collar labor revolution as well, abstracting away low level reasoning at all levels of the economy.
We always face a tension between pattern-matching on our most applicable historical examples and generalizing to new situations with consideration for new technologies. In this case, let’s examine the things that are most probably generalizable from the last economic revolution, and the things that are not:
What Industrial History Actually Generalizes
- Human power dynamics towards the resources that most directly translate to throughput:
- Just as England and the rest of the industrializing European powers resorted to economic, and finally physical, colonization to extract raw materials such as rubber, oil, and silk/dyes/fabrics to fuel their continued expansion, I predict we’ll see similar dynamics in data acquisition.
- Already, we see a sort of “data imperialism” where RL env companies in the US today buy and arbitrage real world datasets created by those in other countries who have little to no AI talent - I’ll coin the term “mimetic imperialism” to describe this.
- Low level labor examples, data, and innovations from 3rd world countries will power the AI information economies of the new age: those that have a combination of MLE talent, capital markets, and amenable talent policies
- Human nature and governments haven’t changed too much (only the outlawing of direct physical colonization), so I don’t see this as being markedly different.
- Luddites (and their unavoidable proliferation)
- Those who rallied around Ludd in England will forever be immortalized as the source of this term. Owing to the mismatch between human culture, the governance systems we create, and technological advancement by private individuals, there will always be pockets of individuals who can’t match their qualifications to new jobs
- I’ve thought about this a lot. This is reasonable and unavoidable. This is a negative externality of piles of other governmental systems whose positive externalities are necessary to technological innovation, so don’t be elitist about this.
- Some initiatives against data center rollout and nuclear rollout, as well as individuals generally uneducated on AI, are the biggest examples of this category as of Dec 2025
- Many more spinning looms will be destroyed and decried as useless “AI slop” by the artisans of today
- Policy Innovation and Technology go hand in hand
- The distribution of wealth and production are large determinants of the best governance structure a society should have
- A society dominated by landowners will extract more tax revenue from land-based or per capita-based taxation, while a society with a wider distribution of wealth and a strong white collar middle class can extract higher and fairer revenues from proportional and graduated taxation
- This is a common theme illustrated by the struggles of the original Bolsheviks to transform Russia from an agrarian society into an industrialized one such that communal ownership made more sense - wealth and knowledge were still too concentrated in a few individuals from previous systems
- We need policy innovation to match wealth accumulation and balance minimum standard of living expectations to guard against societal unrest
- People will generally find the roles where they produce the most economic value, given that they have the prerequisite education
- Education, in a techno-laissez-faire view, is as a result the most important public institution to keep cutting edge
What’s different today:
- Data is generally most valuable to the labor that created it; unlike raw materials, data resources are not immediately or completely fungible
- Even if we do imperialize datasets from abroad (e.g., Brazil, South Africa, etc.), the models and evals trained on that data will, of course, best match the users in those places, especially in white collar settings
- A direct example is training accounting models on IFRS (used in the majority of the world) versus the US’s GAAP methods, but at some point differences like these will probably largely be treated as context management problems rather than training problems
- We can reasonably expect China to develop as a proto-Meiji era Japan in this regard, copying the methods of industrialization from Western powers but ultimately relying on domestic resources to industrialize as best as possible until natural limitations require it to look outwards
- This is in stark contrast to natural resources like coal, whose composition doesn’t really change with where it’s sourced. Evals today need to be dynamic, changing every week, with real world data pipelines that are generally as diverse as possible. This is the short term focus of my current day-to-day work.
- Evals themselves will become a sort of “finished good” in the human data supply chain with some elements of non-fungibility (they will be best used in the cultural contexts, and by extension the geographies, in which they were created)