Industrial historians should be salivating at what is going on in data markets today.

Production of white-collar knowledge work is undergoing a radical, Victorian-era-style industrial shift. The TAM is all of human labor, but it’s spread across several subcategories developing within human data, where newer players can outcompete generalized incumbents. And even though Mercor offers acqui-hires to many of them, private capital markets, though frothy, are nothing like those of the early 20th century. Even if data titans wanted to employ Standard Oil-like acquisition plays to vertically monopolize, founders today see too much EV in staying independent, given abundant equity capital.

Data contracts are much easier to eat away at than they were two years ago. The market is much more mature; there are fewer

As an avid Victoria 3 player, I can’t help but make the analogy. At a high level, we are developing what is fundamentally a 5x production method for labor markets. There are, of course, sub-industries within that which will get us there gradually, and this is exemplified somewhat in my earlier piece on AI-enabled services, which explores the shift at a more banal, lower level.

Problems to make analogous:

  1. Last mile adaptation of RL products to enterprise workflows
  2. Reconciling academia and industry
  3. Luddites → data centers
    1. When downstream economic effects are more evenly distributed, the preponderance of Luddites decreases
    2. Link to Stalenhag piece
  4. Jevons Paradox (overutilized)
  5. Abstractions of work
  6. Cottage industries and guilds, and resistance to AI adoption taking over skilled work
  7. Oil → data
    1. We can speed up/bring pre-industrial countries and economies into the industrial age via technology sharing and

On labs: vendors won’t be consolidated as long as people continue to pursue verticalization. We are in a world where this sort of data distortion persists because players develop great relationships with labs on an initial set of data, then quality drops off as they fail to scale quality with quantity. I don’t blame them for this. It naturally has to happen, because these startups have growth incentives from taking equity capital, and their founders are naturally empire builders as a result.

In this world, how does it not make sense that a lab will prefer to work with multiple vendors? If quality does not scale linearly with quantity, and this is an enduring feature because of capital markets’ growth incentives, then labs will prefer to keep data markets incredibly fragmented. Free-market competition is good, however, as these conditions allow human data barons to be chipped away at by multiple smaller vendors, and allow small categories within human data and AI to become large TAMs. Those smaller vendors drive the innovation and inherent unbundling of human data markets, which naturally curbs empire-building behavior.
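To make the multi-vendor argument concrete, here is a toy sketch. Everything in it is an assumption for illustration: the decay curve `vendor_quality`, the `half_point` parameter, and the volumes are hypothetical, not measurements from any real vendor or lab. It only shows that if a vendor's average quality falls as its volume grows, splitting a fixed order across more vendors yields more quality-weighted data.

```python
def vendor_quality(volume: float, base_quality: float = 1.0,
                   half_point: float = 1000.0) -> float:
    """Hypothetical average quality per item for one vendor.

    Assumption: quality decays as volume grows and halves at `half_point` items.
    """
    return base_quality / (1.0 + volume / half_point)


def usable_data(total_volume: float, num_vendors: int) -> float:
    """Quality-weighted volume when an order is split evenly across vendors."""
    per_vendor = total_volume / num_vendors
    return num_vendors * per_vendor * vendor_quality(per_vendor)


if __name__ == "__main__":
    # Same total order, spread over more and more vendors.
    for n in (1, 2, 5, 10):
        print(n, round(usable_data(10_000, n), 1))
```

Under this assumed curve, usable data rises with every additional vendor, which is the fragmentation incentive described above. A curve with fixed per-vendor onboarding costs would cap that benefit at some number of vendors.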

Naturally, labs will extend their tendrils down the human data stack in an attempt to see what they can do well, because of how valuable this new-age commodity is. They will spin up human data teams that attempt to source their own data, or to clean raw datasets at scale into sophisticated post-training data formats. They will soon find that this is not something they’re good at, nor something they have the incentive structures of startups to do, and so will retreat to the peripheries of verticalization.

An interesting thing is happening in data buying practices for the most advanced data domains today. We want less data, but of higher quality, as the frontier of models gets pushed outwards.


Yet again, I feel that a venture markets perspective is invaluable for understanding where data markets are headed and why data startups act the way they do. By virtue of taking on equity capital, often from aggressive brand-name VCs, you are expected to burn a certain amount and to grow in proportion to that burn. This creates behavior (sometimes empire-building behavior) that leads not only to the earlier point about scaling quantity without matching quality, but also to a drive to make business decisions without adequate market reciprocation.

Such behaviors are also justified by “our competitor will eat our lunch” and “we will learn fast by failing fast.” For those operating in data markets, trust with researchers is everything (as evidenced by Afterquery and Mercor), and, just as with other SaaS vendors, it can easily be broken once and for all. Indigestion is a common trait seen in companies in data markets that scale