Human Data & RL Envs (Direct)
- Mechanize – Building and selling RL environments that simulate real-world work scenarios (initially focused on software engineering tasks). Models are placed in these environments to complete objectives like adding a feature or debugging code, and their performance is automatically graded to provide feedback[31][32] (a minimal sketch of this grade-the-attempt loop appears after this list). Mechanize explicitly sells to leading AI labs as customers, aiming to automate “all valuable work in the economy” long-term, starting with coding[33].
- Halluminate – Providing realistic “sandboxes” and datasets to train computer/browser-use agents for knowledge work[34][35]. In practice, Halluminate builds evaluation infrastructure and benchmarks for AI agents – essentially a testing suite to validate and fine-tune an AI’s ability to use the internet or software tools. (Think of it as QA and training data for agent behaviors like browsing websites, filling forms, etc.)
- Theta Software – Initially focused on enabling AI agents to learn from experience via memory. Theta started off offering an “intelligent memory layer” that plugs into any agent with a few lines of code, allowing the agent to retain knowledge across runs and improve over time[36][37] (sketched after this list). Theta has since evolved from its browser-infrastructure roots to offer generalized RL environments for the top three labs’ most pressing needs, combining browser use with MCP server capability.
- Mercor – No longer just a data/recruiting marketplace, Mercor is doubling down on evaluation tools for the RL era. The company has signaled it is adding software to support RL training feedback loops[16]. In essence, Mercor can define custom reward functions and evaluation metrics for a client’s RL tasks, leveraging its network of domain experts. It’s an example of a data company pivoting into an RL infrastructure provider. As CEO Brendan Foody frames it, Mercor is building evals for everything, and is now branching into RL environments among other things.
- Turing – Originally a global talent marketplace for software engineers, Turing has become a key data and coding provider for leading AI labs. With millions of engineers on its platform, the company evolved from staffing to supplying human-supervised code, domain data, and micro-tasks for fine-tuning pipelines[1][2]. Having established deep vendor relationships with hyperspenders like OpenAI, Turing is positioned to expand “up the stack”: from contract human-in-the-loop work into mid/post-training infrastructure such as reward modeling, evaluation tools, and RL feedback loops.
- Handshake AI – Leveraging Handshake’s massive verified expert network (millions of students/grads, including ~500K PhDs), Handshake AI connects domain specialists with AI labs to perform model validation, prompt engineering, expert critiques, and labeling for frontier models[0][4][8]. It acts as a data/eval provider to labs, packaging Handshake’s trust and sourcing infrastructure into high-quality human feedback and domain evaluation services. Given its embedded relationships with large educational and hiring networks, Handshake AI has a pathway to expand into RL feedback loops, continual evaluation pipelines, and hybrid human+agent training systems.
- Surge AI – Another data-labeling leader pivoting toward RL and agent evaluation. Surge is known to offer an extensive suite of LLM evaluation frameworks and human feedback pipelines. As it scales (with a funding round at a potential $25B valuation on the horizon[17]), Surge is likely to introduce its own platforms for enterprises to train and monitor AI agents – effectively becoming an end-to-end RL solution vendor. Regarded as the highest-quality data vendor for labs in the first half of 2025, Surge has nonetheless seen reported churn among labs like Anthropic over pricing concerns. There are also current rumors of Surge raising $1B+, a departure from its previously bootstrapped status.
- Idler – Billed as the “Scale AI for RL environments,” Idler is building a platform to generate RL environments on demand, at scale[40]. Its pitch targets any AI team (research or product) that needs custom environments but doesn’t want to wrestle with complicated infrastructure: you specify the scenario, and Idler handles the creation and orchestration of that environment in the cloud.
- Kaizen – Offering an easy-to-use platform for reinforcement learning experimentation (making RL “one-click” to run, akin to launching a server). Kaizen Lab is known for its work on scalable RL training interfaces and has been cited alongside Mechanize as making RL more accessible[27]. It likely provides tools for enterprises to set up continuous learning loops on their proprietary tasks.
- Fleet – An emerging player with little public information available, but by context Fleet is likely developing RL environments or multi-agent training simulations. (Speculatively, the name hints at multi-agent or “fleet” management scenarios; it is part of the cohort of environment startups and has been mentioned in VC discussions[41].) Fleet predates the current RL-environment craze and is backed by Sequoia in the space.
- Plato – Focused on browser-based agents as opposed to coding, Plato provides “live training environments for browser agents” to train and test AI that interacts with web apps[42]. Essentially, Plato can simulate web browsing tasks in realistic web page environments, useful for agents that scrape data, fill web forms, or navigate online workflows. Its infrastructure makes it easy to replicate advanced browser applications like HubSpot and Salesforce inside RL environments: web environments are reconstructed from network request recordings, so tasks run deterministically (see the replay sketch after this list).
- Habitat – Developing extensive libraries of tasks and problems as environments, with an emphasis on breadth and diversity of scenarios. One source describes Habitat’s offering as “hundreds of diverse, programmatically verifiable problems just out of reach of current models”[43] – implying Habitat is curating a wide array of challenges to push agent capabilities. The name coincidentally matches Meta’s robotics simulator, but here it refers to a startup likely targeting the text-and-software domain (not physical robots). Habitat’s approach of many small tasks can help labs and companies rigorously evaluate where an agent’s breaking point is, then train beyond it (the task-plus-verifier pattern is sketched after this list).
- Hud.so – Building infrastructure for evaluating and training computer-use agents, Hud.so provides live environments and benchmarks for AI agents to interact with software, browsers, and operating systems[3]. Their platform includes task suites like OSWorld and SheetBench, designed to test real-world workflows such as form-filling, navigation, and document editing. While Hud.so today emphasizes evaluation and benchmarking, its architecture naturally extends into full RL training loops: integrating feedback, reward modeling, and safe exploration for enterprise and lab clients who want continuous improvement of agent performance.
- Isidor.ai – A new player focused on data procurement and processing for mid/post-training on financial tasks for frontier models. With a team of finance domain experts and MLEs, Isidor specializes in high-quality evals built by reverse-engineering real-world workflows at scale. Rather than carefully constructing deterministic tasks from start to finish, it automates pipelines that turn reverse-engineered workflows into top-decile data and reasoning traces (see the trace schema sketched after this list), building mid/post-training RL data and tooling task by task across white-collar work.
- Micro1 – Originally an AI-driven software outsourcing platform, Micro1 now provides AI-assisted coding environments and developer co-pilots for enterprise clients. Its system pairs human engineers with AI tools to accelerate full-stack development and QA. With integrations across major codebases, Micro1’s recent pivot positions it as both a human-in-the-loop coding data provider and a real-time evaluation layer for AI coding models, making it an upstream complement to RL environments for code generation and debugging.
- Invisible Technologies – Dataset-as-a-service provider for clients like Cohere; often bucketed with Scale.
- Sepal AI – Builds expert-verified RL environments and benchmarks for labs and enterprises. Its contract-based model mirrors Isidor’s early phase, but Isidor differentiates through proprietary data access and the “back-forward” workflow methodology that captures richer reasoning traces.
- Snorkel AI – A pioneer in programmatic labeling, now expanding into evaluation and feedback tooling. Snorkel’s emphasis is breadth and automation: heuristic labeling functions vote on examples programmatically, and a label model denoises those votes into training labels (sketched after this list).
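
A few of the mechanics described above are concrete enough to sketch. First, the Mechanize-style auto-graded coding environment: drop a model into a repo snapshot, let it propose edits, and grade the attempt by running tests. This is a minimal sketch under assumed semantics, not any vendor's actual API; `SWETask`, `SWEEnv`, and their fields are hypothetical names.

```python
import os
import subprocess
import tempfile
from dataclasses import dataclass

@dataclass
class SWETask:
    """A task: a repo snapshot plus a test command that defines success."""
    description: str
    files: dict[str, str]     # path -> source text, the starting repo state
    test_command: list[str]   # e.g. ["pytest", "-q"], run to grade the attempt

class SWEEnv:
    """Gym-style loop: the agent submits file edits, a grader runs the tests."""

    def __init__(self, task: SWETask):
        self.task = task
        self.workdir = tempfile.mkdtemp()

    def reset(self) -> str:
        # Materialize the repo snapshot and hand the agent its objective.
        for path, text in self.task.files.items():
            full = os.path.join(self.workdir, path)
            os.makedirs(os.path.dirname(full), exist_ok=True)
            with open(full, "w") as f:
                f.write(text)
        return self.task.description

    def step(self, edits: dict[str, str]) -> tuple[str, float, bool]:
        # Apply the agent's proposed edits (path -> new file contents) ...
        for path, text in edits.items():
            with open(os.path.join(self.workdir, path), "w") as f:
                f.write(text)
        # ... then grade automatically: reward 1.0 iff the tests pass.
        result = subprocess.run(self.task.test_command, cwd=self.workdir,
                                capture_output=True, text=True)
        reward = 1.0 if result.returncode == 0 else 0.0
        return result.stdout, reward, reward == 1.0
```

Reward here is binary (tests pass or not); real offerings presumably also grade partial progress, style, and efficiency.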
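Second, Theta's original pitch: a memory layer that plugs into any agent with a few lines of code. A toy version, with an invented `MemoryLayer` API and naive keyword retrieval standing in for whatever retrieval a real product uses:

```python
import json
from pathlib import Path

class MemoryLayer:
    """Persists lessons across runs so an agent can improve over time."""

    def __init__(self, path: str = "agent_memory.json"):
        self.path = Path(path)
        self.entries = json.loads(self.path.read_text()) if self.path.exists() else []

    def recall(self, task: str, k: int = 3) -> list[str]:
        # Naive keyword overlap stands in for embedding-based retrieval.
        words = set(task.lower().split())
        scored = sorted(self.entries,
                        key=lambda e: -len(words & set(e["task"].lower().split())))
        return [e["lesson"] for e in scored[:k]]

    def record(self, task: str, lesson: str) -> None:
        self.entries.append({"task": task, "lesson": lesson})
        self.path.write_text(json.dumps(self.entries))

# The advertised "few lines of code" integration pattern:
memory = MemoryLayer()
task = "fill out the vendor onboarding form"
context = memory.recall(task)   # inject past lessons into the agent's prompt
# ... run the agent with `context` prepended ...
memory.record(task, "the submit button only enables after the EIN field")
```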
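Third, Plato's record-and-replay idea: capture an application's network traffic once, then answer the agent's requests from the recording, so every rollout against a "HubSpot" or "Salesforce" replica is deterministic. `ReplayProxy` is an invented name; a production version must also normalize auth tokens, timestamps, and request ordering.

```python
class ReplayProxy:
    """Serves recorded responses so a browser agent sees a frozen web app."""

    def __init__(self):
        self.tape: dict[tuple[str, str], dict] = {}

    def record(self, method: str, url: str, response: dict) -> None:
        # Key on (method, url); real systems also hash bodies and headers.
        self.tape[(method, url)] = response

    def replay(self, method: str, url: str) -> dict:
        if (method, url) not in self.tape:
            # Unrecorded requests fail loudly instead of hitting the live app,
            # which is what makes tasks deterministic and safely repeatable.
            return {"status": 404, "body": "not in recording"}
        return self.tape[(method, url)]

proxy = ReplayProxy()
proxy.record("GET", "https://crm.example.com/api/contacts",
             {"status": 200, "body": '[{"id": 1, "name": "Ada"}]'})
# Every agent rollout now gets byte-identical responses:
assert proxy.replay("GET", "https://crm.example.com/api/contacts")["status"] == 200
```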
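Fourth, the "programmatically verifiable problems" pattern attributed to Habitat: every task ships with a checker, so grading needs no human. The task family below (modular exponentiation) is invented purely for illustration; the point is that one template yields thousands of distinct, auto-gradable problems.

```python
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class VerifiableTask:
    prompt: str
    verify: Callable[[str], bool]   # True iff the model's answer is correct

def make_task(seed: int) -> VerifiableTask:
    rng = random.Random(seed)
    a, b, m = rng.randint(10**6, 10**9), rng.randint(2, 10**4), rng.randint(3, 997)
    answer = str(pow(a, b, m))
    return VerifiableTask(
        prompt=f"Compute {a}^{b} mod {m}. Reply with the number only.",
        verify=lambda text: text.strip() == answer,
    )

# One template, thousands of distinct auto-gradable problems:
tasks = [make_task(seed) for seed in range(1_000)]
print(tasks[0].prompt)            # what the model sees
print(tasks[0].verify("12345"))   # almost certainly False
```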
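Fifth, what Isidor-style reasoning-trace data reverse-engineered from real workflows might look like as a training record. The schema is entirely hypothetical; the idea is that each step recovered from a real artifact carries the expert reasoning that justified it.

```python
from dataclasses import dataclass, field

@dataclass
class WorkflowStep:
    action: str       # what the analyst actually did, recovered from the artifact
    rationale: str    # expert-written reasoning that explains the step

@dataclass
class ReasoningTrace:
    task: str                 # the workflow's objective, stated as a task
    source_artifact: str      # the real-world output the trace was reverse-engineered from
    steps: list[WorkflowStep] = field(default_factory=list)
    final_answer: str = ""

trace = ReasoningTrace(
    task="reconcile Q3 revenue across the ledger and the billing export",
    source_artifact="closing_workbook_q3.xlsx",
    steps=[
        WorkflowStep("filter billing export to recognized revenue",
                     "deferred lines would double-count against the ledger"),
        WorkflowStep("diff per-customer totals against GL account 4000",
                     "isolates which accounts drive the discrepancy"),
    ],
    final_answer="$41,200 gap traced to two unbilled usage accruals",
)
```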
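Finally, Snorkel's programmatic labeling is public and open-source, so it can be sketched against the real `snorkel` library (v0.9-era API) rather than a hypothetical one: heuristic labeling functions vote on each example, and a `LabelModel` denoises the votes into training labels.

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Each labeling function is a cheap heuristic that votes or abstains.
@labeling_function()
def lf_mentions_refund(x):
    return POSITIVE if "refund" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_very_short(x):
    return NEGATIVE if len(x.text.split()) < 3 else ABSTAIN

df = pd.DataFrame({"text": [
    "I want a refund now", "ok thanks", "refund please", "great service today",
]})

# Apply all LFs to get a (num_examples x num_LFs) vote matrix ...
applier = PandasLFApplier(lfs=[lf_mentions_refund, lf_very_short])
L_train = applier.apply(df=df)

# ... then fit a generative model that weighs and denoises the votes.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=100, seed=42)
print(label_model.predict(L_train))   # programmatic labels, no hand annotation
```

With only two toy labeling functions this is underdetermined; in practice teams write dozens, and the label model learns their accuracies and correlations.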