AGI is a nebulous term. We say "human-level in all aspects," but that definition doesn't really matter. The inflection point is when a model can improve itself faster than humans can.
But we don't even have a good way to measure this. There's no benchmark for how good a model is at recursive self-improvement.
So how would we even know if we're close?
Or that we’re even going in the right direction?
There are only proxy ways of measuring this right now. And even our current proxy benchmarks are saturating.
Anthropic is chasing this through code models; OpenAI has ceded control of code and is tripling down on research (and agents).
Start with existing evals that capture actual ML work as proxies.
Each task is meant to represent some part of what ML engineers actually do when they're trying to build better models. I'm not trying to reinvent the wheel and will take from existing benchmarks + evals when it makes sense.
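As a rough sketch, the task suite could be a thin registry over those existing evals. Everything here is hypothetical: the `Task` structure, the task names, and the `agent.attempt` interface are placeholders, not real eval IDs or APIs.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Task:
    name: str                    # which existing eval/benchmark this is drawn from
    run: Callable[[Any], float]  # run(agent) -> score in [0, 1], higher is better
    weight: float = 1.0          # contribution to the aggregate "loss"

# Placeholder entries -- the point is to reuse existing evals, not invent new ones.
TASKS = [
    Task("debug-a-broken-training-run", run=lambda agent: agent.attempt("debug-a-broken-training-run")),
    Task("reproduce-a-paper-baseline",  run=lambda agent: agent.attempt("reproduce-a-paper-baseline")),
    Task("tune-hyperparameters",        run=lambda agent: agent.attempt("tune-hyperparameters")),
]
```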
An agent + model pair runs on these tasks → scores are collected and aggregated into a loss → then you kick off an improvement loop → the agent runs on the tasks again and calls the benchmark to get its loss (or a timeout-based score?) → repeat for N iterations.
The improvement loop is the core thing that matters. The tasks above are a proxy, but what we really want to demonstrate is how good agent + model pairs are at improving themselves. The benchmark score is the "loss."
The agent sees how bad it did, figures out what to do, maybe tunes the model, maybe changes its own logic, downloads a paper, runs new experiments, retrains a model, retries, and reruns the benchmark. Everything here is open: the model weights can change, the agent architecture can change, the internal pipelines can evolve.
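A minimal sketch of that loop, continuing the hypothetical `Task` interface above. `agent.improve` stands in for whatever the agent does between iterations (tuning weights, rewriting its own scaffolding, rerunning experiments); it's an assumed interface, not a real API.

```python
def aggregate_loss(scores, weights):
    """Collapse per-task scores (higher = better) into a single 'loss' (lower = better)."""
    total = sum(weights)
    return 1.0 - sum(s * w for s, w in zip(scores, weights)) / total

def improvement_loop(agent, tasks, n_iterations=10):
    """Run the benchmark, hand the agent its loss, let it change anything, repeat."""
    history = []
    for _ in range(n_iterations):
        scores = [task.run(agent) for task in tasks]
        loss = aggregate_loss(scores, [task.weight for task in tasks])
        history.append(loss)
        # Everything is open: model weights, agent architecture, internal pipelines.
        agent.improve(feedback={"loss": loss,
                                "per_task": dict(zip([t.name for t in tasks], scores))})
    return history  # the improvement curve for this agent + model pair
```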
What we really want to measure is velocity. I tried to quantify this in the scoring section below.
It's a benchmark for learning rate, not in the SGD sense but in the empirical, across-tasks sense.
How fast does the agent + model combo close its own gap? Ideally, if you graph it, you get improvement curves.
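One way to turn those curves into a single velocity number (a sketch, not a settled metric): the average per-iteration drop in loss, plus the fraction of the initial gap the agent has closed by the end.

```python
def improvement_velocity(loss_history):
    """Average per-iteration reduction in loss. Positive = the agent is closing its own gap;
    zero or negative = no self-improvement signal."""
    if len(loss_history) < 2:
        return 0.0
    deltas = [prev - cur for prev, cur in zip(loss_history, loss_history[1:])]
    return sum(deltas) / len(deltas)

def gap_closed(loss_history):
    """Fraction of the initial loss closed by the final iteration."""
    if not loss_history or loss_history[0] == 0:
        return 0.0
    return (loss_history[0] - loss_history[-1]) / loss_history[0]
```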
If you run a bunch of models, maybe you see some empirical law emerge.
Maybe there's a Chinchilla-like curve where improvement follows some trend given size, quality, agentic complexity, etc.
Maybe a predictor function from benchmark logs: not just "how did it do" but "how will it do after N iterations?" You could train a small model on that too.
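A sketch of what such a predictor could start from, assuming losses stay positive: fit a rough power law to one run's improvement curve and extrapolate to iteration N. A small model trained over many runs' logs would be the real version; this is just the curve-fitting baseline.

```python
import numpy as np

def fit_improvement_law(loss_history):
    """Fit loss_i ≈ a * i**(-b) in log-log space (i starting at 1). Assumes all losses > 0."""
    iters = np.arange(1, len(loss_history) + 1)
    slope, intercept = np.polyfit(np.log(iters), np.log(loss_history), 1)
    return float(np.exp(intercept)), float(-slope)  # (a, b)

def predict_loss(a, b, n):
    """Extrapolated loss after n iterations under the fitted power law."""
    return a * n ** (-b)

# e.g. a, b = fit_improvement_law(history); predict_loss(a, b, 50)
```

Fitted (a, b) pairs across many agent + model combos, plus metadata like model size and agent complexity, would be the training data for that small predictor model, and also where a Chinchilla-like trend would show up if one exists.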