Vals Fellowship

<aside> 🎓

Build the science of evaluating frontier AI in high-stakes domains.

</aside>

At a glance

Duration: 3–6 months
Location: We prefer in-person fellows, since it allows a tighter partnership, but we are also happy to support a more remote arrangement with check-ins.
Focus: hard, unsolved problems in AI evaluation with real frontier models + real customers
Apply here: Vals Fellowship Application
Deadline: Apply by June 30, 2026

Why this fellowship exists

Most AI evaluation today is a mess. Benchmarks saturate, domain experts disagree on what “correct” even means, and real-world performance often diverges from leaderboard numbers in ways nobody can predict.

The field needs better methodology, better measurement, better infrastructure, and better theory. Most of that work isn’t getting done because the people best positioned to do it (PhD students or academics) usually don’t have access to frontier models, customer-grade evaluation problems, sufficient API credits/funding, or the engineering infrastructure to run experiments at scale.

Vals builds evaluations for AI in law, finance, science, and other high-stakes domains. We work with the leading AI labs and Fortune 500 companies, which means fellows get access to problems and infrastructure that are difficult to replicate in a university lab. Our goal is to develop better benchmarks and evaluation techniques, and to ensure we’re measuring what matters. As a fellow, you’ll help tackle some of these problems!

What fellows do

Fellows apply with a proposal for a new benchmark they want to build. If accepted, the fellowship provides time and support to design, implement, and validate that benchmark. Some domains we’re interested in include:

Long-horizon agentic benchmarking in computer use or software engineering
Cybersecurity
Finance, law, insurance
AI for Science evals: research-level mathematics, biology, materials science, theoretical physics, and more

These are example domains we’re particularly interested in seeing applications for, but other domains and benchmark ideas are welcome. We are looking for benchmarks that have construct validity and that reflect how AI is actually used in the domain the benchmark targets.

While preference will be given to applications that propose building new benchmarks, we will also consider strong applications that deal with science-of-evals work. Such work can include: