<aside> 🎓
Build the science of evaluating frontier AI in high-stakes domains.
</aside>
Most AI evaluation today is a mess. Benchmarks saturate, domain experts disagree on what “correct” even means, and real-world performance often diverges from leaderboard numbers in ways nobody can predict.
The field needs better methodology, better measurement, better infrastructure, and better theory. Most of that work isn’t getting done because the people best positioned to do it (PhD students or academics) usually don’t have access to frontier models, customer-grade evaluation problems, sufficient API credits/funding, or the engineering infrastructure to run experiments at scale.
Vals builds evaluations for AI in legal, finance, science, and other high-stakes domains. We work with the leading AI labs and Fortune 500 companies, which means fellows get access to problems and infrastructure that are difficult to replicate in a university lab. Our goal is to develop better benchmarks and evaluation techniques, and to ensure we’re measuring what matters. As a fellow, you’ll help tackle some of these problems!
Fellows apply with a proposal for a new benchmark they want to build. If accepted, the fellowship provides time and support to design, implement, and validate that benchmark. Some domains we’re interested in include:
These are example domains we’re particularly interested in seeing applications for, but other domains and benchmark ideas are welcome. We are looking for benchmarks that have construct validity and that reflect real usage in the domain they target.
While preference will be given to applications that propose building new benchmarks, we will also consider strong applications that focus on science-of-evals work. Such work can include: