<aside> 💡
TL;DR:
Since my last update, I built a full web app that shows Claude Code live in action taking the exam, and demoed it to various interested parties. Here’s a screenshot from the web app:

A screenshot of the web app showing Claude Code’s attempts at the IUCN Red List Assessor Exam
Moreover, with a little prompt engineering, Claude Code with Sonnet now consistently passes 8 of 8 exams, averaging 86% – comfortably clearing the pass mark of 75%, and far better than I can do!
Here are some closing thoughts to wrap up this project.

</aside>
The demo landed very well with various audiences, including the EEG and colleagues at UNEP-WCMC.
At the very least, I hope this project gives the IUCN a glimpse of what’s possible with AI, particularly for addressing their challenges in validating, maintaining and scaling Red List assessments.
At the EEG demo, I received audience questions about (1) whether a large model (i.e. Sonnet) is necessary, and (2) how the model does on its parametric knowledge alone (i.e. without access to the IUCN guidelines docs). I ran the experiments, and the results were conclusive: Claude Code with Sonnet (86% average) significantly outperforms Claude Code with Haiku (61%), and without access to the guidelines the average grade drops to 59%, showing the model cannot pass on parametric knowledge alone.

Two takeaways: (1) Sonnet significantly outperforms Haiku (86% vs 61%), (2) the model’s parametric knowledge is insufficient to pass the exams (only 59% without the guidelines docs).
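For concreteness, here is a minimal sketch of how the per-exam grades from these ablations could be aggregated against the 75% pass mark. The results.json layout and configuration names are my own assumptions for illustration, not the project’s actual output format.

```python
# Sketch: aggregate per-exam grades for each ablation and report the mean grade
# and how many of the 8 exams clear the 75% pass mark.
# Assumed (not actual) file layout:
#   {"sonnet+guidelines": [<8 grades>], "haiku+guidelines": [...], "sonnet-no-guidelines": [...]}
import json
from statistics import mean

PASS_MARK = 75  # IUCN Red List Assessor exam pass threshold (%)

with open("results.json") as f:
    results = json.load(f)

for config, grades in results.items():
    passed = sum(g >= PASS_MARK for g in grades)
    print(f"{config}: avg {mean(grades):.1f}%, passed {passed}/{len(grades)} exams")
```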
Anil pointed out a valuable use case for a tool like this that I hadn’t considered – the pedagogical value for trainee assessors.
Neil Burgess, Chief Scientist at UNEP-WCMC, suggested that an agentic workflow like this could also be super useful for their work with CITES to protect endangered species from illegal trade. I can definitely see that being very valuable – my only worry is that I’m being pulled in many directions at the moment, and Anil has wisely advised that I need to protect my focus!
I have learned so much from this process. Above all, it has been eye-opening how fast we can move from idea to action using agentic AI. The scope of possibility for rapidly testing research ideas is incredible.
Another important question on my mind, though, is getting clear on how much genuine research a given project involves, as opposed to AI-driven software engineering – even if accelerating Red List assessments would be extremely impactful work for conservation. On the other hand, working with AI agents is new for all of us, so designing effective workflows for them is itself novel terrain. For example, it’s not yet clear how best to connect these agents to Anil’s corpus of scientific literature – an exciting future direction. But it’s still important to be mindful of this research-versus-application trade-off.
Next steps for taking this further will be to approach the IUCN to (a) gauge their interest and (b) see if we could access their SIS data, with its full history of Red List assessments. Michael Dales has mentioned he can put me in touch with the head of the Red List Unit, Craig Hilton-Taylor, who seems exactly the right person to talk to about this. So we’ll see where this goes from here – but for now, I’m hopeful this project at least serves as a valuable proof of concept and an example of how AI can contribute positively to the conservation domain.