<aside> 💡
Yes!
</aside>
I recently completed the IUCN Red List Assessor Training course, achieving 80% on the final exam and receiving my official certification (you need >75% to pass). Upon completing it, I was curious how Claude Code would do, so I decided to put it to the test.
So, how does Claude Code do? Pretty good! It passed four out of five exam runs, averaging 80%.

Claude Code’s exam results. The top row is my own attempt; the bottom five are Claude Code’s. Its best run scored 88%.
Humans are allowed to repeat the exam as many times as needed until they pass, so four out of five is a very good result. Moreover, I am confident the incorrect answers are not due to an innate limitation, but rather just require more careful context engineering.
I remain confident that AI can significantly help the IUCN scale up Red List Assessments.
Last week I became a certified IUCN Red List Assessor, after completing the IUCN Red List Assessor Training course. (Note that this certification does not mean I can now add my own Red List assessments – one still needs to be a species expert or be part of an IUCN SSC specialist group to contribute one.)
To obtain the certification at the end of the course, I had to complete a difficult 3-hour, 25-question final exam. An example question looks like this:

Even with ChatGPT’s help, I found the exam questions pretty tough (it was an open-book exam, so AI and any online resources were allowed).
While doing the exam, I suspected AI would do really well at this, provided we do some careful context engineering first. This intuition is informed by the fact that the top AI reasoning models are now competing with the world’s best mathematicians and computer scientists in global olympiads. Given my interest in how AI could help with accelerating Red List assessments, I thought as a first step we should see whether it can pass the exam that human assessors are required to pass.
Why does this matter? The IUCN’s biggest bottleneck towards achieving their Red List targets is scale: they are limited by the number of trained assessors available to do assessments and re-assessments. So any way AI can help accelerate the process would be extremely valuable.
To this end, I decided to put Claude Code to the test. I made sure it had access to the same resources I had during the exam – namely:
First, I needed a way to get the exam questions into a format the AI could easily parse. To do this, I used Claude Code to create the following scripts:
- extract_exam.py to parse questions.md from the exam page’s raw HTML.
- extract_memo.py to extract a memo.txt from the HTML of a submitted exam attempt.
- grade_attempt.py to grade a set of AI-generated answers.txt against the memo.txt (a rough sketch of this follows below).
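The grading step is simple enough to sketch. The snippet below is a minimal, hypothetical version of grade_attempt.py: it assumes memo.txt and answers.txt both use one `Qnn: <letter>` line per question, which is my own assumed format rather than what the actual scripts use.

```python
#!/usr/bin/env python3
"""Toy grader: compare answers.txt against memo.txt.

Usage: python grade_attempt.py memo.txt answers.txt
The 'Qnn: <letter>' line format is an assumption, not the original scripts'.
"""
import re
import sys


def load_answers(path):
    """Parse lines like 'Q01: B' into {'Q01': 'B'}."""
    answers = {}
    for line in open(path, encoding="utf-8"):
        m = re.match(r"\s*(Q\d+)\s*[:.]\s*([A-Ea-e])", line)
        if m:
            answers[m.group(1).upper()] = m.group(2).upper()
    return answers


def main():
    memo = load_answers(sys.argv[1])      # correct answers
    attempt = load_answers(sys.argv[2])   # AI-generated answers
    correct = sum(attempt.get(q) == a for q, a in memo.items())
    print(f"{correct}/{len(memo)} correct ({100 * correct / len(memo):.0f}%)")


if __name__ == "__main__":
    main()
```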
Next I needed to get the IUCN Red List guideline PDFs into a text format that the AI could read. For this, I used the Claude Code PDF Skill to create a script parse_pdf.py that takes a PDF and outputs a corresponding markdown file along with its associated images and diagrams. The resulting directory looks like:

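The actual parse_pdf.py was generated by the PDF Skill, so the sketch below is only a stand-in for the text-extraction part: it uses pypdf (my choice of library, not necessarily the author’s) and skips the image and diagram extraction entirely.

```python
#!/usr/bin/env python3
"""Hypothetical, text-only stand-in for parse_pdf.py: dump a PDF to markdown."""
from pathlib import Path
import sys

from pypdf import PdfReader  # assumed library; the real script may differ


def pdf_to_markdown(pdf_path: Path, out_dir: Path) -> Path:
    """Write <name>.md into out_dir with one section per PDF page."""
    out_dir.mkdir(parents=True, exist_ok=True)
    reader = PdfReader(str(pdf_path))
    md_lines = [f"# {pdf_path.stem}", ""]
    for i, page in enumerate(reader.pages, start=1):
        md_lines.append(f"## Page {i}")
        md_lines.append(page.extract_text() or "")
        md_lines.append("")
    out_file = out_dir / f"{pdf_path.stem}.md"
    out_file.write_text("\n".join(md_lines), encoding="utf-8")
    return out_file


if __name__ == "__main__":
    print(pdf_to_markdown(Path(sys.argv[1]), Path(sys.argv[2])))
```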
I then used Claude Code to design a red-list-assessor-skill that points the AI to the relevant official guidance docs.
SKILL.md:

Next I added a Claude Code /attempt-exam-question slash command to answer a given question using the red-list-assessor-skill, outputting the final answer along with clear, concise reasoning.
attempt-exam-question.md:

I then added an /attempt-exam slash command that instructs Claude Code to spin out 25 parallel Tasks, one for each question, and run /attempt-exam-question on each.
I then extracted 5 exam sets of 25 questions each from the official webpage and ran Claude Code on each exam. Here are some screenshots showing what the Claude Code workflow looks like in action.
q01.md: