Can you design a coding task that breaks the world's best AI models — in your language?
Welcome to a one-week hackathon that challenges you to create coding tasks that expose non-English weaknesses in frontier models.
Tasks are evaluated using the Terminal-Bench framework with the Terminus 2 agent harness against Claude Opus 4.6.
Most AI models handle English tasks well but struggle with the same tasks when the instructions, data, or output requirements are in another language. Your job is to find and formalize those failure modes.
<aside> 📢
Sign up here: https://luma.com/55v3wgi9
</aside>
| Date | Event |
|---|---|
| June 15 (Mon) | Kickoff webinar — rules, workflow, evaluation rubric, live demo. Repo made public |
| June 15–21 | Hackathon week — design, develop, test, and submit tasks |
| June 21 (Sun) 11:59 PM UTC | Code freeze — all PRs must be passing CI by this deadline |
| June 22–23 | Evaluation — all accepted tasks run against Claude Opus 4.6 (15 iterations each) |
| June 24 (Wed) | Awards webinar — winners announced, top tasks showcased |
The top 5 teams (by cumulative score) will be featured in the AI Collective Newsletter and on our company page, and will receive tiered cash prizes: $1,500 for first place, $1,000 for second, $500 for third, $250 for fourth, and $100 for fifth.
Scoring is based on task difficulty — harder tasks earn more points:
| Difficulty | Claude Opus 4.6 Passes (out of 15 runs) | Points per Task |
|---|---|---|
| Easy | 13–15 passes | 1 point |
| Medium | 9–12 passes | 2 points |
| Hard | 4–8 passes | 4 points |
| Very Hard | 0–3 passes | 8 points |
Your team's total score is the sum of points across all accepted tasks. Quality matters more than quantity — one "very hard" task (8 points) is worth more than four "easy" tasks (4 points).
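If you want to estimate your own score while testing, the table above can be sketched as a small Python helper. This is only an illustration of the published thresholds, not the organizers' evaluation code; the function names `points_for_passes` and `team_score` are made up for this example.

```python
RUNS_PER_TASK = 15  # each accepted task is run 15 times against the model


def points_for_passes(passes: int) -> int:
    """Map a pass count (0-15) to points, following the difficulty table."""
    if not 0 <= passes <= RUNS_PER_TASK:
        raise ValueError(f"passes must be between 0 and {RUNS_PER_TASK}")
    if passes >= 13:   # Easy: 13-15 passes
        return 1
    if passes >= 9:    # Medium: 9-12 passes
        return 2
    if passes >= 4:    # Hard: 4-8 passes
        return 4
    return 8           # Very Hard: 0-3 passes


def team_score(pass_counts: list[int]) -> int:
    """Sum the points over all accepted tasks (hypothetical helper)."""
    return sum(points_for_passes(p) for p in pass_counts)
```

For example, a team with one task at 2 passes and four tasks at 14 passes would score 8 + 4 × 1 = 12 under this sketch.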
A good multilingual task tests something that is trivial in English but difficult (or impossible) when the language changes. The difficulty should come from the language-specific layer, not from the engineering problem being inherently hard.
Examples of what we're looking for (non-exhaustive):
`sorted()` gives the wrong order because it compares Unicode code points, while Devanagari has its own alphabetical rules (varṇamālā)
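To make the Devanagari example concrete, here is a minimal Python sketch of one way the mismatch can show up. It assumes three sample words chosen for illustration, one of which starts with the precomposed nukta letter U+0958. Plain `sorted()` places that word after every base consonant, whereas a dictionary would list it next to the other words beginning with the base letter. The NFD-based key is only a partial workaround for this one case, not a full collation; a real task would more likely call for a locale-aware collator.

```python
import unicodedata

# Sample words (chosen for illustration), written with escapes so the
# precomposed letter U+0958 (ka with nukta) is preserved exactly.
qalam = "\u0958\u0932\u092e"  # "qalam", starts with precomposed U+0958
khat = "\u0916\u0924"         # "khat", starts with U+0916
kamal = "\u0915\u092e\u0932"  # "kamal", starts with U+0915

words = [qalam, khat, kamal]

# Plain sorted() compares code points, so U+0958 sorts after U+0916.
by_code_point = sorted(words)

# Partial workaround: decompose to base letter + nukta sign before comparing,
# which moves "qalam" back next to the other words that start with U+0915.
by_nfd = sorted(words, key=lambda w: unicodedata.normalize("NFD", w))

print(by_code_point)  # kamal, khat, qalam
print(by_nfd)         # kamal, qalam, khat
```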