LILTBench Hackathon

Can you design a coding task that breaks the world's best AI models — in your language?

Welcome to a one-week hackathon that challenges you to create coding tasks that expose how frontier models fall short when working in languages other than English.

Tasks are evaluated using the Terminal-Bench framework with the Terminus 2 agent harness against Claude Opus 4.6.

Most AI models handle English tasks well but struggle with the same tasks when the instructions, data, or output requirements are in another language. Your job is to find and formalize those failure modes.

<aside> 📢

Sign up here: https://luma.com/55v3wgi9

</aside>

<aside> 🚨

Repo: https://github.com/lilt/liltbench-tasks-public

</aside>

Schedule

Date	Event
June 15 (Mon)	Kickoff webinar — rules, workflow, evaluation rubric, live demo. Repo made public
June 15–21	Hackathon week — design, develop, test, and submit tasks
June 17 (Wed) 11:59 PM UTC	Sneak Peek — any PRs ready by this deadline will be merged, and will make it to a mid-week leaderboard sneak peek!
June 21 (Sun) 11:59 PM UTC	Code freeze — all PRs must be passing CI by this deadline
June 22–23	Evaluation — all accepted tasks run against Claude Opus 4.6 (15 iterations each)
June 24 (Wed)	Awards webinar — winners announced, top tasks showcased

Prizes & Recognition

The top 5 teams (by cumulative score) will be featured in the AI Collective Newsletter and on our company page, and will receive tiered cash prizes: $1,500 for first place, $1,000 for second, $500 for third, $250 for fourth, and $100 for fifth.

Scoring is based on task difficulty — harder tasks earn more points:

Difficulty	Claude Opus 4.6 Pass Rate (out of 15)	Points per Task
Easy	13–15 passes	1 point
Medium	9–12 passes	2 points
Hard	4–8 passes	4 points
Very Hard	0–3 passes	8 points

Your team's total score is the sum of points across all accepted tasks. Quality matters more than quantity — one "very hard" task (8 points) is worth more than four "easy" tasks (4 points).
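
To make the scoring arithmetic concrete, here is a minimal Python sketch that assumes the pass-rate thresholds in the table above and 15 runs per task. The names `points_for_task` and `team_score` are hypothetical, chosen for illustration; they are not part of the hackathon tooling or the official evaluation pipeline.

```python
RUNS_PER_TASK = 15  # each accepted task is run 15 times (per the schedule)


def points_for_task(passes: int) -> int:
    """Map a task's pass count (out of 15) to points, following the table above."""
    if not 0 <= passes <= RUNS_PER_TASK:
        raise ValueError(f"passes must be between 0 and {RUNS_PER_TASK}")
    if passes >= 13:  # Easy: 13-15 passes
        return 1
    if passes >= 9:   # Medium: 9-12 passes
        return 2
    if passes >= 4:   # Hard: 4-8 passes
        return 4
    return 8          # Very Hard: 0-3 passes


def team_score(pass_counts: list[int]) -> int:
    """Sum the points over all of a team's accepted tasks."""
    return sum(points_for_task(p) for p in pass_counts)


if __name__ == "__main__":
    # One "very hard" task versus four "easy" tasks.
    print(team_score([2]))               # 8
    print(team_score([15, 14, 13, 15]))  # 4
```

Under these assumptions, a team with a single task that the model passes 2 times out of 15 scores 8 points, while a team with four tasks that the model passes 13 or more times scores 4 points in total.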

What Makes a Good Task?

A good multilingual task tests something that is trivial in English but difficult (or impossible) when the language changes. The difficulty should come from the language-specific layer, not from the engineering problem being inherently hard.

Examples of what we're looking for (non-exhaustive):