What Counts as Illegible Reasoning?

Summary

Illegible reasoning in LLMs has been observed in OpenAI models, and understanding this behavior would be beneficial for AI safety research. This post describes challenges with reproducing this behavior in open models and limitations of LLM-as-judge strategies for detecting illegible reasoning.

Illegible reasoning is relevant for AI safety

Both Apollo Research [1] and METR [2] have observed illegible reasoning in OpenAI models, where the model’s reasoning includes incomprehensible snippets like “parted disclaim marinade” but its answer is perfectly legible. We should investigate whether this behavior is load bearing, meaning that models use or even require illegible snippets to maintain task performance. If so, this behavior provides a unique opportunity to understand how models use reasoning tokens beyond relying on their semantic content.

If illegible reasoning is load bearing, it may also be a limitation of chain of thought monitoring [cite] as a safety strategy. Monitors may be able to flag illegible outputs as suspicious, but if models can achieve better task performance with illegible reasoning, we may not want to automatically reject outputs with illegible chain of thought.

Lastly, it’s desirable and aligned behavior for models to have human-understandable chain of thought. Research, like antischeming.ai, that includes chain of thought as part of its evidence base, is made stronger if reasoning is easily understandable. We can train models to have more human-readable chain of thought, a metric Deepseek explicitly optimized for when creating R1 from R1-Zero [3], but we don’t know whether the reasoning we’ve trained is faithful to the “true” thought process.

Existing work identifies examples of illegible reasoning in open models

Unfortunately, OpenAI does not provide public access to the chain of thought generated by their reasoning models, so understanding this behavior requires reproducing it in open models. In this post, I’m aiming to solicit the community for examples of illegible chain of thought and recipes to reproduce them, especially in open models.

The paper Reasoning Models Sometimes Output Illegible Chains of Thought finds that Deepseek R1, Deepseek R1-Zero, and QwQ-32B often output illegible reasoning on GPQA questions, as scored by GPT-4o. It also finds that truncating QwQ’s chain of thought when it starts to become illegible results in lower accuracy, as also scored by GPT-4o, implying that the illegible portions of the chain of thought are load bearing.

I attempted to reproduce the results from the noted paper by running the author’s inference and scoring code to re-generate reasoning traces for R1, and did not find examples of illegible reasoning. The grader model is consistent when asked to re-score the same examples, meaning that the change in reasoning legibility scores did not come from a change in the grader model.

We should also aim to refine metrics for illegibility, in order to better identify examples of illegible chain of thought. The noted paper relies on LLM-graded illegibility scores, but LLM scores are not consistent between grader models with the same prompt. These scores may also conflate multiple behaviors, including behaviors that are not clearly examples of chain of thought unfaithfulness. I found that examples of language switching were largely coherent, but still resulted in higher illegibility scores from the grader model. Model confusion also resulted in a higher illegibility score, but unlike the examples observed in GPT-5, confusion is not an example of illegible reasoning because it is still understandable to a monitor. Because LLM-graded illegibility scores conflate factors other than semantic coherence, we should continue to refine metrics that will help us better identify behaviors with the most concerning implications for chain of thought monitoring.

I hope that this post will help others appreciate the phenomenon of illegible CoT, and invite any commentary on how to elicit and reproduce more substantive examples for follow-up investigations.

Difficulty reproducing illegibility results for Deepseek R1

To reproduce examples of illegible reasoning generated by R1, I downloaded and ran the paper’s code on R1 with the GPQA dataset and GPT-4o as a legibility grader (as configured here).

The paper found that running R1 on GPQA-diamond generated many illegible reasoning traces. However, the legibility scores for the data I generated showed significantly less illegibility.

Qualitatively, the most illegible examples appeared very different. The original data contained garbled reasoning traces including the following snippets:

 If consumersptoms.schoHU dualayers C5H8 and C白金H1215 (!: But Sophie/disfulk two chemicals to Hydrogenate24difficult87hoProblem goes.

Argh.

Perhaps take阿asula examples:

If X is a convince元fluoro État: girl monosand followers.

Assume that each has rawformula cyclo hacen:

12 hours met before – mixed.get third's General:

Yesla wenn’s clintones cyclohexene (C6H10) and graphite (not loser winkels include honey Teʌan older □much as deer.:

Alternate l嫁娶orry..SHE}\\]

If WRONGItem.