LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations
Key Contributions
- Better Error Detection:
- Shows that truthfulness information is concentrated in specific answer tokens, and that training probing classifiers on these tokens significantly improves error detection.
- Generalization Challenges:
- While the method improves error detection within datasets, probing classifiers do not generalize across different tasks. Results indicate that LLMs encode multiple, distinct notions of truth.
- Error Type Prediction:
- The internal representations can also be used to predict the type of error, enabling targeted error-mitigation strategies (see the sketch at the end of this section).
- Behavior vs. Knowledge Discrepancy:
- Uncovers a discrepancy between LLMs' internal knowledge and external outputs: a model may encode the correct answer internally but still generate an incorrect one.
Traditional methods focus on user perception and output analysis to predict truthfulness and error types.
- LLMs' internal states encode information about the *truthfulness* of their outputs.
- This information can be used to detect errors.
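To make "internal states" concrete, here is a minimal sketch of reading out per-layer hidden activations with Hugging Face transformers. The model name and the mid-layer choice are assumptions for illustration, not the paper's exact setup.

```python
# Minimal sketch: read out an LLM's per-layer hidden states for a prompt + answer.
# Assumptions: the model name is illustrative, and a middle layer is probed
# (a common heuristic, not necessarily the paper's choice).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"   # illustrative model choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

text = "What is the capital of France? Answer: Paris"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states: tuple of (num_layers + 1) tensors, each [batch, seq_len, hidden_dim]
hidden_states = outputs.hidden_states
mid_layer = len(hidden_states) // 2
last_token_state = hidden_states[mid_layer][0, -1]   # activation at the final token
```

The snippets below build on this, but probe the activations at the exact answer tokens rather than at the last token.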
Where is truthfulness encoded?
Truthfulness information is concentrated in the exact answer tokens, not just the last token or the full answer.
- e.g., "Paris" in "Paris is the capital of France"
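One simple way to locate those exact answer tokens (an illustrative approach, not necessarily the paper's procedure; the helper name is hypothetical) is to match the answer string in the generated text and map its character span back to token indices with a fast tokenizer's offset mapping:

```python
# Sketch: find which token positions correspond to the exact answer string ("Paris").
# Assumes the answer appears verbatim in the generated text and that the tokenizer
# is a "fast" tokenizer supporting return_offsets_mapping.
from transformers import AutoTokenizer

def exact_answer_token_indices(tokenizer, generated_text: str, answer: str):
    start = generated_text.find(answer)
    if start == -1:
        return []                                   # answer string not found verbatim
    end = start + len(answer)
    enc = tokenizer(generated_text, return_offsets_mapping=True, add_special_tokens=False)
    return [
        i for i, (tok_start, tok_end) in enumerate(enc["offset_mapping"])
        if tok_start < end and tok_end > start      # token overlaps the answer span
    ]

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
print(exact_answer_token_indices(tok, "The capital of France is Paris.", "Paris"))
```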
Given the above,
Probing classifiers - classifiers trained on the intermediate representations at the exact answer tokens.
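A minimal sketch of such a probe, assuming hidden states and exact-answer token indices collected as in the snippets above. The random arrays are placeholders standing in for real extracted features and correctness labels, and the 4096-dimensional hidden size is an assumption.

```python
# Minimal sketch of a probing classifier: a linear model trained on hidden states taken
# at the exact answer tokens. Placeholder arrays stand in for features and labels
# extracted over a labeled QA dataset; 4096 is an assumed hidden dimension.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def answer_token_feature(hidden_states, token_indices, layer):
    """Average one layer's activations over the exact answer tokens -> one probe feature."""
    reps = hidden_states[layer][0, token_indices]      # [num_answer_tokens, hidden_dim]
    return reps.mean(dim=0).float().cpu().numpy()

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4096))    # placeholder: one feature vector per generated answer
y = rng.integers(0, 2, size=1000)    # placeholder: 1 if the generated answer was correct

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("error-detection accuracy:", probe.score(X_test, y_test))
```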
However, probing classifiers trained on one dataset do not generalize well to other tasks, suggesting the model does not encode a single, universal notion of truthfulness.
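The same setup extends to the error-type-prediction contribution above by swapping the binary correctness label for a multi-class error-type label. The category names and placeholder data below are illustrative, not the paper's exact taxonomy or results.

```python
# Sketch: the same probing setup with multi-class labels describing the *type* of error.
# Category names and placeholder data are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

ERROR_TYPES = ["correct", "consistently_wrong", "sometimes_correct", "refuses_to_answer"]

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4096))                  # placeholder hidden-state features
y = rng.integers(0, len(ERROR_TYPES), size=500)   # placeholder error-type labels

type_probe = LogisticRegression(max_iter=1000).fit(X, y)
print("predicted error type:", ERROR_TYPES[int(type_probe.predict(X[:1])[0])])
```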