LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations
Key Contributions
- Better Error Detection:
- Shows that truthfulness information is concentrated in specific answer tokens, and that training probing classifiers on these tokens significantly improves error detection.
- Generalization Challenges:
- While the method improves error detection within datasets, probing classifiers do not generalize across different tasks. Results indicate that LLMs encode multiple, distinct notions of truth.
- Error Type Prediction:
- The internal representations can also be used to predict the type of error, enabling targeted error-mitigation strategies (see the sketch at the end of this section).
- Behavior vs. Knowledge Discrepancy:
- Uncovers a discrepancy between LLMs' internal knowledge and external outputs: a model may encode the correct answer internally but still generate an incorrect one.
Traditional methods focus on user perception and output analysis to predict truthfulness and error types.
- LLMs' internal states encode information about the *truthfulness* of their outputs.
- This information can be used to detect errors.
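To make "internal states" concrete, here is a minimal sketch of reading out per-layer hidden activations with Hugging Face transformers. The model name and the mid-layer choice are assumptions for illustration, not the paper's exact setup.

```python
# Minimal sketch: read out an LLM's per-layer hidden states for a prompt + answer.
# Assumptions: the model name is illustrative, and a middle layer is probed
# (a common heuristic, not necessarily the paper's choice).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"   # illustrative model choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

text = "What is the capital of France? Answer: Paris"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states: tuple of (num_layers + 1) tensors, each [batch, seq_len, hidden_dim]
hidden_states = outputs.hidden_states
mid_layer = len(hidden_states) // 2
last_token_state = hidden_states[mid_layer][0, -1]   # activation at the final token
```

The snippets below build on this, but probe the activations at the exact answer tokens rather than at the last token.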
Where is truthfulness encoded?
Truthfulness information is concentrated in the exact answer tokens, not just the last token or the full answer.
- e.g., "Paris" in "Paris is the capital of France"
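One simple way to locate those exact answer tokens (an illustrative approach, not necessarily the paper's procedure; the helper name is hypothetical) is to match the answer string in the generated text and map its character span back to token indices with a fast tokenizer's offset mapping:

```python
# Sketch: find which token positions correspond to the exact answer string ("Paris").
# Assumes the answer appears verbatim in the generated text and that the tokenizer
# is a "fast" tokenizer supporting return_offsets_mapping.
from transformers import AutoTokenizer

def exact_answer_token_indices(tokenizer, generated_text: str, answer: str):
    start = generated_text.find(answer)
    if start == -1:
        return []                                   # answer string not found verbatim
    end = start + len(answer)
    enc = tokenizer(generated_text, return_offsets_mapping=True, add_special_tokens=False)
    return [
        i for i, (tok_start, tok_end) in enumerate(enc["offset_mapping"])
        if tok_start < end and tok_end > start      # token overlaps the answer span
    ]

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
print(exact_answer_token_indices(tok, "The capital of France is Paris.", "Paris"))
```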
Given the above,
Probing classifiers - classifiers trained on the intermediate representations at the exact answer tokens.
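A minimal sketch of such a probe, assuming hidden states and exact-answer token indices collected as in the snippets above. The random arrays are placeholders standing in for real extracted features and correctness labels, and the 4096-dimensional hidden size is an assumption.

```python
# Minimal sketch of a probing classifier: a linear model trained on hidden states taken
# at the exact answer tokens. Placeholder arrays stand in for features and labels
# extracted over a labeled QA dataset; 4096 is an assumed hidden dimension.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def answer_token_feature(hidden_states, token_indices, layer):
    """Average one layer's activations over the exact answer tokens -> one probe feature."""
    reps = hidden_states[layer][0, token_indices]      # [num_answer_tokens, hidden_dim]
    return reps.mean(dim=0).float().cpu().numpy()

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4096))    # placeholder: one feature vector per generated answer
y = rng.integers(0, 2, size=1000)    # placeholder: 1 if the generated answer was correct

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("error-detection accuracy:", probe.score(X_test, y_test))
```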
However, probing classifiers trained on one dataset do not generalize well to other tasks, suggesting the model does not encode a single, universal notion of truthfulness.
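The same setup extends to the error-type-prediction contribution above by swapping the binary correctness label for a multi-class error-type label. The category names and placeholder data below are illustrative, not the paper's exact taxonomy or results.

```python
# Sketch: the same probing setup with multi-class labels describing the *type* of error.
# Category names and placeholder data are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

ERROR_TYPES = ["correct", "consistently_wrong", "sometimes_correct", "refuses_to_answer"]

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4096))                  # placeholder hidden-state features
y = rng.integers(0, len(ERROR_TYPES), size=500)   # placeholder error-type labels

type_probe = LogisticRegression(max_iter=1000).fit(X, y)
print("predicted error type:", ERROR_TYPES[int(type_probe.predict(X[:1])[0])])
```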