Repo

https://github.com/ads2280/llm-eval-uncertainty

Results: https://llm-eval-uncertainty.streamlit.app/

Summary of results

Traditional evaluations provide point estimates but don't convey confidence levels. I implemented conformal prediction to generate prediction intervals that quantify uncertainty in code correctness evals.

The conformal prediction system achieved 70.0% coverage with an average interval width of ±0.48, and flagged 30% of predictions as uncertain and in need of human review.
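For reference, here is a minimal sketch of how empirical coverage and mean interval width can be computed from interval bounds and ground-truth scores. This is illustrative only, not the repo's actual code, and the variable names are assumptions:

```python
import numpy as np

def coverage_and_width(lower, upper, y_true):
    """Empirical coverage and mean width of prediction intervals.

    lower, upper, y_true: 1-D arrays of interval bounds and true scores.
    (Illustrative names, not the repo's actual API.)
    """
    lower, upper, y_true = map(np.asarray, (lower, upper, y_true))
    covered = (y_true >= lower) & (y_true <= upper)  # does the true score fall inside?
    return covered.mean(), (upper - lower).mean()    # e.g. 0.70 empirical coverage
```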

The LLM judge shows the highest uncertainty for mid-range scores (0.4-0.7), which correspond to borderline code correctness cases where even human evaluators would likely disagree. This suggests that conformal prediction automatically surfaces the most ambiguous evaluation decisions for human oversight.

My understanding of conformal prediction

I implemented conformal prediction as a mathematically rigorous approach to a fundamental problem in production AI systems: how do you know when to trust an AI judgment versus when to route it for human review? Its primary selling point, to me, is the distribution-free coverage guarantee: regardless of the underlying model architecture or data distribution, I can mathematically guarantee that X% of my prediction intervals will contain the true value. In my code correctness evaluation, this meant transforming point estimates like "0.95 correct" into honest uncertainty ranges like "[0.53, 1.0] with 85% confidence."
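As a rough illustration of the mechanics (a split-conformal sketch, not the repo's exact code; function and variable names are assumptions): calibrate on held-out judge scores with known labels, take the finite-sample-corrected quantile of the absolute residuals, and use it as the interval half-width.

```python
import numpy as np

def conformal_interval(cal_preds, cal_truth, new_pred, alpha=0.15):
    """Split conformal prediction interval for a scalar score in [0, 1].

    cal_preds / cal_truth: judge scores and true correctness labels for a
    held-out calibration set (illustrative names).
    alpha=0.15 targets ~85% coverage, matching the "85% confidence" example above.
    """
    cal_preds, cal_truth = np.asarray(cal_preds), np.asarray(cal_truth)
    scores = np.abs(cal_truth - cal_preds)              # nonconformity = absolute residual
    n = len(scores)
    # Finite-sample-corrected quantile level that gives the coverage guarantee.
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, q_level)
    lo, hi = new_pred - q, new_pred + q
    return max(0.0, lo), min(1.0, hi)                   # clip to the valid score range
```

With a judge score of 0.95 and a calibrated quantile of roughly 0.42, this kind of procedure produces an interval like the [0.53, 1.0] example above.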

I think what makes this especially interesting for production systems is the automatic decision boundary detection: when a prediction interval crosses a critical threshold (like the 0.5 correctness cutoff I was working with), the system flags that case as uncertain and requiring human review, while confidently processing the clear-cut examples. In my implementation, this yielded 70-87% empirical coverage and correctly flagged 30.0% of cases as uncertain.
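The routing rule itself is simple. Here is a hedged sketch under that 0.5-cutoff setup (the function name and labels are illustrative, not the repo's API):

```python
def route(lower, upper, threshold=0.5):
    """Route a single evaluation based on its prediction interval."""
    if lower >= threshold:
        return "accept"        # whole interval above the cutoff: confidently correct
    if upper < threshold:
        return "reject"        # whole interval below the cutoff: confidently incorrect
    return "human_review"      # interval straddles the cutoff: ambiguous, flag for review
```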

In my repo I experiment with three conformal prediction strategies to optimize for practical deployment: standard fixed-width intervals (conservative but reliable), adaptive widths based on prediction confidence (balancing safety with utility), and an aggressive approach with tighter intervals and hard caps for confident predictions. Happy to talk through each/any of these approaches in more depth on Tuesday!
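To make the three strategies concrete, here is a hedged sketch of how the interval half-widths might differ. The parameter values and the distance-from-0.5 confidence proxy are illustrative assumptions, not the repo's tuned settings:

```python
def interval_half_width(q, pred, strategy="standard",
                        confidence_scale=0.5, hard_cap=0.25):
    """Half-width of the prediction interval under three strategies.

    q: calibrated conformal quantile (as in the sketch above).
    pred: the judge's point estimate in [0, 1].
    confidence_scale and hard_cap are illustrative knobs, not repo values.
    """
    if strategy == "standard":
        return q                                   # fixed width: conservative but reliable
    # Distance from 0.5 as a crude confidence proxy: extreme scores count as "confident".
    confidence = abs(pred - 0.5) * 2               # 0 at 0.5, 1 at 0 or 1
    width = q * (1 - confidence_scale * confidence)
    if strategy == "adaptive":
        return width                               # shrink width as confidence grows
    if strategy == "aggressive":
        # Same shrinkage, plus a hard cap on width for highly confident predictions.
        return min(width, hard_cap) if confidence > 0.8 else width
    raise ValueError(f"unknown strategy: {strategy}")
```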

Comparing results with/without conformal prediction

Without conformal prediction, just LLM-as-judge: