Summary: This paper introduces the Behavioral Alignment Score (BAS), a decision-theoretic metric that evaluates how well LLM confidence estimates support safer decision-making by allowing models to abstain when uncertain. Unlike standard evaluation protocols that require responses regardless of confidence, BAS aggregates utility across a range of risk thresholds and shows theoretically that truthful confidence estimates maximize expected utility.
Key Insight: LLMs should be evaluated not just on accuracy, but on whether their confidence levels appropriately guide decisions about when to answer versus when to stay silent.
Read Paper
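As a rough illustration of the idea (not the paper's exact formulation), the sketch below assumes a utility scheme of +1 for a correct answer, 0 for abstaining, and -t/(1-t) for a wrong answer at risk threshold t; under that scheme, answering has positive expected utility exactly when the true probability of being correct exceeds t, so truthful confidences come out ahead when utility is averaged across thresholds:

```python
import numpy as np

def utility(answered, correct, t):
    # Hypothetical utility scheme: +1 if correct, -t/(1-t) if wrong,
    # 0 if the model abstains. Answering beats abstaining in expectation
    # exactly when the probability of being correct exceeds t.
    if not answered:
        return 0.0
    return 1.0 if correct else -t / (1.0 - t)

def behavioral_alignment_score(confidences, correctness,
                               thresholds=np.linspace(0.05, 0.95, 19)):
    # Aggregate utility across risk thresholds: at each threshold t the
    # model answers only when its reported confidence clears t.
    per_threshold = []
    for t in thresholds:
        utils = [utility(c >= t, ok, t)
                 for c, ok in zip(confidences, correctness)]
        per_threshold.append(np.mean(utils))
    return float(np.mean(per_threshold))

# Truthful confidences let the model abstain on the item it gets wrong;
# blanket overconfidence pays the wrong-answer penalty at every threshold.
truthful = behavioral_alignment_score([0.9, 0.9, 0.6, 0.3],
                                      [True, True, True, False])
overconfident = behavioral_alignment_score([1.0, 1.0, 1.0, 1.0],
                                           [True, True, True, False])
```

On this toy data the truthful scorer earns a clearly higher aggregate utility, which is the behavior the metric is designed to reward.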
Summary: This work introduces the first transferable learned membership inference attack for fine-tuned language models, replacing hand-crafted heuristics with deep-learning classifiers trained on unlimited labeled data. The key discovery is that fine-tuning produces an invariant memorization signature that is detectable across different architectures and datasets, enabling attacks that generalize without requiring shadow models.
Key Insight: Fine-tuning language models leaves a consistent "fingerprint" of memorization that neural classifiers can learn to detect reliably across different model families.
Read Paper
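The paper's learned attack operates on real model internals; as a toy stand-in (all numbers are synthetic), the sketch below trains a one-feature logistic-regression "attack model" on a simulated memorization signal, namely that training members tend to receive lower loss than non-members:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a memorization signature: members of the
# fine-tuning set get lower per-example loss than non-members.
member_loss = rng.normal(1.0, 0.3, size=500)
nonmember_loss = rng.normal(2.0, 0.3, size=500)
x = np.concatenate([member_loss, nonmember_loss])
y = np.concatenate([np.ones(500), np.zeros(500)])  # 1 = member

# Tiny logistic-regression attack classifier trained by gradient descent.
w, b = 0.0, 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))
    w -= 0.1 * np.mean((p - y) * x)
    b -= 0.1 * np.mean(p - y)

pred = (1.0 / (1.0 + np.exp(-(w * x + b)))) > 0.5
accuracy = float(np.mean(pred == y))
```

Because the two loss distributions barely overlap, even this minimal classifier separates members from non-members with high accuracy; the paper's contribution is showing that a learned version of this signal transfers across model families.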
Summary: PRISM combines LLM guidance with efficient latent semantic clustering to perform high-precision topic modeling on specialized domains. The approach fine-tunes sentence encoders using only a small number of LLM-provided labels, then segments the embedding space to separate closely related topics while maintaining interpretability and low computational cost.
Key Insight: You don't need massive numbers of expensive LLM queries: just a few LLM labels can guide traditional clustering methods to achieve state-of-the-art topic separation with better efficiency.
Read Paper
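PRISM's actual pipeline fine-tunes a sentence encoder; as a much simpler illustration of the underlying idea, that a handful of labels can anchor an otherwise unsupervised clustering, here is a seeded k-means sketch (the data and label indices are made up):

```python
import numpy as np

def seeded_kmeans(X, seed_labels, n_iter=50):
    # seed_labels: dict {cluster_id: [row indices labeled, e.g. by an LLM]}.
    # Each centroid is initialized from the mean of its few labeled points,
    # then standard k-means runs on the full (mostly unlabeled) set.
    X = np.asarray(X, dtype=float)
    cents = np.stack([X[idx].mean(axis=0) for idx in seed_labels.values()])
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - cents[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        new = np.stack([X[assign == k].mean(axis=0) if np.any(assign == k)
                        else cents[k] for k in range(len(cents))])
        if np.allclose(new, cents):
            break
        cents = new
    return assign, cents

# Two tight groups of "embeddings"; one labeled example per topic.
X = np.array([[0, 0], [0.2, 0], [0, 0.2], [5, 5], [5.2, 5], [5, 5.2]])
labels = seeded_kmeans(X, {0: [0], 1: [3]})[0]
```

The single labeled point per topic pins each centroid to the right region, after which the unlabeled points fall into place; PRISM applies the same few-labels-guide-many principle with learned embeddings rather than raw coordinates.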
Summary: This paper proposes a defense mechanism for federated learning that combines server-side learning with client-update filtering and geometric-median aggregation to resist malicious attacks. The approach maintains robustness even when clients' data are non-IID and when more than 50% of clients are adversarial, using only small or synthetic server-side datasets.
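The geometric-median aggregation step can be sketched with Weiszfeld's algorithm; this is a generic implementation of that one component, not the authors' full defense, which additionally relies on server-side learning and update filtering to push past the usual 50% tolerance:

```python
import numpy as np

def geometric_median(points, n_iter=100, eps=1e-8):
    # Weiszfeld's algorithm: iteratively reweight each point by the inverse
    # of its distance to the current estimate. Unlike the mean, the result
    # is robust to a minority of extreme (e.g. poisoned) client updates.
    pts = np.asarray(points, dtype=float)
    est = pts.mean(axis=0)  # start from the ordinary average
    for _ in range(n_iter):
        d = np.maximum(np.linalg.norm(pts - est, axis=1), eps)
        w = 1.0 / d
        new = (w[:, None] * pts).sum(axis=0) / w.sum()
        if np.linalg.norm(new - est) < eps:
            break
        est = new
    return est

# Three honest updates near the origin plus one poisoned outlier:
updates = [[0, 0], [0.1, 0], [0, 0.1], [100, 100]]
robust = geometric_median(updates)
naive = np.mean(updates, axis=0)
```

The plain average is dragged a quarter of the way toward the poisoned update, while the geometric median stays with the honest majority.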