📄 Summary: This paper reviews how modern LLM agents are built by externalizing capabilities into memory stores, reusable skills, interaction protocols, and infrastructure rather than modifying model weights directly. The work argues that this shift transforms difficult cognitive tasks into forms that models can solve more reliably, using cognitive artifacts as a unifying framework.
💡 Key Insight: The smartest AI agents aren't built by training better models, but by building better external infrastructure around them.
🔗 Read Paper
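The externalization idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `MemoryStore` and `SkillRegistry` classes and their keyword-matching retrieval are hypothetical stand-ins for the memory stores and reusable skills the summary describes, with the base model left frozen.

```python
class MemoryStore:
    """Episodic memory the agent queries at inference time (no weight updates)."""
    def __init__(self):
        self.entries = []

    def write(self, note: str) -> None:
        self.entries.append(note)

    def recall(self, keyword: str) -> list[str]:
        # Naive substring match stands in for embedding-based retrieval.
        return [e for e in self.entries if keyword in e]


class SkillRegistry:
    """Reusable skills registered as plain callables the agent can invoke."""
    def __init__(self):
        self.skills = {}

    def register(self, name, fn) -> None:
        self.skills[name] = fn

    def invoke(self, name, *args):
        return self.skills[name](*args)


memory = MemoryStore()
skills = SkillRegistry()
skills.register("truncate", lambda text: text[:40] + "...")

memory.write("user prefers concise answers")
print(memory.recall("concise"))  # recalled context goes into the next prompt
```

The point of the sketch is that both capabilities live entirely outside the model: improving the agent means improving these external stores, not retraining.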
📄 Summary: Combee addresses the scalability challenge of prompt learning methods for LLM agents by enabling parallel execution across multiple agent traces without quality degradation. The method efficiently learns system prompts at scale by handling the synchronization challenges that arise when learning from many concurrent agentic executions.
💡 Key Insight: Prompt learning for agents scales best when you parallelize learning across many agent runs rather than improving a single agent sequentially.
🔗 Read Paper
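A toy version of the parallel idea, assuming nothing about Combee's actual algorithm: candidate system prompts are scored across many concurrent agent runs, and scores are synchronized only at aggregation time. The `run_agent` scoring function here is a hypothetical stand-in for a real LLM rollout.

```python
from concurrent.futures import ThreadPoolExecutor

def run_agent(prompt: str, task: str) -> float:
    # Stand-in for an LLM rollout; rewards prompts mentioning the task's verb.
    return 1.0 if task.split()[0] in prompt else 0.0

def evaluate_prompt(prompt: str, tasks: list[str]) -> float:
    # Traces run in parallel; the only synchronization point is aggregation.
    with ThreadPoolExecutor(max_workers=4) as pool:
        scores = list(pool.map(lambda t: run_agent(prompt, t), tasks))
    return sum(scores) / len(scores)

tasks = ["search the web", "search archives", "summarize results"]
candidates = ["You are a helpful agent.", "You are a search-focused agent."]
best = max(candidates, key=lambda p: evaluate_prompt(p, tasks))
print(best)  # the candidate that scored highest across all parallel traces
```

Because each trace is independent, throughput scales with the number of workers; the hard part the summary alludes to is keeping prompt updates consistent when many such evaluations run at once.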
📄 Summary: Tempo proposes using small vision-language models as efficient temporal compressors to adapt multimodal LLMs for hour-long videos while respecting token limits. The approach performs intent-aligned compression in a single forward pass and uses adaptive token allocation to strictly enforce budgets without breaking temporal causality.
💡 Key Insight: Small AI models can act as intelligent filters, keeping only the video moments that matter for understanding rather than blindly sampling frames.
🔗 Read Paper
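The budget-enforcement idea can be sketched independently of Tempo's specifics. In this hypothetical version, each video segment gets tokens proportional to a relevance score (which in the paper would come from the small VLM), the total never exceeds the hard budget, and segments stay in chronological order so temporal causality is preserved.

```python
def allocate_tokens(scores: list[float], budget: int) -> list[int]:
    """Split `budget` tokens across segments proportionally to `scores`."""
    total = sum(scores)
    # Floor each share so the hard budget is never exceeded.
    alloc = [int(budget * s / total) for s in scores]
    # Hand leftover tokens to the highest-scoring segments first.
    leftover = budget - sum(alloc)
    for i in sorted(range(len(scores)), key=lambda i: -scores[i])[:leftover]:
        alloc[i] += 1
    return alloc

# Four chronological segments with stand-in relevance scores.
scores = [1.0, 6.0, 2.0, 1.0]
alloc = allocate_tokens(scores, budget=100)
print(alloc, sum(alloc))  # [10, 60, 20, 10] 100
```

Note the output list keeps the input order: a highly relevant segment gets more tokens, but segments are never reordered, which is the "without breaking temporal causality" constraint.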
📄 Summary: This paper challenges the narrative that supervised fine-tuning only memorizes while reinforcement learning generalizes, showing that reasoning SFT can generalize cross-domain when optimized properly. The work reveals a "dip-and-recovery" pattern where performance temporarily degrades before improving, and demonstrates that data quality and model capability jointly determine generalization success.
💡 Key Insight: SFT failure often comes from stopping training too early: the performance looks bad mid-training but recovers stronger later.
🔗 Read Paper
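The practical consequence of dip-and-recovery is that naive early stopping picks the wrong checkpoint. A minimal sketch, assuming a simple patience-based selection rule (not taken from the paper): with low patience, training halts inside the dip; with enough patience, it survives the dip and finds the recovered, stronger checkpoint.

```python
def best_checkpoint(eval_scores: list[float], patience: int) -> int:
    """Index of the best eval score, stopping only after `patience`
    consecutive epochs without improvement."""
    best_i, since_improve = 0, 0
    for i, s in enumerate(eval_scores):
        if s > eval_scores[best_i]:
            best_i, since_improve = i, 0
        else:
            since_improve += 1
            if since_improve >= patience:
                break  # impatient stopping ends training here
    return best_i

# Illustrative curve: scores dip after epoch 1, then recover past the old peak.
scores = [0.60, 0.62, 0.55, 0.50, 0.58, 0.70, 0.71]
print(best_checkpoint(scores, patience=2))  # 1: stops mid-dip, keeps the weak peak
print(best_checkpoint(scores, patience=4))  # 6: waits out the dip, finds the recovery
```

The eval curve here is fabricated for illustration; the summary's claim is only the qualitative shape, i.e. that mid-training degradation can precede a stronger final model.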