πŸ€– AI Research Digest – 2026-04-10

LLM

Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts

πŸ“„ Summary: This paper identifies a critical failure mode in multimodal MoE models: they accurately perceive images yet fail at reasoning tasks they can solve when the same problem is presented as text. The authors show that visual inputs cause routing divergence toward different expert clusters than text inputs, particularly in middle layers, and propose the "Routing Distraction" hypothesis as the underlying mechanism.

πŸ’‘ Key Insight: Multimodal models can "see" content correctly but get confused about which experts to use, routing their reasoning through the wrong specialized modules.

πŸ”— Read Paper
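The routing-divergence finding can be illustrated with a toy comparison: measure how far the expert-usage distribution for an image-posed problem drifts from the distribution for the same problem posed as text, layer by layer. Everything below is an illustrative sketch, not the paper's method; the expert counts, layer names, histogram values, and the choice of Jensen-Shannon divergence are our assumptions.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two expert-routing distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical per-layer expert-usage histograms (8 experts) for the same
# reasoning problem given as text vs. as an image.
text_routing = {
    "layer_4":  [0.40, 0.30, 0.10, 0.05, 0.05, 0.05, 0.03, 0.02],
    "layer_12": [0.35, 0.35, 0.10, 0.08, 0.05, 0.03, 0.02, 0.02],
    "layer_20": [0.30, 0.30, 0.20, 0.10, 0.04, 0.03, 0.02, 0.01],
}
image_routing = {
    "layer_4":  [0.38, 0.28, 0.12, 0.06, 0.06, 0.04, 0.04, 0.02],
    "layer_12": [0.05, 0.08, 0.35, 0.30, 0.10, 0.06, 0.04, 0.02],
    "layer_20": [0.10, 0.12, 0.30, 0.25, 0.10, 0.07, 0.04, 0.02],
}

for layer in text_routing:
    d = js_divergence(text_routing[layer], image_routing[layer])
    print(f"{layer}: JS divergence = {d:.3f}")
```

With the made-up numbers above, the middle layer shows a much larger divergence than the early layer, mirroring the paper's claim that the two modalities are routed to different expert clusters mid-network.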


AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation

πŸ“„ Summary: This paper addresses the fragmented evaluation landscape for text-to-audio-video generation with AVGen-Bench, a benchmark covering 11 real-world task categories and a multi-granular evaluation framework that combines specialist models with MLLM judges. It reveals a significant gap between the strong aesthetic quality and the weak semantic controllability of current systems.

πŸ’‘ Key Insight: Current audio-visual generators look good but often don't follow detailed instructionsβ€”we need better ways to measure semantic alignment alongside aesthetics.

πŸ”— Read Paper
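A multi-granular evaluation of this kind ultimately reduces to combining per-dimension scores from heterogeneous judges into task-level numbers. The sketch below is purely illustrative: the dimension names, score values, and weighting scheme are our assumptions, not AVGen-Bench's actual protocol.

```python
from statistics import mean

def aggregate(scores_by_dimension, weights=None):
    """Combine per-dimension score lists (e.g. aesthetics from specialist
    models, instruction-following from an MLLM judge) into one task score.
    Equal weights by default; all names here are hypothetical."""
    weights = weights or {k: 1.0 for k in scores_by_dimension}
    total_w = sum(weights[k] for k in scores_by_dimension)
    return sum(weights[k] * mean(v) for k, v in scores_by_dimension.items()) / total_w

# Made-up per-sample scores for one task category.
task_scores = {
    "aesthetic_quality":  [0.86, 0.91, 0.88],  # specialist audio/video scorers
    "semantic_alignment": [0.52, 0.47, 0.55],  # MLLM instruction-following judge
    "audio_visual_sync":  [0.74, 0.70, 0.72],
}
print({k: round(mean(v), 2) for k, v in task_scores.items()})
print("overall:", round(aggregate(task_scores), 2))
```

Even in this toy example, reporting only the overall average would hide exactly the pattern the benchmark highlights: high aesthetic scores masking poor semantic alignment.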


Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

πŸ“„ Summary: This paper identifies a critical training instability in On-Policy Distillation (OPD): student models generate progressively longer, repetitive rollouts that corrupt their own training data, causing sharp performance degradation. The authors propose StableOPD, which uses reference-based divergence constraints to stabilize training and prevent this collapse.

πŸ’‘ Key Insight: When training students to mimic teachers, models can get stuck in a death spiral of generating longer and longer repetitive outputs, poisoning their own training data.

πŸ”— Read Paper
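A reference-based divergence constraint can be sketched as an extra penalty term in the distillation objective: pull the student toward the teacher, but penalize drift from a frozen reference policy. This toy formulation over single-token distributions is our own illustration, not StableOPD's exact loss.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q) between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def constrained_distill_loss(student, teacher, reference, beta=0.1):
    """On-policy distillation loss with a reference-based constraint:
    match the teacher, but pay a penalty (weight beta, our choice) for
    diverging from a frozen reference policy. Illustrative only."""
    return kl(student, teacher) + beta * kl(student, reference)

# A student drifting away from both teacher and reference is penalized twice.
drifted = constrained_distill_loss([0.9, 0.1], [0.5, 0.5], [0.5, 0.5])
stable  = constrained_distill_loss([0.5, 0.5], [0.5, 0.5], [0.5, 0.5])
print(f"drifted: {drifted:.3f}, stable: {stable:.3f}")
```

The intuition matches the paper's framing: without the reference term, a student that starts producing degenerate long rollouts keeps training on its own corrupted outputs; anchoring to a reference bounds how far it can wander.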


ML

Meta-learning In-Context Enables Training-Free Cross Subject Brain Decoding

πŸ“„ Summary: This paper presents a meta-learned approach for visual decoding from fMRI brain signals that generalizes to new subjects without any fine-tuning, using only a small set of image-brain activation examples for in-context adaptation. This addresses the major obstacle of substantial neural representation variability across individuals.