Introduction

We trained SWE-Pruner, a code semantic highlighting model, and open-sourced an accompanying agentic integration framework. The model identifies and highlights the lines of a retrieved document that are semantically relevant to the current query; the framework acts as an integration layer that plugs easily into state-of-the-art agent systems such as Claude Code and OpenHands. On real-world multi-turn coding agent tasks (SWE-bench, SWE-QA), our solution reduces token consumption by over 30% while maintaining or even improving task performance.
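To make the integration-layer idea concrete, here is a minimal sketch of how a line-level highlighter could be wired into an agent's file-reading tool. The class, method, and function names below are illustrative assumptions for this post, not the actual SWE-Pruner API, and the token-overlap scorer is only a stand-in for the learned model.

```python
# Illustrative sketch only: the names here are hypothetical and do not
# reflect the actual SWE-Pruner API.
from typing import Callable, List


class LineHighlighter:
    """Scores each line of a document for relevance to the current query.

    This stand-in uses simple token overlap; the real model replaces it
    with learned semantic scoring.
    """

    def score_lines(self, query: str, lines: List[str]) -> List[float]:
        q = set(query.lower().split())
        return [len(q & set(line.lower().split())) / (len(q) or 1)
                for line in lines]


def wrap_read_tool(read_file: Callable[[str], str],
                   highlighter: LineHighlighter,
                   query: str,
                   threshold: float = 0.2) -> Callable[[str], str]:
    """Wrap an agent's file-read tool so only relevant lines enter the context."""

    def pruned_read(path: str) -> str:
        lines = read_file(path).splitlines()
        scores = highlighter.score_lines(query, lines)
        kept = [line for line, s in zip(lines, scores) if s >= threshold]
        return "\n".join(kept) if kept else "(no relevant lines found)"

    return pruned_read
```

The point of the wrapper pattern is that the agent's own tool interface stays unchanged; only the text that flows back into the context window is pruned.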

Model Release:

In this article, we present our technical approach.

The Problem: Context Bloat in Coding Agents

In prior RAG work, coarse-grained context bloat was already a serious problem:

https://huggingface.co/blog/zilliz/zilliz-semantic-highlight-model

In production RAG systems, a typical query retrieves around 10 documents of several thousand tokens each, so a single query injects tens of thousands of tokens into the context. Only a few dozen sentences in that material actually carry relevant information; the rest is noise that increases cost and degrades answer quality.

In real-world software engineering tasks, the problem is even more severe: a single query may touch hundreds of code files, and in large projects individual files easily run well past a few thousand tokens. Yet modern coding agents still rely heavily on keyword-matching approaches such as grep to explore massive codebases, and the resulting flood of matches forces most agents to fall back on aggressive, hard context truncation.
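As a rough illustration of that truncation behavior, the sketch below shows the kind of hard cap commonly applied to grep-style output. The cap values and helper function are assumptions for illustration, not the truncation logic of any specific agent framework.

```python
# Illustrative only: cap values and helper are assumptions, not the
# truncation logic of any particular agent framework.
import subprocess

MAX_MATCHES = 100          # hard cap on returned matches
MAX_OUTPUT_CHARS = 10_000  # hard cap on characters fed to the model


def grep_with_hard_truncation(pattern: str, repo_root: str) -> str:
    """Run a recursive grep and blindly truncate the result.

    Whatever falls past the cap is dropped, whether or not it was the one
    match the agent actually needed.
    """
    result = subprocess.run(
        ["grep", "-rn", pattern, repo_root],
        capture_output=True, text=True, check=False,
    )
    matches = result.stdout.splitlines()[:MAX_MATCHES]
    return "\n".join(matches)[:MAX_OUTPUT_CHARS]
```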


In AI agent scenarios, the issue gets worse still, because queries are complex instructions produced by reasoning and task decomposition rather than short keyword lists. Traditional highlighting methods mechanically mark literal keyword matches and miss the lines that actually answer the agent's question.
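To see why literal keyword marking falls short, consider a small made-up example. The query, code lines, and keyword set below are invented for illustration only.

```python
# Made-up example: a reasoned agent query and a few lines of code.
query = "where is the retry backoff delay computed before the request is resent?"
lines = [
    "def send_request(url, attempt=0):",
    "    delay = BASE_DELAY * (2 ** attempt)   # exponential backoff",
    "    time.sleep(delay)",
    "    return http.get(url)",
]

# Keyword highlighting: only lines containing a literal query term are marked.
keywords = {"retry", "backoff"}
keyword_hits = [line for line in lines
                if any(k in line.lower() for k in keywords)]
# -> only the line with the '# exponential backoff' comment matches; the
#    lines that actually compute and apply the delay are missed.

# A semantic highlighter instead scores every line against the full
# instruction, so the delay computation and the sleep call are kept too.
```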