Xu Li¹ · Simon Yu¹ · Minzhou Pan¹² · Yiyou Sun³ · Bo Li⁴² · Dawn Song³² · Xue Lin¹ · Weiyan Shi¹

¹Northeastern University ²Virtue AI ³UC Berkeley ⁴UIUC

Homepage | Paper | GitHub | Dataset

<aside>

TL;DR

The problem. The safety-capability gap in LLM-based agents widens as they engage in multi-turn interactions and employ diverse tools, introducing risks overlooked by existing benchmarks.
The benchmark. To systematically scale safety testing in multi-turn, tool-realistic settings, we present MT-AgentRisk, the first benchmark for multi-turn tool-agent safety. Our evaluations reveal substantial safety degradation with 16% increment in avg among models
The defense. To mitigate this gap, we introduce ToolShield, a training-free defense that leverages agents' own capabilities to improve safety. Experiments show that ToolShield reduces ASR by 30% on average in multi-turn interactions.
Results. ToolShield reduces ASR by 50% for Claude-4.5-Sonnet and 30% on average across models. Safety experiences also transfer across models without retraining. </aside>

Figure 1. As agents’ capabilities grow, their safety lags behind, creating a widening capability-safety gap. To scale agent safety evaluation, we develop an attack taxonomy that systematically transforms single-turn harmful tasks to multi-turn attack sequences. Applying the taxonomy, we construct MT-AgentRisk, the first agent safety benchmark in multi-turn, tool-realistic settings. To mitigate these risks, we propose ToolShield, a self-exploration defense that effectively protects tool-using agents in multi-turn interactions.

The Problem: The Capability-Safety Gap in Agents

LLM-based agents are becoming remarkably capable. Yet as capabilities grow, safety does not scale accordingly. This creates a widening capability-safety gap: the disconnect between what agents can do and what they should do.

This gap widens further along two dimensions:

Multi-turn interactions. Real-world human-agent collaboration spans multiple exchanges and harmful intent can be distributed across turns.
Tool-use. Current safety training focuses on refusing harmful text [8, 9]. But a tool description may look innocent while its execution causes real damage [7].

However, existing benchmarks overlook this critical intersection:

Prior work addresses either multi-turn conversations without tools, or tool-using agents in single-turn settings. This leaves a critical blind spot in the evaluation of agent safety.

Benchmark	Multi-Turn	Tool-Usage
MHJ [1]	✓	✗
SafeDialBench [2]	✓	✗
RedTeamCUA [3]	✗	✓
SafeArena [4]	✗	Browser Only
OpenAgentSafety [5]	Conditional	✓
MCP-Safety [6]	✗	✓
MT-AgentRisk (Ours)	✓	✓

                ***Table 1.** Comparison between different agent safety benchmarks.*

Benchmark: MT-AgentRisk

To systematically study the overlooked intersection of multi-turn interactions and tool-use, we propose an attack taxonomy that captures how single-turn harms can be distributed across turns. Applying this taxonomy, we construct MT-AgentRisk, the first benchmark for multi-turn tool-agent safety.

Multi-Turn Attack Taxonomy (MTA)

Figure 2. The multi-turn attack taxonomy transforms a single-turn harmful task into an attack sequence. Transformation takes two main formats. Each format contains two different methods. All transformation actions share a common What dimension, yielding 8 total subcategories. The examples show how A2 and D1 transform single turn task to attack sequences.

Figure 2. The multi-turn attack taxonomy transforms a single-turn harmful task into an attack sequence. Transformation takes two main formats. Each format contains two different methods. All transformation actions share a common What dimension, yielding 8 total subcategories. The examples show how A2 and D1 transform single turn task to attack sequences.

The taxonomy transforms single-turn harmful tasks into multi-turn attack sequences along three dimensions:

Format: How is the transformation structured?