Xu Li¹ · Simon Yu¹ · Minzhou Pan¹² · Yiyou Sun³ · Bo Li⁴² · Dawn Song³² · Xue Lin¹ · Weiyan Shi¹
¹Northeastern University ²Virtue AI ³UC Berkeley ⁴UIUC
Homepage | Paper | GitHub | Dataset

Figure 1. As agents’ capabilities grow, their safety lags behind, creating a widening capability-safety gap. To scale agent safety evaluation, we develop an attack taxonomy that systematically transforms single-turn harmful tasks into multi-turn attack sequences. Applying the taxonomy, we construct MT-AgentRisk, the first agent safety benchmark in multi-turn, tool-realistic settings. To mitigate these risks, we propose ToolShield, a self-exploration defense that effectively protects tool-using agents in multi-turn interactions.
LLM-based agents are becoming remarkably capable. Yet as capabilities grow, safety does not scale accordingly. This creates a widening capability-safety gap: the disconnect between what agents can do and what they should do.
This gap widens further along two dimensions: interactions unfold over multiple turns, and agents act through external tools.
However, existing benchmarks overlook this critical intersection: prior work addresses either multi-turn conversations without tools, or tool-using agents in single-turn settings, leaving a critical blind spot in the evaluation of agent safety.
| Benchmark | Multi-Turn | Tool-Usage |
|---|---|---|
| MHJ [1] | ✓ | ✗ |
| SafeDialBench [2] | ✓ | ✗ |
| RedTeamCUA [3] | ✗ | ✓ |
| SafeArena [4] | ✗ | Browser Only |
| OpenAgentSafety [5] | Conditional | ✓ |
| MCP-Safety [6] | ✗ | ✓ |
| MT-AgentRisk (Ours) | ✓ | ✓ |
***Table 1.** Comparison of agent safety benchmarks.*
To systematically study the overlooked intersection of multi-turn interactions and tool use, we propose an attack taxonomy that captures how single-turn harms can be distributed across turns. Applying this taxonomy, we construct MT-AgentRisk, the first benchmark for multi-turn tool-agent safety.
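As a rough, hypothetical illustration (the class and field names below are ours, not the benchmark's actual schema), a transformed attack can be viewed as the original single-turn task plus a taxonomy label and an ordered sequence of user turns:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch: these names are illustrative only and are not taken
# from the MT-AgentRisk data format.

@dataclass
class SingleTurnTask:
    """A harmful request expressed in a single user turn."""
    request: str

@dataclass
class MultiTurnAttack:
    """The same intent distributed across several, individually benign-looking turns."""
    source: SingleTurnTask
    subcategory: str                      # one of the 8 taxonomy subcategories, e.g. "A2"
    turns: List[str] = field(default_factory=list)

# Example: a single-turn task split into a short multi-turn sequence.
task = SingleTurnTask(request="<single-turn harmful request>")
attack = MultiTurnAttack(
    source=task,
    subcategory="A2",
    turns=[
        "<innocuous setup turn>",
        "<turn that elicits a partial step>",
        "<turn that completes the original request>",
    ],
)
```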

Figure 2. The multi-turn attack taxonomy transforms a single-turn harmful task into an attack sequence. The transformation takes two main formats, and each format contains two methods. All transformation actions share a common What dimension, yielding 8 subcategories in total. The examples show how A2 and D1 transform a single-turn task into attack sequences.
The taxonomy transforms single-turn harmful tasks into multi-turn attack sequences along three dimensions:
Format: How is the transformation structured?