Xu Li¹ · Simon Yu¹ · Minzhou Pan¹² · Yiyou Sun³ · Bo Li⁴² · Dawn Song³² · Xue Lin¹ · Weiyan Shi¹

¹Northeastern University ²Virtue AI ³UC Berkeley ⁴UIUC

Homepage | Paper | GitHub | Dataset

<aside>

TL;DR

Figure 1. As agents’ capabilities grow, their safety lags behind, creating a widening capability-safety gap. To scale agent safety evaluation, we develop an attack taxonomy that systematically transforms single-turn harmful tasks to multi-turn attack sequences. Applying the taxonomy, we construct MT-AgentRisk, the first agent safety benchmark in multi-turn, tool-realistic settings. To mitigate these risks, we propose ToolShield, a self-exploration defense that effectively protects tool-using agents in multi-turn interactions.

Figure 1. As agents’ capabilities grow, their safety lags behind, creating a widening capability-safety gap. To scale agent safety evaluation, we develop an attack taxonomy that systematically transforms single-turn harmful tasks to multi-turn attack sequences. Applying the taxonomy, we construct MT-AgentRisk, the first agent safety benchmark in multi-turn, tool-realistic settings. To mitigate these risks, we propose ToolShield, a self-exploration defense that effectively protects tool-using agents in multi-turn interactions.

The Problem: The Capability-Safety Gap in Agents

LLM-based agents are becoming remarkably capable. Yet as capabilities grow, safety does not scale accordingly. This creates a widening capability-safety gap: the disconnect between what agents can do and what they should do.

This gap widens further along two dimensions:

  1. Multi-turn interactions. Real-world human-agent collaboration spans multiple exchanges and harmful intent can be distributed across turns.
  2. Tool-use. Current safety training focuses on refusing harmful text [8, 9]. But a tool description may look innocent while its execution causes real damage [7].

However, existing benchmarks overlook this critical intersection:

Prior work addresses either multi-turn conversations without tools, or tool-using agents in single-turn settings. This leaves a critical blind spot in the evaluation of agent safety.

Benchmark Multi-Turn Tool-Usage
MHJ [1]
SafeDialBench [2]
RedTeamCUA [3]
SafeArena [4] Browser Only
OpenAgentSafety [5] Conditional
MCP-Safety [6]
MT-AgentRisk (Ours)
                ***Table 1.** Comparison between different agent safety benchmarks.* 

Benchmark: MT-AgentRisk

To systematically study the overlooked intersection of multi-turn interactions and tool-use, we propose an attack taxonomy that captures how single-turn harms can be distributed across turns. Applying this taxonomy, we construct MT-AgentRisk, the first benchmark for multi-turn tool-agent safety.

Multi-Turn Attack Taxonomy (MTA)

Figure 2. The multi-turn attack taxonomy transforms a single-turn harmful task into an attack sequence. Transformation takes two main formats. Each format contains two different methods. All transformation actions share a common What dimension, yielding 8 total subcategories. The examples show how A2 and D1 transform single turn task to attack sequences.

Figure 2. The multi-turn attack taxonomy transforms a single-turn harmful task into an attack sequence. Transformation takes two main formats. Each format contains two different methods. All transformation actions share a common What dimension, yielding 8 total subcategories. The examples show how A2 and D1 transform single turn task to attack sequences.

The taxonomy transforms single-turn harmful tasks into multi-turn attack sequences along three dimensions:

Format: How is the transformation structured?