AI Agent Eval | Notion

AI Testing and Evaluation: Learnings from Science and Industry

Discover how Microsoft is learning from other domains to advance evaluation and testing as a pillar of AI governance.

Introducing debug-gym

A natural research question emerges: to what degree can LLMs use interactive debugging tools such as pdb? To explore this question, we released debug-gym(opens in new tab) – an environment that allows code-repairing agents to access tools for active information-seeking behavior. Debug-gym expands an agent’s action and observation space with feedback from tool usage, enabling setting breakpoints, navigating code, printing variable values, and creating test functions. Agents can interact with tools to investigate code or rewrite it, if confident. We believe interactive debugging with proper tools can empower coding agents to tackle real-world software engineering tasks and is central to LLM-based agent research. The fixes proposed by a coding agent with debugging capabilities, and then approved by a human programmer, will be grounded in the context of the relevant codebase, program execution and documentation, rather than relying solely on guesses based on previously seen training data.

Figure 1: Diagram demonstrating the code-repairing process in outline. Left: conventional code-repairing system; right: additional tools enabled by debug-gym.

Figure 1: Diagram demonstrating the code-repairing process in outline. In most existing approaches (shown in black), an agent rewrites its code conditioned on error messages obtained from executing the code. debug-gym equips the agent with additional tools such as pdb (shown in red), so it can interactively seek necessary information from the semantic space hidden behind the code and therefore have better code-repairing performance.

Debug-gym is designed and developed to:

Handle repository-level information: the full repository is available to agents in debug-gym, allowing them to navigate and edit files.
Be robust and safe: to safeguard both the system and the development process, debug-gym runs code within sandbox Docker containers. This isolates the runtime environment, preventing harmful actions while still allowing thorough testing and debugging.
Be easily extensible: debug-gym was conceived with extensibility in mind and provides practitioners with the possibility of easily adding new tools.
Be text-based: debug-gym represents observation information in structured text (e.g., JSON format) and defines a simple syntax for text actions, making the environment fully compatible with modern LLM-based agents.

With debug-gym, researchers and developers can specify a folder path to work with any custom repository to evaluate their debugging agent’s performance. Additionally, debug-gym includes three coding benchmarks to measure LLM-based agents’ performance in interactive debugging: Aider for simple function-level code generation, Mini-nightmare for short, hand-crafted buggy code examples, and SWE-bench for real-world coding problems requiring a comprehensive understanding of a large codebase and a solution in the format of a GitHub pull request.

To learn more about debug-gym and start using it to train your own debugging agents, please refer to the technical report(opens in new tab) and GitHub(opens in new tab).

AI Testing and Evaluation: Learnings from Science and Industry

Introducing debug-gym

Future work