Executive Summary
We benchmarked NewAgent against Claude3.5 (Claude) to understand its key strengths and areas for improvement. We also provide our Methodology and Key Benchmark Statistics to give context for the how and why of our approach.
Overall, our data shows that NewAgent and Claude are highly competitive, with NewAgent showing a 12% advantage over Claude on specific tasks.
Key Findings
- External Search Tasks: Claude has a ~20% edge over NewAgent when the user asks for external examples similar to the vulnerabilities discussed in the CVE report.
- Report Focus Tasks: NewAgent outperforms Claude in 30% of the conversations where the higher-quality response must focus on details already present in the CVE report.
- Writer Intensive Tasks: On writer-intensive tasks, the two agents are rated equally or Claude holds a small but consistent edge.
- The overall behavior of NewAgent and Claude is captured in Agent Behavior Stats. It shows a healthy distribution across writer-* states, which speaks to the wider set of capabilities that NewAgent's writer supports.
More information about all of these conclusions can be found in the Key Findings section and in the Task Performance Table.
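The per-task preference rates above can be derived from pairwise ratings. A minimal sketch of that computation (the rating structure and names here are illustrative, not our actual harness):

```python
from collections import Counter

# Hypothetical ratings: (task_type, winner) pairs, where winner is
# "NewAgent", "Claude", or "tie". Values below are illustrative only.
ratings = [
    ("external_search", "Claude"),
    ("external_search", "NewAgent"),
    ("report_focus", "NewAgent"),
    ("report_focus", "tie"),
]

def preference_rates(ratings):
    """Return {task_type: {winner: fraction of that task's ratings}}."""
    by_task = {}
    for task, winner in ratings:
        by_task.setdefault(task, Counter())[winner] += 1
    return {
        task: {w: n / sum(counts.values()) for w, n in counts.items()}
        for task, counts in by_task.items()
    }
```

Percentages such as the ~20% External Search edge correspond to the gap between the two agents' fractions within a task type.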
Key Recommendations
- Our key findings on External Search Tasks and Report Focus Tasks indicate that NewAgent's output could be greatly enhanced when the right external knowledge is available in its context. This can be achieved by developing a retrieval system backing NewAgent, or by improving an existing one. Using our benchmarking system, we can help the team strike the right tradeoff between focusing on the CVE report and tapping into external knowledge sources to best serve the user's needs.
- The key findings on Writer Intensive Tasks indicate that the NewAgent team can streamline resources (prompts/code) by "passing through" such tasks to Claude, avoiding the need to maintain them.
Findings Index
Key Benchmark Statistics
- We conducted 50 interviews of 3 questions each, spanning 12 CVE reports with 3-5 interviews per report to ensure diversity of topics.
- Our security engineer was encouraged to simulate questions, in a logically consistent manner, with the following intents: ask_source, ask_simplify, ask_actionable_steps, ask_title, ask_detail, search_similar, ask_summary, review, ask_templatize, ask_update.
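The interview set described above (50 interviews, 3 questions each, 12 reports, questions drawn from the intent list) could be assembled as follows. This is a sketch under stated assumptions, not our actual generation code; the function and report IDs are hypothetical:

```python
import random

# Intent list taken verbatim from the methodology above.
INTENTS = [
    "ask_source", "ask_simplify", "ask_actionable_steps", "ask_title",
    "ask_detail", "search_similar", "ask_summary", "review",
    "ask_templatize", "ask_update",
]

def build_interviews(report_ids, total=50, questions_per_interview=3, seed=0):
    """Spread `total` interviews round-robin over the reports (keeping
    each report's share in the 3-5 range for 50 interviews over 12
    reports) and draw distinct intents for each interview's questions."""
    rng = random.Random(seed)
    interviews = []
    while len(interviews) < total:
        report = report_ids[len(interviews) % len(report_ids)]
        intents = rng.sample(INTENTS, questions_per_interview)
        interviews.append({"report": report, "intents": intents})
    return interviews

# Hypothetical report identifiers, one per CVE report in the benchmark.
interviews = build_interviews([f"CVE-{i}" for i in range(12)])
```

In practice the engineer wrote the questions by hand; a sampler like this only illustrates how the coverage constraints fit together.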
Benchmark Report
Methodology