OpenAI, together with Paradigm, introduced EVMbench, a benchmark that evaluates the ability of AI agents to detect, patch, and exploit high-severity smart contract vulnerabilities. EVMbench draws on 117 curated vulnerabilities from 40 audits.
Detection Summary
- Audits:
40
- Number of audits in EVMbench.
- Gold bugs:
117
- Bugs hand-picked by the team that built EVMbench.
- Matched:
58
- Detection produced a clear exact match to a gold bug.
- Possible:
21
- The pipeline found a semantically adjacent issue that points to the same contract area, bug family, or root-cause theme as the gold finding, but it did not identify the exact finding with enough specificity to count as a strict match.
- Missing:
38
- Detection did not produce a usable corresponding finding.
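The three detection outcomes above should partition the gold set, so their counts must add up to 117. A minimal sketch of that sanity check follows; the variable names are illustrative and not taken from the pipeline itself.

```python
# Counts copied from the Detection Summary; names are hypothetical.
gold_bugs = 117
outcomes = {"matched": 58, "possible": 21, "missing": 38}

# Every gold bug should fall into exactly one outcome bucket.
assert sum(outcomes.values()) == gold_bugs

# Bugs that detection surfaced in any form (strict or adjacent match).
surfaced = outcomes["matched"] + outcomes["possible"]

for name, count in outcomes.items():
    print(f"{name:>8}: {count:3d} ({count / gold_bugs:.2%} of gold bugs)")
print(f"surfaced: {surfaced:3d}")
```

The `surfaced` total is the numerator of the optimistic detection coverage and the denominator of both validation retention scores.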
Scores
- Optimistic detection coverage:
79 / 117 = 0.6752
- Definition: (Matched + Possible) / Gold bugs
- Validation retention on detected bugs, strict:
58 / 79 = 0.7342
- Definition: among bugs detection surfaced (matched + possible), how many were later confirmed by validation. Of the 79 bugs that our detection phase surfaced, 73.42% were confirmed in the validation phase.
- Validation retention on detected bugs, upper bound:
72 / 79 = 0.9114
- Definition: among bugs detection surfaced, how many were confirmed or inconclusive after validation.
- End-to-end validated recall, strict:
76 / 117 = 0.6496
- Definition: if each gold bug is directly checked by validation, how many end up confirmed.
- End-to-end validated recall, upper bound:
105 / 117 = 0.8974
- Definition: if each gold bug is directly checked by validation, how many end up confirmed or inconclusive.
- Recall against EVMbench for the complete pipeline:
58 / 117 = 0.4957
- Definition: of the 117 bugs in the benchmark, our pipeline surfaced 58 as confirmed.
- Optimistic recall against EVMbench for the complete pipeline:
72 / 117 = 0.6154
- Definition: of the 117 bugs in the benchmark, our pipeline surfaced 58 as confirmed and 14 as inconclusive. Counting the inconclusive results as hits gives the optimistic recall for the complete pipeline.
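The scores above are all simple ratios over the reported counts, so they can be recomputed in a few lines. The sketch below uses only numbers stated in this summary; the constant and key names are hypothetical labels chosen for illustration, not identifiers from the pipeline.

```python
# Counts taken from the summary; all names here are illustrative.
GOLD = 117                      # gold bugs in EVMbench
MATCHED, POSSIBLE = 58, 21      # detection outcomes
SURFACED = MATCHED + POSSIBLE   # 79 bugs surfaced by detection
CONFIRMED = 58                  # surfaced bugs confirmed by validation
INCONCLUSIVE = 14               # surfaced bugs left inconclusive
DIRECT_CONFIRMED = 76           # gold bugs confirmed when checked directly
DIRECT_CONFIRMED_OR_INCONCLUSIVE = 105


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator rounded to four decimal places."""
    return round(numerator / denominator, 4)


scores = {
    "optimistic_detection_coverage": ratio(SURFACED, GOLD),
    "validation_retention_strict": ratio(CONFIRMED, SURFACED),
    "validation_retention_upper": ratio(CONFIRMED + INCONCLUSIVE, SURFACED),
    "end_to_end_recall_strict": ratio(DIRECT_CONFIRMED, GOLD),
    "end_to_end_recall_upper": ratio(DIRECT_CONFIRMED_OR_INCONCLUSIVE, GOLD),
    "pipeline_recall": ratio(CONFIRMED, GOLD),
    "pipeline_recall_optimistic": ratio(CONFIRMED + INCONCLUSIVE, GOLD),
}

for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

The two pipeline recall figures reuse the validation counts with the gold set as the denominator, which is why they sit below the corresponding retention scores.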