
OpenAI and Paradigm Launch EVMbench as AI Exploits 70 Percent of Critical Smart Contract Bugs That Once Stumped Every Model

Updated: Feb 18, 2026 | By SpendNode Editorial
Disclaimer: This article is provided for informational purposes only and does not constitute financial advice. All fee, limit, and reward data is based on issuer-published documentation as of the date of verification.

Key Analysis

EVMbench benchmarks AI agents on 120 real smart contract vulnerabilities. GPT-5.3-Codex now exploits over 70% of critical bugs, up from under 20%.


The biggest name in artificial intelligence and the most influential venture firm in crypto just built a tool that measures how well AI can break into smart contracts, and the results should make every protocol team reconsider their security stack.

On February 18, 2026, OpenAI and Paradigm released EVMbench, an open evaluation framework that tests AI agents across three dimensions: detecting vulnerabilities, patching them, and exploiting them for profit. The benchmark draws on 120 curated vulnerabilities from 40 completed audits, most sourced from Code4rena's competitive audit contests. When the project started, the best AI models could exploit fewer than 20% of critical, fund-draining bugs. As of the launch, GPT-5.3-Codex cracks over 70%.

From Under 20% to 70%: The Exploit Rate That Changes Everything

The headline number is hard to ignore. An improvement of more than 3.5x in AI exploit capability did not happen over years; it happened within a single model generation.

EVMbench's vulnerability dataset is not synthetic. These are real bugs that cost real money in real protocols. Code4rena audits involve over 100 security researchers competing to find flaws in production-ready codebases. The vulnerabilities EVMbench curates from these competitions represent the hardest class of smart contract bugs: the ones sophisticated human auditors flag as critical and fund-draining.

When OpenAI and Paradigm began calibrating EVMbench, top-tier language models struggled with these bugs. Fewer than one in five critical vulnerabilities could be exploited by the best available AI. Today, GPT-5.3-Codex exploits more than seven out of ten. The benchmark also includes vulnerability scenarios from the Tempo blockchain, a purpose-built Layer 1 designed for high-throughput stablecoin payments, extending the evaluation into payment-specific smart contract code.

The progression signals something the crypto security industry has debated for years: AI is no longer a supplementary tool for auditors. It is approaching the capability threshold where it can match, and in some cases outperform, professional human review on specific vulnerability classes.

What EVMbench Actually Tests

The benchmark operates in three modes, each measuring a different facet of AI security capability.

Exploit mode asks the agent to identify a vulnerability and write working proof-of-concept exploit code. This is the most demanding task. The agent must understand the contract's logic, identify the flaw, and construct a transaction sequence that drains funds or manipulates state. Custom graders verify exploit success, and the team red-teamed environments to prevent agents from cheating.
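
To make that grading step concrete, here is a minimal sketch of the general pattern: replay the agent-written proof of concept on a forked chain and check that value moved from the victim contract to the attacker. This is an illustration, not EVMbench's published grader; the RPC endpoint, account addresses, and run_poc helper are hypothetical inputs a harness would supply.

```python
# Sketch of an exploit-success check (not EVMbench's actual grader).
# Assumes a local fork node (e.g. anvil/hardhat) and a harness that supplies
# the attacker address, the victim contract address, and the agent's PoC.
from web3 import Web3

def grade_exploit(rpc_url: str, attacker: str, victim: str, run_poc) -> bool:
    """Return True if the proof of concept measurably drains the victim."""
    w3 = Web3(Web3.HTTPProvider(rpc_url))

    attacker_before = w3.eth.get_balance(attacker)
    victim_before = w3.eth.get_balance(victim)

    run_poc(w3)  # agent-written exploit: a sequence of signed transactions

    attacker_after = w3.eth.get_balance(attacker)
    victim_after = w3.eth.get_balance(victim)

    # Success criterion: the attacker gained funds and the victim lost them.
    return attacker_after > attacker_before and victim_after < victim_before
```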

Detect mode evaluates whether an agent can identify the vulnerability and describe it accurately, without necessarily writing exploit code. This mirrors the workflow of a human security researcher conducting a preliminary review.

Patch mode tests defensive capability. Given a vulnerable contract, the agent must produce a fix that eliminates the vulnerability without introducing breaking changes to the contract's intended functionality. This is arguably the most commercially valuable mode, because it maps directly to what protocol teams need after a bug is found.
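
The same idea can be sketched for patching: the task's known exploit must stop working while the contract's intended behavior keeps passing its tests. The run_exploit and run_functional_tests helpers below are hypothetical stand-ins for whatever the task environment provides; this illustrates the grading criterion rather than EVMbench's implementation.

```python
# Sketch of a patch grader (assumed workflow, not EVMbench's published code).
def grade_patch(run_exploit, run_functional_tests) -> bool:
    exploit_still_works = run_exploit()           # True if funds can still be drained
    behavior_preserved = run_functional_tests()   # True if intended-use tests pass

    # A valid patch closes the hole without breaking legitimate functionality.
    return (not exploit_still_works) and behavior_preserved
```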

Each task runs in a containerized environment with verified answer keys confirming solvability. Task quality control came from Paradigm's domain expertise, supplemented by automated auditing agents that cross-checked environment soundness.
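
As a rough illustration of what one of these tasks might carry, the sketch below models a per-task specification with a mode, an isolated environment, and a reference answer. The field names are assumptions inferred from the description above, not EVMbench's published format.

```python
# Hypothetical per-task specification, illustrating the pieces described above.
from dataclasses import dataclass

@dataclass
class TaskSpec:
    task_id: str          # e.g. an identifier tied to the original audit finding
    mode: str             # "detect", "patch", or "exploit"
    container_image: str  # isolated environment the agent runs inside
    contract_path: str    # vulnerable contract source within the container
    answer_key: str       # verified reference finding/patch/exploit proving solvability
```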

OtterSec, a well-known Web3 security firm, contributed frontend support for the auditing agent interface that extends the benchmark harness into a usable tool.

$100 Billion in Open-Source Contracts Now Has a New Adversary

Paradigm frames the motivation bluntly: over $100 billion in assets sits in open-source crypto contracts. Every line of that code is visible to anyone, including AI agents that can now process and reason about it at scale.

The security implications cut both ways. On the defensive side, protocol teams can now run AI auditing agents as a first pass before committing to expensive manual audits. A comprehensive manual audit from a top-tier firm can cost $200,000 to $500,000 and take weeks. An AI agent running EVMbench-calibrated models can scan the same codebase in hours, flagging the highest-risk patterns for human review.
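
Teams that want to approximate that first pass today can already do so with off-the-shelf tooling: feed each contract file to a capable model and collect its flagged findings for human triage. The sketch below uses the standard OpenAI Python SDK; the model name and prompt are placeholders, and this is a generic pattern, not the EVMbench auditing agent itself.

```python
# Generic first-pass AI review loop (illustrative; not EVMbench's agent).
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are a smart contract auditor. List any critical or high-severity "
    "vulnerabilities in the following Solidity source, with line references:\n\n"
)

def first_pass_audit(contracts_dir: str, model: str = "gpt-5.3-codex") -> dict:
    """Return a mapping of contract path -> model-flagged findings."""
    findings = {}
    for path in Path(contracts_dir).rglob("*.sol"):
        response = client.chat.completions.create(
            model=model,  # placeholder model name
            messages=[{"role": "user", "content": PROMPT + path.read_text()}],
        )
        findings[str(path)] = response.choices[0].message.content
    return findings  # hand the flagged findings to human reviewers
```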

On the offensive side, the same capability is available to attackers. The Moonwell oracle exploit that drained $1.78 million in four minutes was triggered by a single misconfigured price feed. An AI agent with EVMbench-level exploit capability could theoretically scan hundreds of protocols simultaneously, identifying similar misconfigurations before any human notices.

The crypto industry lost approximately $17 billion to scams and hacks in 2025. While most losses stemmed from social engineering and operational failures rather than on-chain code exploits, smart contract vulnerabilities still contributed hundreds of millions in damages. DeFi-specific exploit losses reached $3.1 billion in the first half of 2025 alone.

That an AI can exploit 70% of critical bugs does not mean 70% of DeFi is vulnerable. EVMbench tests against known, curated vulnerabilities. But it does mean the bar for what counts as "good enough" security just rose dramatically.

What This Means for DeFi Protocols and Their Users

The practical impact lands on two groups: the protocols that build smart contracts and the users who trust them with funds.

For protocols, EVMbench creates a new baseline expectation. If an AI agent can find and exploit a bug in your contract, and that agent is freely available as an open benchmark, shipping without AI-augmented auditing looks less like an innocent oversight and more like negligence. The authors state explicitly that "a growing portion of audits in the future will be done by agents."

For users, particularly those who interact with DeFi through self-custody wallets and cards, the development is mixed. Better AI auditing tools should reduce the frequency of exploits in well-maintained protocols. But the same capability in adversarial hands raises the ceiling on how quickly new vulnerabilities can be discovered and exploited after deployment.

Users of crypto cards that connect to DeFi protocols, such as ether.fi cards that tap into restaking yields or Gnosis Pay cards that spend directly from on-chain balances, have direct exposure to smart contract risk. Stronger AI auditing tools reduce that risk, but only if the protocols they depend on actually adopt them.

The timing also matters. EVMbench launches against a grim backdrop: January 2026 alone saw approximately $370 million in crypto losses, with phishing accounting for 84% of the total. Smart contract exploits remain a smaller but high-impact slice of the threat landscape, and AI auditing could meaningfully compress that slice.

The Audit Industry Faces Its SWE-bench Moment

EVMbench is to smart contract security what SWE-bench was to software engineering: a standardized, open benchmark that forces the industry to confront exactly how capable AI has become.

SWE-bench, an academic benchmark that OpenAI later refined into the widely cited SWE-bench Verified subset, measured how well AI could solve real GitHub issues. It accelerated the adoption of AI coding assistants across the software industry by providing undeniable, reproducible evidence of capability. EVMbench is designed to do the same for smart contract security.

The implications for the existing audit industry are significant. Firms like Trail of Bits, OpenZeppelin, Halborn, and Consensys Diligence charge premium rates for manual review. If AI agents can replicate 70% of the exploit-finding capability of a human audit at a fraction of the cost and time, the business model shifts from pure manual review toward human-AI hybrid workflows.

This does not mean human auditors become irrelevant. The remaining 30% of critical vulnerabilities that AI cannot yet exploit likely represent the most complex, context-dependent bugs, exactly the kind that require deep protocol-specific knowledge and creative reasoning. But the 70% that AI can handle represents the bulk of routine vulnerability classes, and automating that layer frees human auditors to focus on the hardest problems.

FAQ

What is EVMbench? EVMbench is an open evaluation framework built by OpenAI and Paradigm that tests AI agents on their ability to detect, patch, and exploit real smart contract vulnerabilities. It includes 120 curated vulnerabilities from 40 audits, primarily sourced from Code4rena competitions.

How well does AI perform on EVMbench? GPT-5.3-Codex, OpenAI's latest code-focused model, can exploit over 70% of critical, fund-draining smart contract bugs in the benchmark. When the project began, the best models exploited fewer than 20%.

Does EVMbench make DeFi less safe? The benchmark is designed to improve security by giving defenders the same tools as potential attackers. By open-sourcing the framework, OpenAI and Paradigm enable protocol teams to test their contracts against the same AI capabilities that adversaries could deploy.

Can I use EVMbench to audit my own contracts? The benchmark harness has been extended into an auditing agent accessible at paradigm.xyz/evmbench, with frontend support from OtterSec. Protocol teams can use it as a first-pass security review.

Overview

OpenAI and Paradigm's EVMbench represents the first standardized benchmark for measuring AI capability in smart contract security. The jump from under 20% to over 70% exploit success rate on critical Code4rena vulnerabilities demonstrates that AI auditing has crossed a practical threshold. With $100 billion in open-source smart contract value at stake, the release pushes the crypto security industry toward AI-augmented auditing as a baseline expectation rather than a luxury. For DeFi users and crypto card holders who depend on smart contract integrity, the development is a net positive: better tools for defenders, provided protocols actually adopt them.

