Agentic Design Automation for Chips
The results below come from a purpose-built multi-agent system for design verification. Rather than prompting a single LLM to one-shot a testbench, Agentrys decomposes each task across a team of proprietary sub-agents — a planner that reads the spec and scopes coverage, specialist generators for stimulus, checkers, and assertions, and an adversarial verifier that stress-tests every candidate before it counts as a pass.
Each sub-agent is backed by Agentrys' built-in tools and skills: agent-native wrappers around EDA simulators, coverage and waveform readers, and reusable verification skills distilled from real design workflows. The agents drive these tools directly, read back simulation and coverage signals, and iterate — closing the loop that static, non-agentic prompting cannot.
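The generate, simulate, and iterate loop described above can be sketched as follows. This is a minimal illustration, not Agentrys' implementation: `SimResult`, `refine_until_pass`, the coverage goal, and the stub generator and simulator are all hypothetical names and values introduced here. A real setup would call an LLM and an EDA simulator in their place.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class SimResult:
    """What a (hypothetical) simulator wrapper reports back to the agent."""
    compiled: bool
    passed: bool
    coverage: float  # fraction in [0, 1]
    log: str


def refine_until_pass(
    generate: Callable[[str, Optional[SimResult]], str],
    simulate: Callable[[str], SimResult],
    spec: str,
    coverage_goal: float = 0.9,
    max_iters: int = 5,
) -> Tuple[Optional[str], List[SimResult]]:
    """Generate a candidate, run it, and feed the result back until it passes."""
    history: List[SimResult] = []
    feedback: Optional[SimResult] = None
    for _ in range(max_iters):
        candidate = generate(spec, feedback)
        feedback = simulate(candidate)
        history.append(feedback)
        if feedback.compiled and feedback.passed and feedback.coverage >= coverage_goal:
            return candidate, history
    return None, history  # budget exhausted without a passing candidate


# Stand-ins so the sketch runs without an LLM or a simulator.
def fake_generate(spec: str, feedback: Optional[SimResult]) -> str:
    # Each round of feedback "adds" one more test to the candidate.
    n_tests = 1 if feedback is None else int(feedback.log) + 1
    return f"// testbench for {spec} with {n_tests} tests"


def fake_simulate(candidate: str) -> SimResult:
    n_tests = int(candidate.split(" with ")[1].split()[0])
    coverage = min(1.0, 0.25 * n_tests)  # made-up coverage model
    return SimResult(compiled=True, passed=True, coverage=coverage, log=str(n_tests))


candidate, history = refine_until_pass(fake_generate, fake_simulate, "fifo")
print(len(history), history[-1].coverage)
```

With these stubs the loop needs four iterations to reach the coverage goal. The point is the shape of the control flow: the simulator's result is the only input the next generation step receives.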
Tying it together are Agentrys' optimization algorithms: a self-learning loop that improves the agents from their own feedback with no human intervention, and a model router that allocates a mixture of LLMs across sub-agents to co-optimize token cost against accuracy. This is what lifts the hardest CVDP categories from published baselines in the 6–25% range to the 90–96% pass rates reported here.
The Benchmark
The Comprehensive Verilog Design Problems (CVDP) benchmark is the field's most rigorous evaluation of RTL design and verification, comprising 783 problems across 13 task categories authored by experienced hardware engineers. Problems ship in both non-agentic (copilot) and agentic formats, the latter requiring a system to plan, run tools, and iterate over simulation feedback on its own.
What sets CVDP apart from earlier RTL benchmarks is its grounding in real engineering work. The problems span the full hardware development lifecycle — RTL generation, specification alignment, code modification, debugging, design verification, and technical Q&A — and range from small combinational blocks to multi-module designs with intricate timing and protocol behavior. Every problem is paired with a containerized test harness built on open-source tools such as Icarus Verilog and Yosys (with commercial simulators like Cadence Xcelium where applicable), so a solution is judged not on textual similarity but on whether it actually compiles, simulates, and passes the checks an engineer would run.
That rigor makes CVDP hard. The benchmark's own evaluation reports state-of-the-art models clearing no more than ~34% pass@1 on code generation, and the gap widens sharply on the verification categories, where even producing syntactically valid testbench code is a challenge. CVDP is, in other words, a benchmark designed to resist one-shot prompting — exactly the regime where an agentic, tool-driven, self-improving system has the most room to pull ahead.
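For readers unfamiliar with the metric, pass@1 can be computed with the commonly used unbiased pass@k estimator, sketched below. How many samples per problem CVDP draws is not stated in this post, so the sample counts in the example are assumptions, and `pass_at_k` and `benchmark_pass_at_1` are illustrative names.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, of which c passed."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: List[Tuple[int, int]]) -> float:
    """Average pass@1 over problems, given (n_samples, n_passed) per problem."""
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)


# Three hypothetical problems with five samples each.
print(benchmark_pass_at_1([(5, 5), (5, 0), (5, 2)]))
```

For k = 1 the estimator reduces to the fraction of samples that pass, so a problem solved in 2 of 5 attempts contributes 0.4.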
The Hardest Categories
The paper singles out one cluster of tasks as the hardest in the entire suite: the Design Verification categories. These ask a system not to write a design but to verify one, and state-of-the-art LLMs consistently fail to produce even syntactically valid testbench code. We report results on all three: testbench stimulus generation (CID12), testbench checker generation (CID13), and assertion generation (CID14).
A New Record-Setting Result
To our knowledge, this is a new record-setting result for the industry on CVDP's Design Verification categories — the three tasks the benchmark singles out as its hardest. Published state-of-the-art systems have been stuck in the 6–25% range on these categories; reaching production-relevant quality has been an open challenge. Agentrys does not merely edge past prior work — it blows past it.
After self-learning, Agentrys reaches a 90.4% total pass rate on stimulus generation (CID12), 95.8% on checker generation (CID13), and 95.8% on assertion generation (CID14). That is a step change — roughly a 4–15× improvement over prior SOTA — and it holds in both the copilot and the harder agentic format, with no human intervention during the learning loop. For the first time, autonomous design verification crosses the threshold from research curiosity into production-relevant quality.
| CVDP Category | Copilot | Agentic | Total |
|---|---|---|---|
| CID12 — Testbench Stimulus Gen. | 92.5% (62/67) | 81.2% (13/16) | 90.4% (75/83) |
| CID13 — Testbench Checker Gen. | 96.2% (51/53) | 94.4% (17/18) | 95.8% (68/71) |
| CID14 — Assertion Generation | 97.0% (65/67) | 92.9% (26/28) | 95.8% (91/95) |
The pattern is consistent across all three categories: the copilot format leads, the agentic format trails on its much smaller problem pools but still clears 81% everywhere, and the combined totals land between 90% and 96%. Checker generation (CID13) ties assertion generation (CID14) for the highest total at 95.8% and posts the best agentic score (94.4%). It is also the closest analog to real verification engineering: the system must reason about what a correct output looks like and build logic to catch deviations.
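The percentages can be recomputed from the pass counts in the table. The short sketch below does that arithmetic; the `results` layout and the helper names are illustrative, and only the counts come from the table.

```python
# (passed, total) per format, copied from the pass counts in the table.
results = {
    "CID12": {"copilot": (62, 67), "agentic": (13, 16)},
    "CID13": {"copilot": (51, 53), "agentic": (17, 18)},
    "CID14": {"copilot": (65, 67), "agentic": (26, 28)},
}


def pct(passed: int, total: int) -> float:
    """Pass rate as a percentage, rounded to one decimal place."""
    return round(100 * passed / total, 1)


def summarize(table: dict) -> dict:
    """Per-format and combined pass rates for each category."""
    rows = {}
    for cid, formats in table.items():
        passed = sum(p for p, _ in formats.values())
        total = sum(t for _, t in formats.values())
        row = {name: pct(p, t) for name, (p, t) in formats.items()}
        row["total"] = pct(passed, total)
        row["counts"] = (passed, total)
        rows[cid] = row
    return rows


rows = summarize(results)
for cid, row in rows.items():
    print(cid, row)
```

The combined totals are pooled over problems (for example 75 of 83 for CID12) and are not an average of the two format percentages, which is why the larger copilot pools dominate them. Python's `round` uses round-half-to-even, so 13/16 = 81.25% prints as 81.2.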
The Self-Learning Loop
None of these numbers come from a fixed, hand-tuned system. Each result is the product of an autonomous self-learning loop: starting from a bootstrap "first shot" — a pipe-cleaning run that establishes the baseline flow — the agent iteratively learns from its own simulation and coverage feedback, with no human intervention, until performance converges.
The full flow runs on autopilot end to end, with pre-built tool-calling feedback loops. What the bootstrap run leaves on the table (malformed checkers, misplaced assertions, insufficient coverage) the self-learning loop closes on its own, turning the benchmark's hardest categories into its most improved.
The loop is also model-agnostic: rather than committing to a single LLM, Agentrys routes work across a mixture of LLMs, co-optimizing token cost against task accuracy. Cheaper models handle high-volume, low-ambiguity steps while stronger models are reserved for the reasoning-heavy verification decisions, so quality of results rises without the cost scaling linearly with it.
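One simple way to express this kind of cost-versus-accuracy routing is to pick the cheapest model whose estimated accuracy on a step clears a threshold. The sketch below illustrates that idea only; it is not Agentrys' router, and the model names, costs, accuracy estimates, and step kinds are all made up for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Model:
    name: str
    cost_per_1k_tokens: float
    est_accuracy: Dict[str, float]  # step kind -> estimated pass probability


def route(step_kind: str, models: List[Model], min_accuracy: float) -> Model:
    """Cheapest model that meets the accuracy bar, else the most accurate one."""
    eligible = [m for m in models if m.est_accuracy.get(step_kind, 0.0) >= min_accuracy]
    if eligible:
        return min(eligible, key=lambda m: m.cost_per_1k_tokens)
    return max(models, key=lambda m: m.est_accuracy.get(step_kind, 0.0))


# Hypothetical model pool; every number here is illustrative.
models = [
    Model("small", 0.1, {"format_stimulus": 0.95, "write_checker": 0.60}),
    Model("large", 1.0, {"format_stimulus": 0.97, "write_checker": 0.90}),
]

print(route("format_stimulus", models, min_accuracy=0.90).name)  # low-ambiguity step
print(route("write_checker", models, min_accuracy=0.85).name)    # reasoning-heavy step
```

In this toy pool the low-ambiguity step goes to the cheaper model and the reasoning-heavy step goes to the stronger one. In a self-learning setting the accuracy estimates would presumably be updated from observed pass rates instead of being fixed constants.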
More about Agentrys
Build your own continuously improving agentic design workforce. Agentrys helps semiconductor teams build AI agents, create agent-native EDA tools, and continuously learn from design workflows to make chip design faster, more scalable, and more efficient.
Learn more at agentrys.ai or get in touch at info@agentrys.ai.
References
- CVDP Benchmark: "Comprehensive Verilog Design Problems: A Next-Generation Benchmark Dataset for Evaluating Large Language Models and Agents on RTL Design and Verification," arxiv.org/abs/2506.14074
Cite this work
@misc{agentrys2026cvdpverification,
title = {Agentrys: Cracking CVDP's Hardest Tasks — Autonomous Verification at Scale},
author = {Ding, Duo and Fu, Geng and Tsai, Yun-Da and Li, Wuxi and Ren, Haoxing},
year = {2026},
month = {June},
url = {https://agentrys.ai/blog-cvdp-verification.html}
}