QoR Highlights: CVDP Verification

Agentic Design Automation for Chips

The results below come from a purpose-built multi-agent system for design verification. Rather than prompting a single LLM to one-shot a testbench, Agentrys decomposes each task across a team of proprietary sub-agents — a planner that reads the spec and scopes coverage, specialist generators for stimulus, checkers, and assertions, and an adversarial verifier that stress-tests every candidate before it counts as a pass.

Each sub-agent is backed by Agentrys' built-in tools and skills: agent-native wrappers around EDA simulators, coverage and waveform readers, and reusable verification skills distilled from real design workflows. The agents drive these tools directly, read back simulation and coverage signals, and iterate — closing the loop that static, non-agentic prompting cannot.

Tying it together are Agentrys' optimization algorithms: a self-learning loop that improves the agents from their own feedback with no human intervention, and a model router that allocates a mixture of LLMs across sub-agents to co-optimize token cost against accuracy. Together they lift the hardest CVDP categories from published baselines in the 6–25% range to the 90–96% quality of results reported here.

The Benchmark

The Comprehensive Verilog Design Problems (CVDP) benchmark is the field's most rigorous evaluation of RTL design and verification, comprising 783 problems across 13 task categories authored by experienced hardware engineers. Problems ship in both non-agentic (copilot) and agentic formats, the latter requiring a system to plan, run tools, and iterate over simulation feedback on its own.

What sets CVDP apart from earlier RTL benchmarks is its grounding in real engineering work. The problems span the full hardware development lifecycle — RTL generation, specification alignment, code modification, debugging, design verification, and technical Q&A — and range from small combinational blocks to multi-module designs with intricate timing and protocol behavior. Every problem is paired with a containerized test harness built on open-source tools such as Icarus Verilog and Yosys (with commercial simulators like Cadence Xcelium where applicable), so a solution is judged not on textual similarity but on whether it actually compiles, simulates, and passes the checks an engineer would run.

That rigor makes CVDP hard. The benchmark's own evaluation reports state-of-the-art models clearing no more than ~34% pass@1 on code generation, and the gap widens sharply on the verification categories, where even producing syntactically valid testbench code is a challenge. CVDP is, in other words, a benchmark designed to resist one-shot prompting — exactly the regime where an agentic, tool-driven, self-improving system has the most room to pull ahead.

The Hardest Categories

The CVDP paper singles out one cluster of tasks as the hardest in the entire suite: the Design Verification categories. These ask a system not to write a design but to verify one, and state-of-the-art LLMs consistently fail to produce even syntactically valid testbench code. We report results on all three:

CID12

Testbench Stimulus Gen.

Drive the DUT

Generate stimulus that exercises the design under test, without output-checking logic. Scored by the CVDP-defined pass rate.

Stimulus Coverage-aware

CID13

Testbench Checker Gen.

Catch the bug

Produce checkers that validate device outputs alongside stimulus. Scored on sanity pass rate, with bug-detection rate held to a stricter bar than sanity alone.

Checkers Bug Detection

CID14

Assertion Generation

Prove the property

Generate SystemVerilog assertions for testbenches. Scored by an overall pass rate — the average of simulation and coverage results.

SVA Sim + Coverage

A New Record-Setting Result

To our knowledge, this is a new record-setting result for the industry on CVDP's Design Verification categories — the three tasks the benchmark singles out as its hardest. Published state-of-the-art systems have been stuck in the 6–25% range on these categories; reaching production-relevant quality has been an open challenge. Agentrys does not merely edge past prior work — it blows past it.

After self-learning, Agentrys reaches a 90.4% total pass rate on stimulus generation (CID12), 95.8% on checker generation (CID13), and 95.8% on assertion generation (CID14). That is a step change — roughly a 4–15× improvement over prior SOTA — and it holds in both the copilot and the harder agentic format, with no human intervention during the learning loop. For the first time, autonomous design verification crosses the threshold from research curiosity into production-relevant quality.
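As a rough check on the quoted multiplier, the loosest possible bounds can be computed by pairing the extremes of the two ranges given in the text (6–25% for prior work, 90.4–95.8% here). This is only a sketch: the function name is illustrative, and pairing the extremes assumes nothing about which prior score belongs to which category, so the per-category ratios will sit inside these bounds.

```python
def improvement_range(prior_low, prior_high, new_low, new_high):
    """Return the (smallest, largest) multiplicative improvement
    obtainable by pairing the extremes of the two ranges."""
    smallest = new_low / prior_high   # weakest new result vs. best prior
    largest = new_high / prior_low    # best new result vs. weakest prior
    return smallest, largest

# Percentages taken from the surrounding text.
low_x, high_x = improvement_range(6.0, 25.0, 90.4, 95.8)
print(f"{low_x:.1f}x to {high_x:.1f}x")
```

The outer bounds come out at about 3.6× and 16×, which brackets the rounded 4–15× figure.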

CVDP Category	Copilot	Agentic	Total
CID12 — Testbench Stimulus Gen.	92.5% (62/67)	81.2% (13/16)	90.4% (75/83)
CID13 — Testbench Checker Gen.	96.2% (51/53)	94.4% (17/18)	95.8% (68/71)
CID14 — Assertion Generation	97.0% (65/67)	92.9% (26/28)	95.8% (91/95)
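The percentages in the table follow directly from the pass counts in parentheses. The short sketch below recomputes them; the counts are copied from the table, while the function and variable names are illustrative. Note that the Total column pools the copilot and agentic counts rather than averaging the two percentages.

```python
# (passed, total) counts copied from the results table.
RESULTS = {
    "CID12": {"copilot": (62, 67), "agentic": (13, 16)},
    "CID13": {"copilot": (51, 53), "agentic": (17, 18)},
    "CID14": {"copilot": (65, 67), "agentic": (26, 28)},
}

def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage."""
    return 100.0 * passed / total

def summarize(results: dict) -> dict:
    """Compute per-format rates and the pooled total for each category."""
    summary = {}
    for cid, formats in results.items():
        passed = sum(p for p, _ in formats.values())
        total = sum(t for _, t in formats.values())
        summary[cid] = {
            "copilot": pass_rate(*formats["copilot"]),
            "agentic": pass_rate(*formats["agentic"]),
            "total": pass_rate(passed, total),
            "counts": (passed, total),
        }
    return summary

summary = summarize(RESULTS)
for cid, row in summary.items():
    passed, total = row["counts"]
    print(f"{cid}: copilot {row['copilot']:.1f}%  "
          f"agentic {row['agentic']:.1f}%  "
          f"total {row['total']:.1f}% ({passed}/{total})")
```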

The pattern is consistent across all three categories: the copilot format leads, the agentic format trails on its smaller problem pools but still clears 81% everywhere, and the combined totals land between 90% and 96%. Checker generation (CID13) ties assertion generation (CID14) for the highest total at 95.8% and posts the best agentic score. It is also the closest analog to real verification engineering, since the system must reason about what a correct output looks like and build logic to catch deviations.

The Self-Learning Loop

None of these numbers come from a fixed, hand-tuned system. Each result is the product of an autonomous self-learning loop: starting from a bootstrap "first shot" — a pipe-cleaning run that establishes the baseline flow — the agent iteratively learns from its own simulation and coverage feedback, with no human intervention, until performance converges.

The full flow autopilots end to end, with pre-built tool-calling feedback loops. What the bootstrap run leaves on the table — malformed checkers, misplaced assertions, insufficient coverage — the self-learning loop closes on its own, turning the benchmark's hardest categories into its most improved.
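The shape of such a loop can be sketched generically. The code below is not Agentrys' implementation, which the post does not describe. It is a minimal illustration of "bootstrap, learn from feedback, stop at convergence", and every name in it (`self_learning_loop`, `evaluate`, `improve`, the patience and gain thresholds, the toy functions) is a hypothetical stand-in.

```python
from typing import Callable, List, Tuple

def self_learning_loop(
    evaluate: Callable[[dict], float],
    improve: Callable[[dict, float], dict],
    bootstrap_state: dict,
    max_rounds: int = 20,
    patience: int = 2,
    min_gain: float = 0.005,
) -> Tuple[dict, List[float]]:
    """Iterate improve -> evaluate until the pass rate stops improving."""
    best_state = bootstrap_state
    best = evaluate(best_state)      # bootstrap "first shot" baseline
    history = [best]
    stale = 0
    for _ in range(max_rounds):      # learn from the latest feedback
        candidate = improve(best_state, best)
        rate = evaluate(candidate)
        history.append(rate)
        if rate > best + min_gain:   # keep only real improvements
            best_state, best, stale = candidate, rate, 0
        else:
            stale += 1
            if stale >= patience:    # no gain for `patience` rounds
                break                # treat as converged
    return best_state, history

# Toy stand-ins for simulation/coverage feedback (purely illustrative).
def toy_evaluate(state: dict) -> float:
    return min(0.95, 0.2 + 0.25 * state["round"])

def toy_improve(state: dict, last_rate: float) -> dict:
    return {"round": state["round"] + 1}

final_state, history = self_learning_loop(
    toy_evaluate, toy_improve, {"round": 0}
)
```

In a real flow, `evaluate` would run the generated testbenches through the simulator and coverage tools, and `improve` would revise the agents based on that feedback. Here both are toys, chosen so the loop plateaus after a few rounds.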

The loop is also model-agnostic: rather than committing to a single LLM, Agentrys routes work across a mixture of LLMs, co-optimizing token cost against task accuracy. Cheaper models handle high-volume, low-ambiguity steps while stronger models are reserved for the reasoning-heavy verification decisions, so quality of results rises without cost scaling linearly with it.
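One simple way to picture this cost/accuracy trade-off is a router that picks the cheapest model tier expected to meet a step's accuracy requirement. The sketch below is an assumption-laden illustration, not Agentrys' router: the tier names, prices, accuracy estimates, and step names are all made up for the example.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class ModelTier:
    name: str                  # hypothetical label, not a real model ID
    cost_per_1k_tokens: float  # illustrative price
    expected_accuracy: float   # illustrative estimate in [0, 1]

TIERS = [
    ModelTier("small", 0.2, 0.70),
    ModelTier("medium", 1.0, 0.85),
    ModelTier("large", 5.0, 0.95),
]

def route(required_accuracy: float, tiers: List[ModelTier] = TIERS) -> ModelTier:
    """Pick the cheapest tier whose estimated accuracy meets the
    requirement; fall back to the most accurate tier if none does."""
    eligible = [t for t in tiers if t.expected_accuracy >= required_accuracy]
    if eligible:
        return min(eligible, key=lambda t: t.cost_per_1k_tokens)
    return max(tiers, key=lambda t: t.expected_accuracy)

# Hypothetical per-step accuracy requirements.
STEPS = {
    "format_stimulus": 0.60,    # high-volume, low-ambiguity
    "write_checker": 0.90,      # reasoning-heavy
    "review_assertion": 0.99,   # stricter than any tier's estimate
}

assignments = {step: route(req).name for step, req in STEPS.items()}
print(assignments)
```

A production router would presumably learn its accuracy estimates from observed pass rates rather than fixed constants, but the selection rule conveys why cost need not grow in step with quality.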

Phase 0

First Shot (Bootstrap)

Pipe-cleaning run

The Bootstrap Agent establishes a working end-to-end flow and a baseline pass rate across all three verification categories.

Baseline Flow Setup

Phase 1

Self-Learning / Evolve

No human in the loop

Iterative learning from simulation and coverage feedback with no human intervention, refining stimulus, checkers, and assertions.

Sim Feedback Coverage Iterate

Phase 2

Converged QoR

90–96% pass

Final converged quality of results, reported above as the post-self-learning totals across CID12, CID13, and CID14.

90.4% 95.8% 95.8%

More about Agentrys

Agentic Design Automation

Build your own continuously improving agentic design workforce. Agentrys helps semiconductor teams build AI agents, create agent-native EDA tools, and continuously learn from design workflows to make chip design faster, more scalable, and more efficient.

Learn more at agentrys.ai or get in touch at info@agentrys.ai.

References

CVDP Benchmark. Comprehensive Verilog Design Problems: A Next-Generation Benchmark Dataset for Evaluating Large Language Models and Agents on RTL Design and Verification. arXiv:2506.14074, arxiv.org/abs/2506.14074

Cite this work

@misc{agentrys2026cvdpverification,
  title   = {Agentrys: Cracking CVDP's Hardest Tasks — Autonomous Verification at Scale},
  author  = {Ding, Duo and Fu, Geng and Tsai, Yun-Da and Li, Wuxi and Ren, Haoxing},
  year    = {2026},
  month   = {June},
  url     = {https://agentrys.ai/blog-cvdp-verification.html}
}