QoR Highlights: CVDP

The Benchmark

The Comprehensive Verilog Design Problems (CVDP) benchmark is the field's most rigorous evaluation of RTL automation. It tests systems across four task categories (Code Completion, Spec-to-RTL, Code Modification, and Code Debug) that span a wide range of problem difficulty. We evaluate seven generations of the Agentrys agent (Gen 0 through Gen 6) against ACE-RTL, the current state-of-the-art specialized system.

Results

Agentrys reaches 95.8% overall pass rate at Gen 6 — surpassing ACE-RTL's 88.9% by 6.9 points. On Code Debug, the system achieves a perfect 100% pass rate. Performance improves almost monotonically across generations, validating the self-improving loop at the core of the ADA framework.

Fig. 1. CVDP pass rate (%) across seven agent generations (Gen 0 to Gen 6) versus the ACE-RTL SOTA and Claude Opus 4.5 baselines. Agentrys (blue) shows consistent improvement across all task categories. ACE-RTL SOTA (orange dashed) is the fixed specialized baseline; Claude Opus 4.5 (green circle) is the zero-shot LLM baseline.

Task	Agentrys (Gen 6)	ACE-RTL (SOTA)	Claude Opus 4.5
Overall	95.8%	88.9%	50.1%
Code Completion	96.8%	80.8%	42.4%
Spec-to-RTL	96.2%	96.2%	54.3%
Code Modification	90.9%	90.9%	52.1%
Code Debug	100.0%	91.4%	57.3%

The sharpest gains appear on tasks that demand multi-step reasoning and tool use. On Code Completion, Agentrys leads ACE-RTL by 16 points; on Code Debug, it reaches a perfect score — 8.6 points ahead of SOTA — where agentic iteration over simulation feedback provides the clearest advantage over static, non-improving baselines.
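The point gaps quoted above follow directly from the results table. As a quick check, the short Python sketch below recomputes them. The dictionary keys and the `gap` helper are illustrative names chosen here, not part of any Agentrys tooling.

```python
# Pass rates (%) copied from the results table: Agentrys Gen 6, ACE-RTL, Claude Opus 4.5.
PASS_RATES = {
    "Overall":           {"agentrys": 95.8,  "ace_rtl": 88.9, "opus_4_5": 50.1},
    "Code Completion":   {"agentrys": 96.8,  "ace_rtl": 80.8, "opus_4_5": 42.4},
    "Spec-to-RTL":       {"agentrys": 96.2,  "ace_rtl": 96.2, "opus_4_5": 54.3},
    "Code Modification": {"agentrys": 90.9,  "ace_rtl": 90.9, "opus_4_5": 52.1},
    "Code Debug":        {"agentrys": 100.0, "ace_rtl": 91.4, "opus_4_5": 57.3},
}


def gap(task: str, baseline: str = "ace_rtl") -> float:
    """Percentage-point lead of Agentrys (Gen 6) over the chosen baseline."""
    row = PASS_RATES[task]
    # Round to one decimal to match the precision of the table.
    return round(row["agentrys"] - row[baseline], 1)


if __name__ == "__main__":
    for task in PASS_RATES:
        print(f"{task:18s} vs ACE-RTL: {gap(task):+5.1f}   "
              f"vs Claude Opus 4.5: {gap(task, 'opus_4_5'):+5.1f}")
```

On Spec-to-RTL and Code Modification the gap to ACE-RTL is zero, so the overall lead comes from Code Completion and Code Debug.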

Agent Architecture Evolution

The gains across generations come from systematically evolving every aspect of agent design, including the architecture, which grows from a single-agent system at Gen 0 to a 16-agent hierarchical network at Gen 2. Each generation introduces new coordination mechanisms that increase both the breadth of solution exploration and the rigor of verification.

Gen 0

Single-Agent

1 agent · Tools · Skills

A single agent handles the full RTL simulation and debug loop end-to-end, reading the spec, writing RTL, simulating, and iterating on failures.

Read Spec · Write RTL · Simulate · Debug Loop
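The Gen 0 loop can be pictured as a simple generate, simulate, and retry cycle. The sketch below is a minimal illustration under stated assumptions, not the Agentrys implementation. `write_rtl` and `simulate` are hypothetical callables standing in for the LLM call and the simulator run, and the retry budget `max_iters` is an arbitrary choice.

```python
from typing import Callable, Optional, Tuple


def single_agent_loop(
    spec: str,
    write_rtl: Callable[[str, Optional[str]], str],   # hypothetical: (spec, feedback) -> RTL text
    simulate: Callable[[str], Tuple[bool, str]],      # hypothetical: RTL text -> (passed, log)
    max_iters: int = 5,
) -> Optional[str]:
    """Write RTL, simulate it, and feed failure logs back until it passes."""
    feedback: Optional[str] = None
    for _ in range(max_iters):
        rtl = write_rtl(spec, feedback)
        passed, log = simulate(rtl)
        if passed:
            return rtl
        feedback = log  # the next attempt sees the simulation failure
    return None  # budget exhausted without a passing design


# Toy stand-ins so the sketch runs without an LLM or a simulator.
def fake_write_rtl(spec: str, feedback: Optional[str]) -> str:
    # First attempt is wrong; the retry "fixes" it once feedback is available.
    return "assign y = a & b;" if feedback else "assign y = a | b;"


def fake_simulate(rtl: str) -> Tuple[bool, str]:
    ok = "&" in rtl
    return ok, ("" if ok else "mismatch: expected AND behaviour")


if __name__ == "__main__":
    print(single_agent_loop("2-input AND gate", fake_write_rtl, fake_simulate))
```

In this sketch the simulation log is the only signal carried between iterations, which corresponds to the "iterating on failures" step in the description.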

Gen 1

Multi-Agent System

5 agents

An orchestrator coordinates three parallel RTL designers and a reviewer, enabling concurrent solution exploration with quality-gated aggregation.

Orchestrator · RTL Designer ×3 · Reviewer
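One way to read "concurrent solution exploration with quality-gated aggregation" is sketched below. This is an assumption about the shape of the Gen 1 system, not its actual code. The designers are modelled as plain callables, and the hypothetical `review` function returns a score for an accepted candidate or `None` for a rejected one.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def orchestrate(
    spec: str,
    designers: List[Callable[[str], str]],            # hypothetical RTL designer agents
    review: Callable[[str, str], Optional[float]],    # hypothetical reviewer: score, or None to reject
) -> Optional[str]:
    """Run all designers concurrently and return the best candidate the reviewer accepts."""
    with ThreadPoolExecutor(max_workers=len(designers)) as pool:
        candidates = list(pool.map(lambda design: design(spec), designers))

    # Quality gate: keep only candidates the reviewer accepted.
    accepted = []
    for candidate in candidates:
        score = review(spec, candidate)
        if score is not None:
            accepted.append((score, candidate))

    if not accepted:
        return None
    return max(accepted, key=lambda pair: pair[0])[1]


if __name__ == "__main__":
    toy_designers = [lambda s: "cand_a", lambda s: "cand_b", lambda s: "cand_c"]
    toy_scores = {"cand_a": 0.4, "cand_b": None, "cand_c": 0.9}
    print(orchestrate("fifo spec", toy_designers, lambda s, c: toy_scores[c]))
```

Threads are used here only because a designer call would typically be I/O-bound (waiting on a model or a simulator); the source does not say how the parallelism is implemented.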

Gen 2

Hierarchical Agent Network

16 agents

Three independent solver teams compete and debate. An evidence-weighted aggregator and adversarial verifier resolve disagreements before a simulation specialist validates the final output.

Orchestrator · Solver Team ×3 · Aggregator · Adversarial Verifier · Sim Specialist

The Gen 2 hierarchical network adds a communication and debate layer: after each solver team independently produces a candidate solution, an evidence-weighted aggregator reconciles the outputs and an adversarial verifier stress-tests the result before a simulation specialist validates it. This pipeline reduces both false positives and convergence failures. Each generation is itself a product of the self-improving loop — built from what the previous generation learned running real design workflows.
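The aggregate, verify, and simulate pipeline described above can be sketched as follows. The source does not define how evidence is weighted, so this example assumes each solver team attaches a numeric weight to its proposal (for instance, the number of testbench checks it passed). All function and parameter names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple


def rank_by_evidence(proposals: List[Tuple[str, float]]) -> List[str]:
    """Sum the evidence weight behind each distinct candidate; return the strongest first."""
    totals: Dict[str, float] = defaultdict(float)
    for candidate, weight in proposals:
        totals[candidate] += weight
    return sorted(totals, key=lambda c: totals[c], reverse=True)


def resolve(
    proposals: List[Tuple[str, float]],               # (candidate RTL, assumed evidence weight) per team
    adversarial_checks: List[Callable[[str], bool]],  # hypothetical stress tests; True means it survives
    simulate: Callable[[str], bool],                  # hypothetical final simulation; True means pass
) -> Optional[str]:
    """Return the best-supported candidate that survives verification and simulation."""
    for candidate in rank_by_evidence(proposals):
        if all(check(candidate) for check in adversarial_checks) and simulate(candidate):
            return candidate
    return None  # no candidate survived; a real system would presumably iterate again


if __name__ == "__main__":
    team_outputs = [("design_A", 3.0), ("design_B", 2.0), ("design_B", 2.0)]
    no_latches = lambda rtl: rtl != "design_B"  # toy adversarial check that rejects design_B
    print(resolve(team_outputs, [no_latches], lambda rtl: True))
```

In this toy run, two teams agree on `design_B`, so it ranks first by summed weight, but the adversarial check rejects it and the pipeline falls back to `design_A`. This illustrates how a verification stage can catch a popular but flawed answer.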

References

CVDP Benchmark: Comprehensive Verilog Design Problems. arxiv.org/abs/2506.14074
ACE-RTL: When Agentic Context Evolution Meets RTL-Specialized LLMs. arxiv.org/abs/2602.10218

Cite this work

@misc{agentrys2026cvdp,
  title   = {Agentrys: Self-improving Agent Solving CVDP RTL Coding Tasks},
  author  = {Tsai, Yun-Da and Ding, Duo and Li, Wuxi and Ren, Haoxing},
  year    = {2026},
  month   = {February},
  url     = {https://agentrys.ai/blog-cvdp.html}
}