Scaling Code Review with AI: Cloudflare's Multi-Agent Orchestration

<h2 id="introduction">Introduction</h2><p>Code review is a cornerstone of modern software development, catching bugs early and spreading knowledge across teams. Yet it can also become a bottleneck, with merge requests languishing in queues as reviewers struggle to context-switch. At Cloudflare, the median wait for a first review often stretched into hours. To address this, we built an AI-powered code review system that uses a coordinated team of specialized agents, dramatically reducing review times while maintaining high quality. This article details our journey from experimentation to production, sharing the architecture and lessons learned.</p><h2 id="the-problem">The Problem with Traditional Code Review</h2><p>Merge requests can stall for many reasons: reviewer availability, cognitive load from context-switching, and an endless cycle of nitpicks and revisions. We saw this firsthand across thousands of internal projects. While automated tools like linters help, they only catch surface-level issues. We needed something that could understand code semantics, flag real bugs, and scale across our diverse codebases.</p><h2 id="early-attempts">Early Attempts: From Off-the-Shelf Tools to Naive Prompts</h2><p>Our first step was evaluating existing AI code review tools. Many worked well and offered customization, but none provided the flexibility needed for an organization of Cloudflare's size. So we pivoted to a DIY approach: feeding git diffs into a large language model with a generic prompt. 
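As a minimal sketch of that DIY pass (the `llm` callable and the prompt wording are hypothetical stand-ins, not Cloudflare's actual setup):

```python
# Illustrative sketch of the naive approach: one generic prompt per diff.
# The `llm` callable stands in for any chat-completion API; its name and the
# prompt text are assumptions for illustration only.
def naive_review(diff: str, llm) -> str:
    """Send the raw git diff to a model with a one-size-fits-all prompt."""
    prompt = (
        "You are a code reviewer. Review the following git diff and "
        "list any problems you find:\n\n" + diff
    )
    return llm(prompt)
```

The prompt gives the model no project context, conventions, or focus area, which is exactly what makes the output noisy.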
The results were noisy—vague suggestions, hallucinated syntax errors, and irrelevant advice like “consider adding error handling” on functions that already had it. Clearly, a naive approach wouldn't work for complex codebases.</p><h2 id="the-solution">The Solution: Multi-Agent Orchestration</h2><p>Instead of building a monolithic reviewer, we created a CI-native orchestration system atop OpenCode, an open-source coding agent. Now, when a Cloudflare engineer opens a merge request, it gets an initial pass from a coordinated team of up to seven specialized AI agents:</p><ul><li><strong>Security</strong> – Scans for vulnerabilities and insecure patterns</li><li><strong>Performance</strong> – Identifies potential slowdowns and inefficiencies</li><li><strong>Code Quality</strong> – Checks for type errors, dead code, and stylistic issues</li><li><strong>Documentation</strong> – Ensures comments and docs are accurate and complete</li><li><strong>Release Management</strong> – Verifies version bumps and changelog entries</li><li><strong>Compliance</strong> – Enforces internal Engineering Codex rules</li><li><strong>Coordinator</strong> – Deduplicates findings, judges severity, and posts a single structured review</li></ul><h3>How the Coordinator Works</h3><p>The coordinator agent is the linchpin. It collects outputs from all specialists, removes duplicates, evaluates the true severity of each issue (e.g., blocking vs. advisory), and compiles a single, readable comment. This prevents the noise of multiple overlapping suggestions and gives engineers a clear action list. 
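A simplified sketch of that merge step, under assumed names (`Finding`, `Severity`, and the dedup key are illustrative; Cloudflare's actual implementation is not public):

```python
# Hypothetical model of the coordinator's merge step: deduplicate overlapping
# specialist reports, keep the highest severity per location, and derive a
# single verdict for the merge request.
from dataclasses import dataclass
from enum import IntEnum

class Severity(IntEnum):
    ADVISORY = 1
    WARNING = 2
    BLOCKING = 3

@dataclass(frozen=True)
class Finding:
    agent: str       # which specialist reported it, e.g. "security"
    file: str
    line: int
    message: str
    severity: Severity

def merge_findings(findings: list[Finding]) -> list[Finding]:
    """Deduplicate reports on the same location, keeping the highest severity."""
    best: dict[tuple[str, int, str], Finding] = {}
    for f in findings:
        key = (f.file, f.line, f.message.lower())
        if key not in best or f.severity > best[key].severity:
            best[key] = f
    # Surface blocking issues first so engineers see them at the top
    return sorted(best.values(), key=lambda f: -f.severity)

def verdict(findings: list[Finding]) -> str:
    """Clean code is approved; blocking issues stop the merge."""
    if any(f.severity == Severity.BLOCKING for f in findings):
        return "block"
    return "approve" if not findings else "comment"
```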
The system can automatically approve clean code, flag real bugs, and even block merges when it detects serious problems or security vulnerabilities.</p><h2 id="results-and-impact">Results and Impact</h2><p>We've run this system internally across tens of thousands of merge requests. Key outcomes include:</p><ul><li>Reduced median first-review wait time from hours to minutes</li><li>High accuracy in identifying genuine bugs and security issues</li><li>Automatic approval for trivial or well-tested changes</li><li>Blocking merges only for critical problems, minimizing developer frustration</li></ul><p>This system is part of our broader Code Orange: Fail Small initiative, aimed at improving engineering resiliency.</p><h2 id="architecture-deep-dive">Architecture Deep Dive</h2><p>Building an LLM-powered system at the heart of CI/CD presented unique challenges. We had to handle model latency, API failures, and varying output formats. Our architecture uses a plugin-based design: each specialist is a modular plugin with a specific prompt and context. The coordinator uses a lightweight LLM call to merge results. This modularity lets us add or swap agents without rebuilding the whole system. We also implemented guardrails to prevent the system from becoming a blocker—for example, if the coordinator times out, the review defaults to a human-friendly summary.</p><h3>Lessons Learned</h3><p>We discovered that:</p><ol><li>Specialization beats generalization. 
A single model with a massive prompt produced worse results than multiple targeted models.</li><li>Deduplication is critical. Without it, engineers would ignore the output as noise.</li><li>Severity estimation requires careful tuning. Overly aggressive blocking erodes trust.</li><li>The system must be fast. Engineers won't wait minutes for an AI review during a hotfix.</li></ol><h2 id="conclusion">Conclusion</h2><p>AI-assisted code review can be both scalable and reliable when built as an orchestration of specialized agents rather than a monolithic black box. At Cloudflare, this system has cut review wait times, caught real bugs, and become a trusted part of our development workflow. We're excited to continue refining it and sharing our findings with the community.</p><p>For more details, see <a href="#the-problem">our initial challenges</a> or jump to <a href="#architecture-deep-dive">the architecture discussion</a>.</p>
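As a closing illustration, the timeout guardrail described in the architecture deep dive might look like the sketch below. The function names and fallback format are assumptions; the point is only the shape of the mechanism: the coordinator's LLM call gets a hard deadline, and on timeout the review degrades to the specialists' raw findings instead of blocking CI.

```python
# Sketch of the "coordinator timeout" guardrail, with hypothetical names.
# `coordinate` stands in for the coordinator's LLM call; `raw_findings` is a
# list of (agent, message) pairs from the specialist agents.
import concurrent.futures

def review_with_fallback(coordinate, raw_findings, timeout_s=30.0):
    """Try the coordinator; on timeout, degrade to a plain per-agent summary."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(coordinate, raw_findings)
        try:
            return future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            # Guardrail: never let the AI review hold up a merge request.
            return "\n".join(f"[{agent}] {msg}" for agent, msg in raw_findings)
```

The key design choice is that the degraded path still delivers every finding; only the deduplication and severity ranking are lost.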