
Brandon Gubitosa
June 23, 2026
10 min read
Cheap code generation broke the math your team runs on. Once agents write more diffs than reviewers can absorb, verification controls throughput. Addy Osmani, who led developer experience for Google Chrome, calls this the code review bottleneck: "The bottleneck moved from writing code to proving it works."
Agentic engineering means directing AI agents to write and ship code at scale, with engineers steering intent while the agents produce the diffs. Adoption advice often focuses on productivity gains, orchestration patterns, and individual-developer "level up" ladders. Team adoption fails later, in the gap between generated-code volume and review capacity. Scaling agentic engineering without turning review into theater starts with risk-based routing, an independent first-pass reviewer on every diff, and metrics that track review capacity and escaped defects.
Cheap generation breaks a review assumption that used to keep teams honest. The speed asymmetry is precise: "When code was expensive to produce, senior engineers could review faster than junior engineers could write. AI flips this: a junior engineer can now generate code faster than a senior engineer can critically audit it. The rate-limiting factor that kept review meaningful has been removed."
Reported AI-assisted workflows show the same review pressure: pull request volume rises while review time also climbs. GitHub recorded higher PR volume in 2025, with 518.7 million pull requests merged in public repositories across the year, up 29% year over year.
Osmani calls the cost comprehension debt: "the growing gap between how much code exists in your system and how much of it any human being genuinely understands." Comprehension debt stays invisible on a velocity chart until it becomes a 2 AM incident nobody on call can debug.
When generated diffs arrive faster than reviewers can read them, the queue fails in familiar ways. Batches get larger, approvals turn into signatures, and the few people who understand the system deeply carry more cognitive load.
A large engineering org hit the same review overload pattern. Code volume rose, PRs grew very large, and review latency climbed quarter over quarter. Then the telling signal appeared: "review time for the largest pull requests began to plateau, or even decline. This indicated that reviewers were no longer meaningfully engaging with changes." The queue got rubber-stamped instead of breaking loudly.
Large PRs are one place it starts. Review usefulness tends to fall as larger reviews touch more files, and AI-generated code can arrive in those chunks. A developer prompts for a feature, gets hundreds of lines, skims it, and opens the PR. The reviewer inherits a diff that is too big to read carefully and too plausible-looking to question.
Plausible-looking AI code can compile, pass the linter, often pass tests, and still carry latent logic errors. Reviewers have to walk through whether the code matches the intent, because verifying plausible correctness gets harder when the output is fluent and confident.
The State of AI report found the same pattern in 470 open-source PRs. AI-coauthored PRs averaged 10.83 issues per PR versus 6.45 for human-only PRs. At the 90th percentile, AI-coauthored PRs had 26 issues versus 12.3 for human-only PRs, and logic and correctness findings were 75% more common in AI PRs.
Verification debt is the backlog of unreviewed, unvalidated AI-generated code that piles up faster than your team can burn it down. Left alone, it becomes the new tech debt, except it is sitting in production instead of a backlog.
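To make the backlog arithmetic concrete, here is a minimal sketch. It assumes a constant weekly number of generated PRs and a constant weekly review capacity; the function name and the figures are illustrative, not data from any team mentioned here.

```python
def verification_debt(generated_per_week: int, reviewed_per_week: int, weeks: int) -> list[int]:
    """Return the unreviewed-PR backlog at the end of each week."""
    backlog = 0
    history = []
    for _ in range(weeks):
        # The backlog grows by the gap between arrivals and review capacity
        # and never drops below zero.
        backlog = max(0, backlog + generated_per_week - reviewed_per_week)
        history.append(backlog)
    return history


# Illustrative numbers only: 60 generated PRs per week against capacity for 45.
print(verification_debt(generated_per_week=60, reviewed_per_week=45, weeks=4))
# -> [15, 30, 45, 60]
```

Under these assumptions the gap compounds linearly: a shortfall of 15 reviews per week becomes 60 unreviewed PRs within a month.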

freee's engineering team hit this wall as AI-coding agents produced more PRs than humans could absorb. The first win was taking low-cost review work off human reviewers before the PR reached them. After adopting CodeRabbit, the team saved 32.8 weeks of reviewer time over six months.
That win was a team-level outcome, and agentic maturity shows up in team review artifacts. Individual adoption ladders can reward comfort with delegation, but team readiness depends on what happens after engineers delegate code generation. When each engineer stops reading diffs, multiple under-reviewed pipelines feed one overwhelmed queue.
Microsoft's adoption maturity model frames agentic maturity as organizational capability. Treat readiness the same way. Before scaling, teams need shared review workflows, validation standards tied to PR size and risk, security checks, and linked-ticket validation.
The team-level shift is structural. Humans move from producing code to owning intent, architecture, and accountability while agents execute. Research on agentic software engineering describes the engineer's role shifting from code producer to orchestrator of sociotechnical systems, with humans responsible for intent specification, architectural coherence, and outcome accountability.
Orchestration only works if the feedback loop is fast. A review loop that takes a day turns fast agents into fast mistakes.
Risk-based routing gives reviewers fewer, better-targeted diffs. You can't fix this by asking reviewers to read harder. Google's own codebase health standard warns against review friction that blocks useful changes: "If a reviewer makes it very difficult for any change to go in, then developers are disincentivized to make improvements." Routine changes should move through faster, while risky changes should reach humans with the context to judge them.
Meta built a large-scale version of this. Their Diff Risk Score system, shipped in 2025, preserves rigor by routing attention toward diffs more likely to introduce risk.
Routine low-risk diffs proceed with minimal human involvement; higher-risk diffs route to reviewers with clear ownership. The same logic works in a simpler team workflow. Peer review can cover functionality and edge cases, while senior review should handle architecture, core infrastructure, risky paths, and new patterns.
Osmani gives the rule for AI security review: "If code touches auth, payments, secrets, or untrusted input, treat AI as a high-speed intern and require a human threat model review plus a security tool pass before merge." Tiering spends scarce senior-reviewer attention where it changes the outcome, leaving unused-import nits to machines upstream.
Keep humans on architecture decisions, security policy, regulated-release sign-off, roadmap priority calls, and accountability for autonomous agent actions. The developer still owns the merge.
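The tiering described above can be sketched as a small routing function. This is a simplified illustration, not Meta's Diff Risk Score or any vendor's implementation: the path patterns, tier names, and the `Diff` type are all hypothetical, and matching on file paths alone is a coarse stand-in for a real risk model.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase

# Hypothetical patterns; adjust to your repository layout.
SECURITY_PATTERNS = ["*auth*", "*payment*", "*secret*"]
SENIOR_PATTERNS = ["infra/*", "core/*"]


@dataclass
class Diff:
    files: list[str]


def matches_any(files: list[str], patterns: list[str]) -> bool:
    """True if any changed file matches any glob pattern."""
    return any(fnmatchcase(path, pattern) for path in files for pattern in patterns)


def route(diff: Diff) -> str:
    """Return the review tier for a diff: 'security', 'senior', or 'peer'."""
    if matches_any(diff.files, SECURITY_PATTERNS):
        # Human threat-model review plus a security tool pass before merge.
        return "security"
    if matches_any(diff.files, SENIOR_PATTERNS):
        # Core infrastructure goes to a senior reviewer.
        return "senior"
    # Everything else gets peer review for functionality and edge cases.
    return "peer"


print(route(Diff(files=["src/auth/login.py"])))  # -> security
print(route(Diff(files=["infra/deploy.tf"])))    # -> senior
print(route(Diff(files=["docs/README.md"])))     # -> peer
```

Substring-style globs over-match (for example, `*auth*` also catches `docs/authors.md`), which errs toward more human review. A team would likely tune the patterns against its own incident history.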
Routing still needs independence. The AI that wrote the code carries one interpretation of the codebase into its own review, and it will miss some blind spots baked into that interpretation. A 2025 self-correction blind spot study reported that models can miss errors in their own output that they would catch in someone else's.
Every change needs an independent reviewer before the human reviewer arrives. The first pass should read the diff with fresh assumptions, handle the cheap layer, and hand your senior engineer a cleaner PR than the one the agent opened. CodeRabbit can run that independent first pass before human review.
Merge judgment stays with the developer and the human reviewer. CodeRabbit clears the cheap layer, surfaces context-rich issues early, and keeps human review focused on the decisions only a human should own. Codegraph gives the first pass repo context, and MCP connections and linked issues bring outside and ticket context into the review.
Path & AST-based instructions apply team rules to the files that need them. More than 40 linters and static application security testing (SAST) tools catch structural and security issues before a human reviewer opens the diff. PR feedback becomes CodeRabbit Learnings, which improve repo-specific guidance over time, so your senior engineer opens a cleaner diff and spends review time on the calls only a person should make.
Velocity alone will lie to you. It looks great until the quality cost lands two quarters later. So what should leaders watch instead? The DevOps Research and Assessment (DORA) program's 2024 data associated a 25% increase in AI adoption with a 7.2% decrease in delivery stability. DORA calls the audit work the verification tax: "Time saved writing is often re-spent auditing."
Leaders should watch whether the review loop is holding by tracking review latency, defect escape rate, rubber-stamp coverage, and reviewer time.
Review latency tracks whether cycle time trends down or stays flat as PR volume grows. If review latency climbs alongside volume, the queue is filling faster than it drains. Jellyfish treats cycle time as a diagnostic metric beneath lead time for changes, surfacing before deployment frequency tanks.
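One way to watch latency against volume is to bucket PRs by week and report both numbers side by side. The sketch below assumes you can export an (opened, first review) timestamp pair per PR; the function name and sample data are illustrative.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def weekly_review_latency(prs):
    """Map ISO (year, week) -> (PR count, median hours to first review).

    `prs` is an iterable of (opened_at, first_review_at) datetime pairs.
    """
    buckets = defaultdict(list)
    for opened_at, first_review_at in prs:
        hours = (first_review_at - opened_at).total_seconds() / 3600
        iso = opened_at.isocalendar()
        buckets[(iso[0], iso[1])].append(hours)
    return {week: (len(hours), median(hours)) for week, hours in sorted(buckets.items())}


# Illustrative data: two PRs in one week, one in the next.
sample = [
    (datetime(2025, 1, 6, 9), datetime(2025, 1, 6, 13)),   # 4 hours
    (datetime(2025, 1, 7, 9), datetime(2025, 1, 7, 15)),   # 6 hours
    (datetime(2025, 1, 13, 9), datetime(2025, 1, 14, 9)),  # 24 hours
]
print(weekly_review_latency(sample))
# -> {(2025, 2): (2, 5.0), (2025, 3): (1, 24.0)}
```

If the count and the median rise together week over week, the queue is filling faster than it drains.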
Defect escape rate measures the percentage of bugs that reach production instead of getting caught in review. Tag escapes as "should have been caught" versus "genuinely subtle," because that split is where the signal lives. A rising should-have-been-caught rate is a strong fingerprint of rubber-stamping.
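As a rough sketch of that split, assume each escaped bug is tagged by hand after triage. The tag strings and function name here are hypothetical conventions, not a standard.

```python
def escape_rates(caught_in_review: int, escaped_tags: list[str]) -> dict[str, float]:
    """Compute the defect escape rate and the share of escapes that were obvious.

    `escaped_tags` holds one tag per production bug: either
    "should_have_been_caught" or "genuinely_subtle" (assumed labels).
    """
    escaped = len(escaped_tags)
    total = caught_in_review + escaped
    obvious = sum(1 for tag in escaped_tags if tag == "should_have_been_caught")
    return {
        # Bugs that reached production as a fraction of all bugs found.
        "escape_rate": escaped / total if total else 0.0,
        # Fraction of escapes a careful review would have stopped.
        "should_have_been_caught_share": obvious / escaped if escaped else 0.0,
    }


tags = ["should_have_been_caught"] * 3 + ["genuinely_subtle"] * 2
print(escape_rates(caught_in_review=45, escaped_tags=tags))
# -> {'escape_rate': 0.1, 'should_have_been_caught_share': 0.6}
```

Tracking the second number over time matters more than its absolute value: a flat escape rate can hide a growing share of bugs that a real read would have caught.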
Rubber-stamp coverage identifies how much of the queue gets a real read versus a pass-through. PR size is the leading indicator here. Jellyfish also lists pull request size as a diagnostic metric, and when approvals on diffs of more than 500 lines come back in minutes, the queue is producing signatures instead of reviews. DX's guidance applies here: track these at the team level, and never tie throughput metrics to individual performance, or you will incentivize review theater.
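A team-level check for that signal might look like the following sketch. The 500-line threshold comes from the paragraph above; the 10-minute cutoff and the record fields are assumptions to tune against your own data.

```python
def rubber_stamp_share(reviews, min_lines=500, max_minutes=10):
    """Share of large-diff approvals that came back faster than `max_minutes`.

    `reviews` is a list of dicts with "lines_changed" and
    "minutes_to_approve" keys (hypothetical field names).
    """
    large = [r for r in reviews if r["lines_changed"] > min_lines]
    if not large:
        return 0.0
    fast = [r for r in large if r["minutes_to_approve"] < max_minutes]
    return len(fast) / len(large)


# Illustrative approvals, aggregated for a whole team rather than per person.
team_reviews = [
    {"lines_changed": 1200, "minutes_to_approve": 4},
    {"lines_changed": 800, "minutes_to_approve": 95},
    {"lines_changed": 40, "minutes_to_approve": 3},
    {"lines_changed": 650, "minutes_to_approve": 6},
]
print(f"{rubber_stamp_share(team_reviews):.0%}")  # -> 67%
```

Small diffs are excluded on purpose, since a quick approval on a 40-line change is often a legitimate read.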
Reviewer time measures the thing the bottleneck actually consumes. A review dashboard can track queue depth, review latency, team-level performance, and review-quality metrics. The adoption signal is whether those numbers stayed under control as generated-code volume rose.
Start by fixing review. Teams that adopt coding agents first and bolt review on later generate verification debt faster than they can pay it down. Harden the review loop so it can absorb volume, then turn up the agents on top of it.
Taskrabbit's engineering team made that sequence explicit by fixing the review bottleneck before adopting AI coding agents, reducing time to merge by 25%, from 10 days to 7.
Agentic engineering raises the stakes on the review loop. The teams that scale agents without shipping more bugs treat verification as a structural constraint, route human judgment to the changes that carry risk, and put an independent reviewer on every diff before a human opens it. CodeRabbit provides that independent reviewer while merge accountability stays with the developer. Every line still earns its merge while your team moves at agent speed.