The practical guide to agentic context engineering

Context engineering determines whether your AI code review agent catches the bug or lets it ship. It is the work of selecting the code, tickets, conventions, and prior decisions the model sees before it answers. For teams running agentic workflows, review quality depends on whether the agent can see what a senior engineer would look at.

Agentic context engineering is the practice of assembling that information for an autonomous agent, not a single prompt. In review workflows, the work shifts from writing better instructions to assembling the right inputs. As Philipp Schmid put it: "Agent failures aren't only model failures; they are context failures." So when your AI reviewer misses a race condition or flags a false positive, check the context it received before blaming the model.

A good review needs more context than the diff

An AI agent reviewing from the diff alone sees a fraction of what a human reviewer carries. The diff shows what changed. It leaves out why, what else the code touches, what constraints apply, and what the team's conventions require. Reviewing from it alone is like judging a surgery from the stitches, so a trustworthy review needs more context than the diff.

Even the best current models still miss a large share of issues human reviewers catch, according to the SWE-PRBench benchmark study.

To review like a human, an agent needs four inputs the diff doesn't carry:

The code's structure never appears in a diff. Function boundaries and control flow live in the abstract syntax tree (AST), the parsed structure of the code. The Ericsson experience report describes pulling out the enclosing method of the changed lines and passing it to the reviewer as structure.
The call graph tells the agent what depends on the change. A diff shows a function changed, not every caller across the codebase.
The team's conventions aren't in the model's training data. A Chalmers University study notes that when an LLM can't see the architecture beyond the changed file, its suggestions turn inaccurate, and real design problems slip through. Across CodeRabbit's State of AI report dataset, readability issues were 3.15x more common in AI PRs, which is the pattern you would expect when the team's conventions are missing from context.
The ticket carries the intent the code is judged against. O'Reilly's analysis makes the point: without the requirements, an AI reviewer can tell whether the code is well built, but not whether it does what it should.
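The first two inputs can be pulled from the code itself. Here is a minimal sketch, assuming a Python codebase and using only the standard library `ast` module. The function names `enclosing_function` and `callers_of` and the sample file are hypothetical, and the caller lookup only sees direct calls by name within a single file; a real call graph has to span the whole repository.

```python
import ast
from typing import List, Optional

FUNCTION_NODES = (ast.FunctionDef, ast.AsyncFunctionDef)


def enclosing_function(source: str, changed_line: int) -> Optional[str]:
    """Return the source of the innermost function containing a 1-indexed line."""
    best = None
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, FUNCTION_NODES) and node.lineno <= changed_line <= node.end_lineno:
            # Nested functions both contain the line; the later start is the inner one.
            if best is None or node.lineno > best.lineno:
                best = node
    return ast.get_source_segment(source, best) if best else None


def callers_of(source: str, target: str) -> List[str]:
    """List functions in this file that call `target` directly by name."""
    callers = []
    for fn in ast.walk(ast.parse(source)):
        if not isinstance(fn, FUNCTION_NODES):
            continue
        for node in ast.walk(fn):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name) and node.func.id == target:
                callers.append(fn.name)
                break
    return callers


# Hypothetical file contents; assume the diff touched line 3.
SAMPLE = '''def total(items):
    subtotal = sum(i.price for i in items)
    return subtotal * 1.2

def checkout(cart):
    return total(cart.items)
'''

structure = enclosing_function(SAMPLE, 3)   # the whole total() function
dependents = callers_of(SAMPLE, "total")    # functions that depend on the change
```

With both in hand, the reviewer sees the full function the change sits in and knows that `checkout` depends on it, neither of which a one-line diff would show.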


At Common App, a 20-developer team working across .NET Core, Node.js, Angular, and Python cut code review time 35% with CodeRabbit and caught a race condition their prior checks had missed. Once the reviewer can see the wider codebase, subtle bugs stop hiding behind a clean-looking change.

Context collapse is a silent production failure mode

When an agent compresses away the detail that matters, it stops catching edge cases, and you won't know until the defect hits production.

The ACE paper (short for Agentic Context Engineering) describes one way context gets lost, which it calls brevity bias: the optimization process keeps shrinking instructions toward short, generic ones. The paper shows these methods churning out near-identical instructions like "Create unit tests to ensure methods behave as expected," dropping the domain-specific detail. The paper's position is that LLMs tend to do better with long, detailed context than with short, generic prompts.

Context collapse happens while the agent runs. When a system rewrites its whole context on every turn instead of adding to it, each rewrite comes out shorter and vaguer than the last, and detail from earlier turns disappears.

Spreading context across many turns hurts accuracy, as a Microsoft Research/Salesforce study found. A bigger model won't fix this. The model loses the thread as the conversation piles up.

CodeRabbit's State of AI report dataset shows error and exception-handling problems are nearly 2x more common in AI PRs, exactly the edge cases that thin context misses.

The ACE framework adds to context instead of overwriting it, recording each new change rather than re-summarizing everything. That keeps the detail summaries strip out.
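The difference between the two strategies is easy to see in a toy model. This is a sketch under stated assumptions, not the ACE implementation: `AppendOnlyContext` and `rewrite_context` are hypothetical names, and the rewrite is simulated by crude truncation to stand in for a summarizer that keeps shrinking its output.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContextEntry:
    turn: int
    note: str


@dataclass
class AppendOnlyContext:
    """Records each new note as a delta; earlier entries are never rewritten."""
    entries: List[ContextEntry] = field(default_factory=list)

    def add(self, turn: int, note: str) -> None:
        self.entries.append(ContextEntry(turn, note))

    def render(self) -> str:
        return "\n".join(f"[turn {e.turn}] {e.note}" for e in self.entries)


def rewrite_context(previous: str, note: str, max_chars: int = 60) -> str:
    """Assumed stand-in for full re-summarization: merge, then cut to a budget."""
    return (previous + " " + note).strip()[-max_chars:]


# Hypothetical notes an agent picks up over three turns.
notes = [
    "payment webhook retries must be idempotent",
    "tests live next to the module they cover",
    "public API changes need a changelog entry",
]

appended = AppendOnlyContext()
rewritten = ""
for turn, note in enumerate(notes, start=1):
    appended.add(turn, note)
    rewritten = rewrite_context(rewritten, note)
```

After three turns the append-only context still carries the idempotency rule from turn one, while the rewritten context has dropped it. A real summarizer fails less mechanically than truncation does, but the loss pattern is the same: the oldest, most specific detail goes first.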

In CodeRabbit, Learnings work on the same principle. When an engineer corrects a review comment, it becomes a learning the agent carries into future reviews.

Generation and verification need different context

Generation and verification agents need context organized for different jobs. Agentic context engineering means building each deliberately instead of reusing one for both. Treating them as interchangeable is how teams end up trusting output that was never properly checked.

Martin Fowler's documentation makes the key point: an agent gets less effective with too much context. Generation context should stay lean, focused on the intent, spec, and constraints. Verification context needs the original intent, the generated code, and the surrounding codebase.
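One way to keep the two contexts from blurring together is to build them with separate functions from the same task record. The sketch below assumes a simple `Task` shape; the class, the builder names, and the example task are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Task:
    intent: str
    spec: str
    constraints: List[str]


def build_generation_context(task: Task) -> str:
    """Lean context: intent, spec, and constraints only."""
    lines = [f"Intent: {task.intent}", f"Spec: {task.spec}", "Constraints:"]
    lines += [f"- {c}" for c in task.constraints]
    return "\n".join(lines)


def build_verification_context(task: Task, generated_code: str, surrounding_code: List[str]) -> str:
    """Wider context: original intent, the code under review, and its neighbors."""
    parts = [
        f"Intent: {task.intent}",
        f"Spec: {task.spec}",
        "Code under review:",
        generated_code,
        "Surrounding codebase:",
        *surrounding_code,
    ]
    return "\n".join(parts)


# Hypothetical task and code, for illustration only.
task = Task(
    intent="Let users export invoices as CSV",
    spec="GET /invoices/export returns text/csv",
    constraints=["No new dependencies"],
)
generated = "def export_invoices(user): ..."
neighbors = ["def export_orders(user): ...  # existing, similar logic"]

generation_context = build_generation_context(task)
verification_context = build_verification_context(task, generated, neighbors)
```

The generator never sees `export_orders`, so it can't copy it by reflex. The verifier does see it, so it can flag duplicated logic against the original intent.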

Too much codebase context can hurt generation, because the agent copies existing patterns instead of building what the spec asks for. Too little verification context means the reviewer misses cross-service issues, duplicated logic, and drift from the intended design. When one agent does both jobs, the assumptions it made writing the code carry into its review, so its blind spots go unchecked. AI PRs also carry more defects overall: 10.83 issues per AI PR versus 6.45 per human PR. Generating fast without separate verification turns that gap into a backlog of unverified work.

Teams already spend extra time checking AI output. A separate review agent takes on that checking, starting from the original intent and the finished code, not the assumptions that produced it.

How do you know if better context actually worked?

You find out from what slips past review, not from how fast you ship. DORA's (DevOps Research and Assessment) 2025 data shows that as AI adoption rises, teams ship more code and break it more often at the same time.

Faros AI argues that activity metrics like lines of code create a false sense of progress, while quality signals like escaped bugs, incidents, failed changes, and rework tell the real story.


At freee, according to CodeRabbit's case study, the bottleneck was reviewer capacity, not coding speed. The team saved 32.8 weeks of reviewer time in the last six months while handling more PRs across hundreds of repos. Measure whether you're freeing reviewer time without quality slipping. If your AI rollout only raises output, you're just going faster. If it frees reviewer time and quality holds, your verification is working.

Track four numbers: escaped defects, failed changes, review latency, and missed findings.

Defect escape rate is the share of defects that reach production instead of being caught earlier. It's the number healthy-looking activity stats most often hide.
Change failure rate (DORA) is the share of deployments that cause a failure. Still useful, but read it next to escaped defects and review quality, not on its own.
Review cycle time is the time from opening a PR to merging it, as DX's PR metrics define it. Studies disagree on whether AI shortens it, so don't assume it will.
False negative rate is the share of real issues the AI reviewer missed that later reached production. It ties most directly to how much context the agent had. The CR-Bench benchmark scores review agents on both how many of their flags are real and how many real issues they catch, since false alarms are expensive too.
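All four numbers are simple ratios or durations once you have the counts. A minimal sketch, assuming you can export those counts from your issue tracker and deployment pipeline. The function names and sample figures are hypothetical, and using the median for cycle time is a choice made here, not part of any cited definition.

```python
from datetime import datetime
from statistics import median
from typing import List, Tuple


def share(part: int, whole: int) -> float:
    """Ratio that returns 0.0 instead of dividing by zero."""
    return part / whole if whole else 0.0


def defect_escape_rate(escaped: int, caught_before_release: int) -> float:
    return share(escaped, escaped + caught_before_release)


def change_failure_rate(failed_deployments: int, total_deployments: int) -> float:
    return share(failed_deployments, total_deployments)


def false_negative_rate(missed_real_issues: int, caught_real_issues: int) -> float:
    return share(missed_real_issues, missed_real_issues + caught_real_issues)


def flag_precision(real_flags: int, false_alarms: int) -> float:
    """Share of the reviewer's flags that pointed at real issues."""
    return share(real_flags, real_flags + false_alarms)


def review_cycle_hours(prs: List[Tuple[datetime, datetime]]) -> float:
    """Median hours from PR opened to PR merged."""
    return median((merged - opened).total_seconds() / 3600 for opened, merged in prs)


# Hypothetical counts for one quarter.
escape = defect_escape_rate(escaped=4, caught_before_release=36)
failures = change_failure_rate(failed_deployments=5, total_deployments=50)
missed = false_negative_rate(missed_real_issues=10, caught_real_issues=30)
precision = flag_precision(real_flags=30, false_alarms=10)
cycle = review_cycle_hours([
    (datetime(2025, 1, 1, 9), datetime(2025, 1, 1, 15)),
    (datetime(2025, 1, 2, 9), datetime(2025, 1, 3, 9)),
])
```

Compute these before and after a context change, over the same window length, so the comparison isn't skewed by release timing.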

Watch escaped defects and false negatives, not the activity charts. When they fall as you add context, you have your answer.

Govern what enters the agent's context window

The moment an agent acts on your codebase, what it's allowed to see and do becomes a security and audit question. Traditional IAM (Identity and Access Management) assumes human users with predictable access. AI agents break that model. Their role can change mid-task, they move at machine speed across many systems, and standard logs record what happened but not why.

AI governance research warns that agents can leak secrets like API keys and credentials when context and permissions aren't well governed. Security findings are 1.57x more common in AI PRs, which is why controlling what an agent can access is part of getting the review right.

Limit what the agent can see and what it can do with it:

Filter secrets first. Scan for credentials and strip them before code reaches the agent.
Give permissions that expire. Grant access for the task, then revoke it when the task ends.
Match the agent to the developer's access. Scope repository access through single sign-on (SSO) and role-based access control (RBAC) so the agent gets the same access the developer has, never superuser rights.
Log what the agent saw. Record what entered the agent's context, where it came from, and which policy allowed it, so every decision is traceable.
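Three of these controls (filtering, expiry, and logging) fit in one gate that every piece of context passes through. The sketch below is illustrative only: the two regular expressions catch a couple of common shapes and are no substitute for a dedicated secret scanner, and `Grant`, `redact`, and `admit` are hypothetical names. Scoping access through SSO and RBAC happens in your identity provider and isn't shown.

```python
import re
from dataclasses import dataclass
from typing import Dict, List

# Illustrative patterns only; real scanners cover far more credential formats.
AWS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")
ASSIGNMENT = re.compile(r"(?i)(api[_-]?key|secret|token|password)(\s*[:=]\s*)(\S+)")


def redact(text: str) -> str:
    """Strip credential-looking values, keeping the key name for readability."""
    text = AWS_KEY_ID.sub("[REDACTED]", text)
    return ASSIGNMENT.sub(lambda m: m.group(1) + m.group(2) + "[REDACTED]", text)


@dataclass
class Grant:
    """Task-scoped access that stops working after expires_at (epoch seconds)."""
    repo: str
    expires_at: float

    def is_active(self, now: float) -> bool:
        return now < self.expires_at


def admit(text: str, source: str, policy: str, grant: Grant, now: float,
          audit_log: List[Dict]) -> str:
    """Gate for anything entering the agent's context: check, redact, record."""
    if not grant.is_active(now):
        raise PermissionError(f"grant for {grant.repo} has expired")
    clean = redact(text)
    audit_log.append({
        "source": source,
        "policy": policy,
        "repo": grant.repo,
        "at": now,
        "redacted": clean != text,
    })
    return clean


# Hypothetical usage with fixed timestamps.
log: List[Dict] = []
grant = Grant(repo="billing-service", expires_at=1000.0)
safe = admit('API_KEY = "sk-test-123"', "config/settings.py", "repo-read", grant, now=10.0, audit_log=log)
```

Passing `now` explicitly keeps the gate testable. In production you would pass the current time and write the log somewhere append-only.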

Control what the agent sees and log all of it, so every review runs on context you can account for.

Build versus buy for the context layer

Building your own context layer requires a dedicated platform team that owns it permanently, which most organizations can't staff.

The cost doesn't end at launch. Someone has to keep the system that pulls in context running, keep the codebase graph current, and update the agent's instructions as conventions change. This is context drift, a constant cost. If a team switches from Jest to Vitest but doesn't update the AI's instructions, the agent keeps writing Jest tests, and every stale instruction lowers review quality.
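Some drift can be caught mechanically. As a rough sketch, assuming a Node.js project with a `package.json` and a plain-text instructions file, you can flag tools the instructions mention that are no longer installed. The function name, the tracked tool list, and the sample inputs are hypothetical.

```python
import json
import re
from typing import List


def stale_tool_mentions(instructions: str, package_json: str, tracked_tools: List[str]) -> List[str]:
    """Return tracked tools named in the instructions but absent from the manifest."""
    manifest = json.loads(package_json)
    installed = set(manifest.get("dependencies", {})) | set(manifest.get("devDependencies", {}))
    stale = []
    for tool in tracked_tools:
        mentioned = re.search(rf"\b{re.escape(tool)}\b", instructions, re.IGNORECASE)
        if mentioned and tool not in installed:
            stale.append(tool)
    return stale


# Hypothetical inputs: the team moved to Vitest but the instructions still say Jest.
instructions = "Write unit tests with Jest and keep them next to the module."
manifest = '{"devDependencies": {"vitest": "^1.0.0"}}'

stale = stale_tool_mentions(instructions, manifest, ["jest", "vitest"])
```

A check like this only covers named tools. Drift in conventions that never show up in a manifest still needs a human owner.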

Building gets you customization, but it's a permanent engineering project. Buying gets you speed, but you depend on someone else's roadmap. For most teams the decision comes down to one question: should the context layer be a problem you own, or one you hand off?

The test for context-aware review

For code review, agentic context engineering has a concrete test: can the reviewer see the codebase graph, the team's conventions, the linked tickets, and past review decisions before it comments? CodeRabbit's context engine, reviewing over 2 million PRs per week across 3 million+ repositories, assembles that context automatically for every PR through Codegraph, accumulated Learnings, and MCP (Model Context Protocol) connections. A diff-only reviewer can point at the changed lines. A context-aware reviewer can judge whether the change belongs.
