

Brandon Gubitosa
June 24, 2026
10 min read
June 24, 2026
10 min read

Cut code review time & bugs by 50%
Most installed AI app on GitHub and GitLab
Free 14-day trial
Context engineering is the work of getting the right information and structure to an AI agent. At team level, it means shared conventions, codebase knowledge, prior pull request (PR) history, and decisions every agent should inherit.
Most explainers stop at a solo developer tuning one prompt. For teams shipping production code, the discipline is larger, because thin or stale context causes hallucinated changes, convention drift, escaped defects, and review queues that stop moving.
Prompt wording still matters, but team-scale AI development now depends on how agents retrieve, retain, and apply project context. Get it wrong and agents generate plausible code that breaks in ways reviewers can't catch fast enough. Get it right and that same context becomes what a reviewer needs to judge the result.
Prompt engineering optimizes wording. Context engineering adds the retrieval, memory, tools, and structure that determine what the model can use when it acts.

In a June 19, 2025 post, Shopify CEO Tobi Lütke defined it as "the art of providing all the context for the task to be plausibly solvable by the LLM."
Anthropic defines prompt engineering as "methods for writing and organizing LLM instructions." It describes context engineering as "the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference, including all the other information that may land there outside of the prompts." The difference is where the knowledge enters the work. Prompt engineering encodes it at write time. Context engineering retrieves it at runtime from vector databases, APIs, memory stores, and tool outputs.
The center of gravity has shifted from wording to wiring.
Agents made context harder because every step creates more context to curate. LLM apps moved from single-turn chat to multi-step agentic workflows, where agents accumulate tool outputs, retrieved files, intermediate reasoning, and prior responses. A typical agent task can involve many tool calls, each adding context that can degrade performance.
Bigger context windows don't fix this, and can make it worse. More tokens bury the evidence a model needs, especially when it moves around inside the prompt. Research on long input contexts found performance degrades when relevant information shifts position. A 2025 analysis found accuracy drops when relevant tokens sit inside longer contexts, even with irrelevant ones masked out.
ChromaDB's research on "context rot" found the same pattern: output grows less reliable as input length grows. You still choose what goes in, order it, and compress the rest. So what counts as the right context for a coding agent?
An agent's context is more than the prompt. It runs from the files it can read to the operating model around them, and that mix decides whether the agent acts safely or just plausibly.
LangChain organizes context engineering around four operations: write, select, compress, and isolate. For coding work, the raw ingredients are codebase context, git history, dependencies, tool definitions, team standards, and retrieved documentation. The labels matter less than the point underneath them. Every one of these is something a reviewer also needs to judge the result, so thinning the agent's context thins the reviewer's at the same time.
Anthropic identifies bloated tool sets as a common failure mode: "If a human engineer can't definitively say which tool should be used in a given situation, an AI agent can't be expected to do better." Research on tool selection found that showing a model the right tools, rather than all of them, improves accuracy. More context is not automatically better context.
On a team, context is shared infrastructure. It has to live somewhere every agent and contributor can inherit it.
One common pattern is version-controlled rules files. AGENTS.md, CLAUDE.md, .cursorrules, and SKILL.md sit in the repo and give persistent, project-specific guidance: build commands, coding conventions, testing rules, and constraints the agent can't infer from code alone. A global file carries personal defaults, a repo-level file carries team standards, and subdirectory files carry overrides.
Martin Fowler notes that encoding team standards is not new, since teams already do it with linting rules, CI pipelines, and infrastructure-as-code. What changes with AI is the scope of what can be encoded. Linting catches syntax. Executable team standards can encode architectural judgment and review rigor that used to transfer only through pairing, mentorship, and shared experience. Research on AGENTS.md efficiency found these files reduce the need for agents to infer project structure through exploratory navigation.
At team scale, two failures dominate: session amnesia and staleness. Without a checked-in context file, every session starts from scratch. And when the file falls behind the code, the agent keeps building on rules that no longer hold. The team moves to Vitest but the file still says run jest. Service boundaries shift but the file describes the old layout. One write-up puts it plainly: the agent keeps generating code from outdated rules, and every developer pays the correction cost without seeing the cause.
The damage shows up in the metrics engineering leaders already track. Across teams, thin or stale context tends to fail in the same four ways.

Hallucinated dependencies. A USENIX 2025 study found that coding models hallucinate packages at notable rates, with open-source models worse than commercial ones. A paper on "slopsquatting" describes the attack that follows: an adversary registers a malicious package under one of those hallucinated names.
Convention drift. An agent won't know your conventions, like which libraries you standardize on, or that a shared utility already exists for what it's about to rebuild. Martin Fowler describes the distributed cost, where AI-generated code drifts from team conventions when one developer prompts and aligns when another does. CodeRabbit's AI vs human report, based on 470 PRs, puts a number on it: AI-co-authored PRs had 3.15x more readability issues than human-only PRs.
Escaped defects. GitClear's analysis of 211 million lines of code changes found that newly added code was revised within two weeks more often in 2024 than in 2020. More code is landing, and more of it gets reworked soon after.
The security picture is sharper. Reporting on Apiiro's analysis across Fortune 50 repositories, the Cloud Security Alliance found that AI-assisted developers committed code at three to four times the rate of non-AI peers. Over six months, monthly security findings rose roughly tenfold. Syntax errors dropped 76% and logic bugs fell 60%. But privilege escalation paths rose 322% and architectural design flaws rose 153%. Local correctness improved while system-level risk got worse.
The same report found a review asymmetry. AI-co-authored PRs produced about 1.7x more issues overall, and logic and correctness issues, including business logic mistakes, were 75% more common in AI PRs. AI keeps improving at local, pattern-matchable work while broad, context-dependent work still needs a deeper read.
The same knowledge base that helps an agent write code is what helps a reviewer judge whether it fits the system around it.
On the generation side, moving a model from in-file-only context to cross-file retrieval can substantially improve exact-match accuracy.
On the verification side, a code-review comprehension study found that reviewers more familiar with a PR's context need fewer resources to understand it. Reviewers without that context fall back on surface-level feedback.
Early AI review systems failed for the same reason: they worked at restrictive method-level granularity and skipped the cross-file context that makes a judgment trustworthy.
Generation and review fail the same way. An agent that missed an edge case while writing code is unlikely to catch it while reviewing, because it's working from the model it built during generation. It's the argument for a verification layer that sits apart from the agent that wrote the code and works from deeper context.
When code creation accelerates, review becomes the constraint. GitHub's Octoverse 2025 counted 518.7 million merged pull requests in 2025, up 29% year over year. Code generation is outpacing human attention, and the people doing the reviewing didn't grow 29%.
DORA's 2025 report is blunt about it: "AI is an amplifier." The speed gains, it warns, often get swallowed downstream by bottlenecks in testing, security review, and deployment. Human review of AI-generated code can swamp the best reviewers unless teams improve the review experience.
The tail is what hurts. The AI vs human report found a 2.11x gap at the 90th percentile between AI PRs and human-only PRs. A handful of high-issue outlier PRs eat far more reviewer time than the median, and those are what back up the queue.
Taskrabbit fixed review before adopting coding agents, cutting average time to merge by 25%, from 10 days to 7. After AI-coding agents drove up PR volume, freee saved 32.8 weeks of reviewer time over six months.
Under heavy autonomous-development load, Abnormal AI reached a 65%+ acceptance rate on critical-severity review comments as its agent output scaled. The lesson repeats: generation gains only matter if verification scales with them.
Context-rich AI code review attacks the bottleneck from the verification side. Diff-only review produces surface-level nits because it can't see past the diff. CodeRabbit reviews in the PR, IDE, and CLI. Its context engine indexes your codebase, linked tickets, prior PRs, and team decisions. Feedback reflects the system around the change, not just the lines that moved. It also reads existing .cursorrules and .copilot-instructions, and CodeRabbit Learnings turn reviewer feedback in review conversations into learnings that improve future reviews.
CodeRabbit provides first-pass review before a human opens the diff, while the developer remains responsible for shipping.
Context engineering is the foundation under planning, review, and trust across the agentic lifecycle. The real work is deciding what context to include, where it lives, how it persists across sessions and agents, and how it stays current as the code changes. When that context is deep and shared, agents write better code and reviewers, human and AI, can judge it. When it's thin or stale, broad system risks survive local checks.
For CodeRabbit, the context engine is the operating layer for verification. It builds cross-file context through code graph analysis, carries your team's standards into review, and adds automated linters. It also turns reviewer feedback into learnings that sharpen the next pass. All of it points at the review half of the lifecycle, where the dangerous defects survive.
But the tooling is downstream of the decision. What protects your team is treating context as infrastructure: deciding what your agents and reviewers can see, and keeping it current. Get that right, and review stops being the thing that breaks when everything else speeds up.
Catch more issues before they reach production. Start your free 14-day trial.