What is context engineering? A primer for AI-assisted teams

Brandon Gubitosa

June 24, 2026

10 min read

June 24, 2026

10 min read

Context engineering vs prompt engineering: what actually changed
Why agents broke the prompt-engineering model
What a coding agent needs in its context
How context engineering changes at the team level
- Context lives in version-controlled files
- Stale context is a governance failure, not a typo
What goes wrong when context runs thin or stale
Why deep context underwrites generation and verification together
Where the bottleneck actually lands
- Teams that scaled review first
- What context-rich review changes
What deep, shared context actually buys a team

Back to guides

Cut code review time & bugs by 50%

Most installed AI app on GitHub and GitLab

Free 14-day trial

Get Started

CR_Flexibility.

Frequently asked questions

What is context engineering in simple terms?

Context engineering is the discipline of getting the right information, tools, and structure to an AI model at the right time. For coding agents, it means supplying codebase knowledge, dependencies, team conventions, and prior decisions alongside a well-worded prompt.

How is context engineering different from prompt engineering?

Prompt engineering is about wording an instruction. Context engineering is about what information the model can access when it acts on that instruction. Prompt engineering encodes knowledge at write time. Context engineering retrieves it at runtime from the codebase, tickets, and tool outputs.

Why does context engineering matter more for teams than individuals?

For a solo developer, context can live in one person's head and one prompt. For a team, it has to be encoded in shared, version-controlled files like AGENTS.md so every agent and contributor inherits the same conventions. Session amnesia and context staleness dominate at team scale.

How does context engineering improve AI code review?

The same repository context that helps an agent write correct code helps a reviewer judge whether it's correct. Research shows reviewers with deep PR context need fewer resources to understand a PR and give constructive feedback, while reviewers without it default to surface-level nits. AI review that only sees the diff misses cross-file impact. Review with codebase, linked-ticket, prior-PR, and team-decision context can reason about the full picture.

What happens when AI agents work with thin or stale context?

They keep operating from the wrong project memory. That can mean a hallucinated package, a library choice the team already rejected, or code that passes local checks but violates the system's design constraints.

Catch the latest, right in your inbox.

Add us your feed.

Catch the latest, right in your inbox.

Add us your feed.

Keep reading

Collaborative AI: Repo rules, tickets, and review history for the agentic SDLC

Collaborative AI keeps humans and agents working from shared repo rules, tickets, and review history so teams can trust and build on AI-generated code.

Code context: The evidence behind trustworthy AI code review

Code context is the evidence an AI reviewer sees beyond the diff. Here's why deep context, not a bigger window, makes AI code review trustworthy.

Adopt agentic engineering without losing your review loop

Agentic engineering typically breaks in the review queue. In this piece we go over risk-tier reviews, adding an independent first pass, and tracking the metrics that hold.

Get
Started in
2 clicks.

No credit card needed

Install in VS Code

Context engineering is the work of getting the right information and structure to an AI agent. At team level, it means shared conventions, codebase knowledge, prior pull request (PR) history, and decisions every agent should inherit.

Most explainers stop at a solo developer tuning one prompt. For teams shipping production code, the discipline is larger, because thin or stale context causes hallucinated changes, convention drift, escaped defects, and review queues that stop moving.

Prompt wording still matters, but team-scale AI development now depends on how agents retrieve, retain, and apply project context. Get it wrong and agents generate plausible code that breaks in ways reviewers can't catch fast enough. Get it right and that same context becomes what a reviewer needs to judge the result.

Context engineering vs prompt engineering: what actually changed

Prompt engineering optimizes wording. Context engineering adds the retrieval, memory, tools, and structure that determine what the model can use when it acts.

Tobi Lutke's tweet explains "context engineering" as providing task context for LLMs to solve.

In a June 19, 2025 post, Shopify CEO Tobi Lütke defined it as "the art of providing all the context for the task to be plausibly solvable by the LLM."

Anthropic defines prompt engineering as "methods for writing and organizing LLM instructions." It describes context engineering as "the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference, including all the other information that may land there outside of the prompts." The difference is where the knowledge enters the work. Prompt engineering encodes it at write time. Context engineering retrieves it at runtime from vector databases, APIs, memory stores, and tool outputs.

The center of gravity has shifted from wording to wiring.

Why agents broke the prompt-engineering model

Agents made context harder because every step creates more context to curate. LLM apps moved from single-turn chat to multi-step agentic workflows, where agents accumulate tool outputs, retrieved files, intermediate reasoning, and prior responses. A typical agent task can involve many tool calls, each adding context that can degrade performance.

Bigger context windows don't fix this, and can make it worse. More tokens bury the evidence a model needs, especially when it moves around inside the prompt. Research on long input contexts found performance degrades when relevant information shifts position. A 2025 analysis found accuracy drops when relevant tokens sit inside longer contexts, even with irrelevant ones masked out.

ChromaDB's research on "context rot" found the same pattern: output grows less reliable as input length grows. You still choose what goes in, order it, and compress the rest. So what counts as the right context for a coding agent?

What a coding agent needs in its context

An agent's context is more than the prompt. It runs from the files it can read to the operating model around them, and that mix decides whether the agent acts safely or just plausibly.

LangChain organizes context engineering around four operations: write, select, compress, and isolate. For coding work, the raw ingredients are codebase context, git history, dependencies, tool definitions, team standards, and retrieved documentation. The labels matter less than the point underneath them. Every one of these is something a reviewer also needs to judge the result, so thinning the agent's context thins the reviewer's at the same time.

Anthropic identifies bloated tool sets as a common failure mode: "If a human engineer can't definitively say which tool should be used in a given situation, an AI agent can't be expected to do better." Research on tool selection found that showing a model the right tools, rather than all of them, improves accuracy. More context is not automatically better context.

How context engineering changes at the team level

On a team, context is shared infrastructure. It has to live somewhere every agent and contributor can inherit it.

Context lives in version-controlled files

One common pattern is version-controlled rules files. AGENTS.md, CLAUDE.md, .cursorrules, and SKILL.md sit in the repo and give persistent, project-specific guidance: build commands, coding conventions, testing rules, and constraints the agent can't infer from code alone. A global file carries personal defaults, a repo-level file carries team standards, and subdirectory files carry overrides.

Martin Fowler notes that encoding team standards is not new, since teams already do it with linting rules, CI pipelines, and infrastructure-as-code. What changes with AI is the scope of what can be encoded. Linting catches syntax. Executable team standards can encode architectural judgment and review rigor that used to transfer only through pairing, mentorship, and shared experience. Research on AGENTS.md efficiency found these files reduce the need for agents to infer project structure through exploratory navigation.

Stale context is a governance failure, not a typo

At team scale, two failures dominate: session amnesia and staleness. Without a checked-in context file, every session starts from scratch. And when the file falls behind the code, the agent keeps building on rules that no longer hold. The team moves to Vitest but the file still says run jest. Service boundaries shift but the file describes the old layout. One write-up puts it plainly: the agent keeps generating code from outdated rules, and every developer pays the correction cost without seeing the cause.

What goes wrong when context runs thin or stale

The damage shows up in the metrics engineering leaders already track. Across teams, thin or stale context tends to fail in the same four ways.

Diagram illustrates four failure modes of thin context: hallucination, convention drift, escaped defects, and review overload.

Hallucinated dependencies. A USENIX 2025 study found that coding models hallucinate packages at notable rates, with open-source models worse than commercial ones. A paper on "slopsquatting" describes the attack that follows: an adversary registers a malicious package under one of those hallucinated names.

Convention drift. An agent won't know your conventions, like which libraries you standardize on, or that a shared utility already exists for what it's about to rebuild. Martin Fowler describes the distributed cost, where AI-generated code drifts from team conventions when one developer prompts and aligns when another does. CodeRabbit's AI vs human report, based on 470 PRs, puts a number on it: AI-co-authored PRs had 3.15x more readability issues than human-only PRs.

Escaped defects. GitClear's analysis of 211 million lines of code changes found that newly added code was revised within two weeks more often in 2024 than in 2020. More code is landing, and more of it gets reworked soon after.

The security picture is sharper. Reporting on Apiiro's analysis across Fortune 50 repositories, the Cloud Security Alliance found that AI-assisted developers committed code at three to four times the rate of non-AI peers. Over six months, monthly security findings rose roughly tenfold. Syntax errors dropped 76% and logic bugs fell 60%. But privilege escalation paths rose 322% and architectural design flaws rose 153%. Local correctness improved while system-level risk got worse.

The same report found a review asymmetry. AI-co-authored PRs produced about 1.7x more issues overall, and logic and correctness issues, including business logic mistakes, were 75% more common in AI PRs. AI keeps improving at local, pattern-matchable work while broad, context-dependent work still needs a deeper read.

Why deep context underwrites generation and verification together

The same knowledge base that helps an agent write code is what helps a reviewer judge whether it fits the system around it.

On the generation side, moving a model from in-file-only context to cross-file retrieval can substantially improve exact-match accuracy.

On the verification side, a code-review comprehension study found that reviewers more familiar with a PR's context need fewer resources to understand it. Reviewers without that context fall back on surface-level feedback.

Early AI review systems failed for the same reason: they worked at restrictive method-level granularity and skipped the cross-file context that makes a judgment trustworthy.

Generation and review fail the same way. An agent that missed an edge case while writing code is unlikely to catch it while reviewing, because it's working from the model it built during generation. It's the argument for a verification layer that sits apart from the agent that wrote the code and works from deeper context.

Where the bottleneck actually lands

When code creation accelerates, review becomes the constraint. GitHub's Octoverse 2025 counted 518.7 million merged pull requests in 2025, up 29% year over year. Code generation is outpacing human attention, and the people doing the reviewing didn't grow 29%.

DORA's 2025 report is blunt about it: "AI is an amplifier." The speed gains, it warns, often get swallowed downstream by bottlenecks in testing, security review, and deployment. Human review of AI-generated code can swamp the best reviewers unless teams improve the review experience.

The tail is what hurts. The AI vs human report found a 2.11x gap at the 90th percentile between AI PRs and human-only PRs. A handful of high-issue outlier PRs eat far more reviewer time than the median, and those are what back up the queue.

Teams that scaled review first

Taskrabbit fixed review before adopting coding agents, cutting average time to merge by 25%, from 10 days to 7. After AI-coding agents drove up PR volume, freee saved 32.8 weeks of reviewer time over six months.

Under heavy autonomous-development load, Abnormal AI reached a 65%+ acceptance rate on critical-severity review comments as its agent output scaled. The lesson repeats: generation gains only matter if verification scales with them.

What context-rich review changes

Context-rich AI code review attacks the bottleneck from the verification side. Diff-only review produces surface-level nits because it can't see past the diff. CodeRabbit reviews in the PR, IDE, and CLI. Its context engine indexes your codebase, linked tickets, prior PRs, and team decisions. Feedback reflects the system around the change, not just the lines that moved. It also reads existing .cursorrules and .copilot-instructions, and CodeRabbit Learnings turn reviewer feedback in review conversations into learnings that improve future reviews.

CodeRabbit provides first-pass review before a human opens the diff, while the developer remains responsible for shipping.

What deep, shared context actually buys a team

Context engineering is the foundation under planning, review, and trust across the agentic lifecycle. The real work is deciding what context to include, where it lives, how it persists across sessions and agents, and how it stays current as the code changes. When that context is deep and shared, agents write better code and reviewers, human and AI, can judge it. When it's thin or stale, broad system risks survive local checks.

For CodeRabbit, the context engine is the operating layer for verification. It builds cross-file context through code graph analysis, carries your team's standards into review, and adds automated linters. It also turns reviewer feedback into learnings that sharpen the next pass. All of it points at the review half of the lifecycle, where the dangerous defects survive.

But the tooling is downstream of the decision. What protects your team is treating context as infrastructure: deciding what your agents and reviewers can see, and keeping it current. Get that right, and review stops being the thing that breaks when everything else speeds up.

Catch more issues before they reach production. Start your free 14-day trial.