CodeRabbit logoCodeRabbit logo

Products

Pull Request ReviewsIssue plannerIDE ReviewsCLI ReviewsOSS

Navigation

About UsFeaturesFAQSystem StatusCareersDPAStartup ProgramVulnerability Disclosure

Resources

BlogDocsChangelogCase StudiesTrust CenterBrand GuidelinesWhitepapers

Contact

SupportSalesPricingPartnerships

By signing up you agree to our Terms of Use and Privacy Policy

discord iconx iconlinkedin iconrss icon
footer-logo shape
Terms of Service Privacy Policy

CodeRabbit Inc © 2026

A very brief history of AI coding, from Copilot to next-gen agents

by David Kravets

March 18, 2026 | 9 min read

  • Copilot made AI coding feel native
  • After autocomplete came intent
  • Conversation wasn’t enough, the assistant had to see the repo
  • An agent is a model that can act
  • Benchmarks stopped asking for functions and started asking for work
  • The background agent era
  • The terminal and the editor became control planes
  • Instructions and integrations became infrastructure
  • What the history actually shows
How code models became coding assistants, how assistants became agents, and how the practice of software engineering began to reorganize around them.


The history of AI coding agents begins before anyone seriously called them agents. In 2017, the paper “Attention Is All You Need” introduced the Transformer, the architecture that made modern large language models possible.

In 2020, CodeBERT brought that foundation closer to software development by showing that natural language and programming language could be learned together in a single pretrained system for tasks like code search and documentation generation.

These were not agents in the modern sense. They did not open files, run tests, or act inside a development environment. But they established the premise that made everything else possible. Code could be modeled as language, and language models could learn useful representations of how software is written, explained, and transformed.

By 2021, that line of research had matured into practical, testable code generation. The 2021 Codex paper described a GPT model fine-tuned on publicly available code and evaluated with HumanEval. GitHub announced Copilot on June 29, 2021, and the Codex paper followed on July 7, explicitly noting that a distinct production version of Codex powered Copilot. That detail matters because it marked the bridge LLMs crossed from research artifact to mainstream developer product.

Copilot made AI coding feel native

When Copilot arrived, it did something historically more important than “write code.” It made AI feel native to the act of programming. GitHub described Copilot as an AI pair programmer that could draw context from the code around it and suggest whole lines or even entire functions inside the editor.

That sounds ordinary now, but in 2021 it was a genuine interface breakthrough. Code generation stopped living in research demos and started living on the editing surface itself, where latency, relevance, and developer trust mattered more than abstract benchmark scores.

That is why Copilot accomplished more than a traditional autocomplete tool. Its significance was not just model quality. It was the product decision to put the model directly into the workflow of writing software.

GitHub’s later research found that Copilot users completed tasks faster and reported conserving mental effort. In other words, Copilot did not merely show that a model could emit code. It showed that AI assistance could change the process of software development.

After autocomplete came intent

The next important signal came in 2022, and it did not come from an editor. DeepMind’s AlphaCode showed that harder programming problems often require something beyond elegant one-shot generation. AlphaCode generated many candidate programs, filtered them aggressively, and leaned on program behavior rather than surface fluency alone. In competitive programming, it reached roughly the level of the median competitor.

Historically, AlphaCode mattered because it previewed a principle that later coding agents would rely on constantly. Difficult software tasks are often search problems, not just language problems.
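
The generate-and-filter idea can be sketched in a few lines. This is a toy illustration, not AlphaCode's actual pipeline: the three hard-coded candidates stand in for programs a model would sample, and `filter_candidates` and the example pairs are hypothetical names chosen for this sketch.

```python
from typing import Callable, Iterable

# Hypothetical stand-ins for sampled programs. A real system would sample
# many candidates from a model instead of hard-coding them.
def candidate_a(x: int) -> int:
    return x + x  # doubles the input

def candidate_b(x: int) -> int:
    return x * x  # squares the input

def candidate_c(x: int) -> int:
    return x ** 2  # also squares the input

def filter_candidates(
    candidates: Iterable[Callable[[int], int]],
    examples: list[tuple[int, int]],
) -> list[Callable[[int], int]]:
    """Keep only candidates whose behavior matches every example pair."""
    survivors = []
    for fn in candidates:
        try:
            if all(fn(arg) == expected for arg, expected in examples):
                survivors.append(fn)
        except Exception:
            # A candidate that crashes on an example is discarded.
            continue
    return survivors

# Assumed input/output pairs, as a problem statement might provide them.
examples = [(2, 4), (3, 9)]
survivors = filter_candidates([candidate_a, candidate_b, candidate_c], examples)
survivor_names = [fn.__name__ for fn in survivors]
print(survivor_names)
```

Note that `candidate_a` passes the first example and fails the second, which is the point of the sketch: behavior on concrete inputs, not how plausible the code looks, decides what survives.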

Later that same year, ChatGPT made conversational interaction with a model mainstream, and InstructGPT had already shown why that mattered. Models tuned to follow user intent are more useful than models that merely continue text.

In March 2023, GitHub Copilot X brought that shift directly into software development with chat, pull request assistance, documentation help, and GPT-4 integration. From that point on, the relationship between developer and machine changed. You no longer had to wait for the right completion to appear. You could explain what you wanted, ask for a refactor, request tests, or ask the system to explain unfamiliar code.

Conversation wasn’t enough, the assistant had to see the repo

As soon as coding AI became conversational, a new bottleneck appeared: context. Chat is only as good as what it can retrieve about the project in front of it. GitHub’s repository indexing docs make the shift explicit.

Indexing runs in the background, and once an index exists, Copilot Chat can answer questions about the repository in GitHub and in VS Code. This was the moment coding AI stopped acting like a brilliant stranger and started acting more like a coworker who had at least read the codebase.

At the same time, open code models started adapting more directly to how programmers actually edit. SantaCoder emphasized fill-in-the-middle generation. StarCoder pushed the open-model frontier with broader language coverage and longer context. Code Llama emphasized infilling and larger input windows.

Those details mattered because real developers rarely write left to right from a blank page. They insert, patch, refactor, stub, and repair inside existing systems. The training objective was beginning to match the mechanics of software work.
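
A minimal sketch of how a fill-in-the-middle prompt might be assembled at a cursor position. The sentinel strings below are placeholders: each model family defines its own special tokens, so treat them as assumptions. The helper names `build_fim_prompt` and `splice_completion` are hypothetical, and the "generated" middle is hard-coded rather than produced by a model.

```python
# Placeholder sentinels. Check the model card of the model you use,
# because the exact token spellings differ between model families.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"

def build_fim_prompt(source: str, cursor: int) -> str:
    """Split the source at the cursor and order it prefix, suffix, middle."""
    prefix, suffix = source[:cursor], source[cursor:]
    # The model continues after FIM_MIDDLE, producing the text that
    # belongs between the prefix and the suffix.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"

def splice_completion(source: str, cursor: int, middle: str) -> str:
    """Insert the generated middle back into the file at the cursor."""
    return source[:cursor] + middle + source[cursor:]

# An unfinished function with the cursor right after "return ".
source = "def area(r):\n    return \n"
cursor = source.index("return ") + len("return ")

prompt = build_fim_prompt(source, cursor)
# Hard-coded stand-in for what a model might return for this prompt.
patched = splice_completion(source, cursor, "3.14159 * r * r")
print(patched)
```

The difference from left-to-right completion is that the suffix is part of the prompt, so the model can condition on code that comes after the cursor as well as before it.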

An agent is a model that can act

This is where the modern meaning of “agent” starts to crystallize. A coding model becomes a coding agent when it can do more than generate plausible code. It has to inspect files, call tools, run commands, observe failures, and continue.

The ReAct paper gave the field a crisp conceptual template for interleaving reasoning and action, while OpenAI’s function calling made tool use practical as a product and API pattern. Together, they shifted the field from passive generation toward closed-loop interaction with an environment.

That idea quickly became concrete. The authors of the SWE-agent paper argued that language-model agents needed their own “agent-computer interface” for navigating repositories, editing files, and executing programs.

Devin packaged a shell, editor, and browser inside a sandboxed compute environment. OpenHands turned the same thesis into a more open and composable stack that can run locally, in the terminal, or in CI/CD workflows. In each case, the breakthrough was not just better code generation. It was the ability to take an action, inspect the result, and try again.
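
The act, observe, retry loop can be sketched with a toy environment. Everything here is an assumption made for illustration: `Workspace` holds a single in-memory "file", `scripted_policy` is a hard-coded stand-in for a model call, and none of it reflects how SWE-agent, Devin, or OpenHands are implemented.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Workspace:
    """Toy environment: one in-memory file plus a test command."""
    code: str = "def add(a, b):\n    return a - b\n"  # starts with a bug
    log: list = field(default_factory=list)

    def run_tests(self) -> str:
        namespace: dict = {}
        exec(self.code, namespace)  # "run" the file
        result = namespace["add"](2, 3)
        if result == 5:
            return "PASS"
        return f"FAIL: add(2, 3) returned {result}, expected 5"

def scripted_policy(observation: str) -> Optional[str]:
    """Hypothetical stand-in for a model: maps an observation to a patch."""
    if observation.startswith("FAIL"):
        return "def add(a, b):\n    return a + b\n"
    return None

def agent_loop(
    ws: Workspace,
    policy: Callable[[str], Optional[str]],
    max_steps: int = 5,
) -> bool:
    for _ in range(max_steps):
        observation = ws.run_tests()  # act: run a command
        ws.log.append(observation)    # observe: record the result
        if observation == "PASS":
            return True
        patch = policy(observation)   # decide on the next edit
        if patch is None:
            return False              # the policy has nothing left to try
        ws.code = patch               # act: edit the file, then loop
    return False

ws = Workspace()
solved = agent_loop(ws, scripted_policy)
print(solved, ws.log)
```

The step budget (`max_steps`) is a design choice of this sketch: a loop that feeds failures back into the policy needs some stopping condition other than success.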

Benchmarks stopped asking for functions and started asking for work

The benchmarks tell the history in miniature. In 2021, HumanEval measured whether a model could synthesize a correct function from a docstring. By 2023, the authors of the SWE-bench paper asked whether a system could resolve real GitHub issues in real repositories. That shift is enormous.

The field stopped asking whether a model could produce code that looked competent and started asking whether a system could actually complete software tasks under real constraints.

Then the bar rose again. SWE-bench Verified introduced a human-validated subset for more reliable evaluation. LiveCodeBench focused on contamination-free evaluation and explicitly broadened the target to include self-repair, code execution, and test-output prediction.

Terminal-Bench moved closer still to reality by measuring performance on hard, realistic command-line tasks. Evaluation stopped rewarding code that merely looked plausible and started rewarding systems that could finish real work.

The background agent era

By 2025, the category had changed shape again. As of early 2026, GitHub documents two complementary agent experiences. In VS Code, you can describe what you want to build and let an agent plan, implement, and verify changes across the project. In GitHub itself, Copilot coding agent works in the background as part of the pull request workflow. You assign work, it makes changes, opens a pull request, and then asks for review. The assistant no longer had to wait at the cursor. The agent could take a task and come back with work product.

Other platforms converged on the same pattern. OpenAI reused the Codex name in 2025 for a software engineering agent built for longer-running tasks, while Google’s Jules is explicitly framed as an experimental coding agent that integrates with GitHub, works autonomously inside secure cloud VMs, and can open pull requests with runnable code and test results. The cloud sandbox became the natural habitat of the background coding agent.

The terminal and the editor became control planes

The local interface evolved in parallel. Claude Code is described by Anthropic as an agentic coding tool that reads your codebase, edits files, runs commands, and integrates with development tools. Its GitHub Actions workflow can respond to @claude mentions in issues and pull requests, and Claude Code now supports specialized subagents for task-specific workflows.

Codex CLI brings OpenAI’s coding agent into the terminal, GitHub Copilot CLI is now framed as a terminal-native agent with higher-autonomy modes, and Google’s Gemini CLI powers Gemini Code Assist agent mode. The terminal stopped being just a shell. It became an operating system for agents.

AI-native editors pushed the same logic further. Cursor describes itself as an AI editor and coding agent. Cursor Agent can complete complex tasks, run terminal commands, and edit code, while Cloud Agents run remotely and Automations can trigger agent work on schedules or events.

Windsurf’s Cascade combines planning, code edits, memories, and workflows. The editor was no longer simply where humans wrote code. It became a coordination layer where humans supervise, redirect, and collaborate with agents.

Instructions and integrations became infrastructure

Once agents could act, organizations discovered a new problem. How do you make them act like your team? That is why instruction files and interoperability protocols became central.

Anthropic introduced MCP in late 2024 as an open standard for AI applications to connect to external tools and data sources. At the same time, repository instruction files like AGENTS.md gave coding agents a predictable place to find setup steps, testing commands, architectural guidance, and review expectations.

OpenAI’s docs say Codex reads AGENTS.md files before doing any work. Anthropic’s CLAUDE.md files and auto memory give Claude persistent project context. GitHub supports repository and organization custom instructions, and Cursor exposes persistent Rules. Prompt engineering had become something closer to infrastructure.

This matters historically because it marks another conceptual shift. In the early Copilot era, the prompt was mostly ephemeral: a comment, a function name, a cursor position. In the agent era, the durable instructions matter just as much as the transient request.

Teams now encode setup commands, testing rules, code style, escalation paths, and review standards in files that travel with the repository. That is a very different world from “predict the next line.” It is much closer to giving a new teammate an operating manual.
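
To make "files that travel with the repository" concrete, here is a hedged sketch of how a tool could gather instruction files for the directory it is working in. How each real tool discovers and merges these files is defined by its own documentation. This sketch simply assumes an outermost-first walk from the repository root down to the working directory, and `collect_instructions` and the sample file contents are hypothetical.

```python
import tempfile
from pathlib import Path

# File names mentioned in the article. The lookup order is an assumption.
INSTRUCTION_FILES = ("AGENTS.md", "CLAUDE.md")

def collect_instructions(repo_root: Path, workdir: Path) -> list[Path]:
    """Return instruction files from repo_root down to workdir, outermost first."""
    root = repo_root.resolve()
    target = workdir.resolve()
    # Every directory from the working directory up to the filesystem root...
    chain = [target, *target.parents]
    # ...restricted to the repository, then ordered root-first.
    inside = [d for d in reversed(chain) if d == root or root in d.parents]
    found = []
    for directory in inside:
        for name in INSTRUCTION_FILES:
            candidate = directory / name
            if candidate.is_file():
                found.append(candidate)
    return found

# Build a throwaway repository layout to exercise the function.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    api_dir = root / "services" / "api"
    api_dir.mkdir(parents=True)
    (root / "AGENTS.md").write_text("Run tests with `make test`.\n")
    (api_dir / "AGENTS.md").write_text("Use the staging database.\n")

    files = collect_instructions(root, api_dir)
    relative_names = [f.relative_to(root.resolve()).as_posix() for f in files]

print(relative_names)
```

Ordering the files outermost first lets a more specific directory refine the repository-wide guidance, which is one reasonable convention among several.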

What the history actually shows

The cleanest way to describe this history is not as autocomplete getting smarter, but as the systematic decomposition of software engineering into machine-operable layers. First came code models. Then inline generation. Then conversation. Then codebase awareness. Then tool use. Then background execution. Then persistent memory. Then a layer of review and validation. Each breakthrough solved a bottleneck created by the one before it.

We began by teaching machines to predict code. We are ending, for the moment, by reorganizing software engineering around machines that can take goals, navigate systems, and produce working changes.

Interested in trying out AI code reviews? Get a free 14-day trial.