CodeRabbit, Inc. © 2026


Opus 4.8 benchmark results for AI code review and code generation

by
Juan Pablo Flores
Gowtham Kishore Vijay

6 min read

  • What’s new in Opus 4.8
  • What we tested
  • Results
  • Where it struggled
  • What this means for CodeRabbit users

Anthropic just shipped Opus 4.8. Before its release, we spent some time putting it through its paces, most of it on code review tasks. We ran it against our standard evaluation harness, watched how it behaves on real pull requests, and probed where it holds up and where it strains. Alongside that, we used it for the kind of long-running coding work that tends to break agents before they finish.

On review, it lands at parity with our tuned production ensemble. The surprise was how far it pulled ahead on code generation and long-horizon agentic sessions.

What’s new in Opus 4.8

Three things actually shipped, and everything else is downstream:

  • Long-horizon agentic execution. Opus 4.8 performed well on tasks that span many tool calls. It plans before acting and holds the goal across hours-long sessions. Give it the full spec up front at high effort; drip-feeding requirements performs noticeably worse. It completed more multi-hour, many-file sessions without dropping the thread than any model we've evaluated, and the same intermediate reasoning shows up in stronger code generation.
  • Mid-session system prompts. The messages array now accepts {"role": "system", ...} entries mid-conversation without invalidating prompt caches. The model follows them most reliably as context rather than overrides. It also narrates its plan, second-guesses, and requests more permission than prior Opus versions, all useful behaviors, but ones that require active budgeting and steering.
  • Tool-use recalibration. Web search triggers more often, but runs fewer rounds. Retrieval tools, sub-agents, and memory files trigger less often, defaulting to answering from context. The net effect is high-precision, low-recall behavior, steerable with an explicit instruction.
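As a rough illustration of the mid-session system prompt described above, the sketch below builds a messages list and appends a system entry partway through a conversation. It only constructs the payload and makes no network call. The `content` key, the helper name `add_mid_session_instruction`, and the example text are assumptions for illustration; check the provider's API reference for the exact request shape.

```python
from typing import Dict, List

Message = Dict[str, str]


def add_mid_session_instruction(messages: List[Message], instruction: str) -> List[Message]:
    """Return a new list with a system entry appended after the existing turns.

    Hypothetical helper: the {"role": "system", ...} shape follows the
    description in this post, not a verified API schema.
    """
    # Phrase the instruction as context, since the post reports the model
    # follows mid-session system entries best as context, not as overrides.
    return messages + [{"role": "system", "content": instruction}]


conversation: List[Message] = [
    {"role": "user", "content": "Refactor the billing module to use the new client."},
    {"role": "assistant", "content": "Plan: update imports, then migrate call sites."},
]

updated = add_mid_session_instruction(
    conversation,
    "Context: the public API is frozen this week, so avoid signature changes.",
)
```

The helper returns a new list rather than mutating the original, which keeps the earlier turns untouched as a stable prefix.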

On code review, it lands at parity: actionable pass rate is 61% vs 62% for our baseline, and full-system pass rate is 72% vs 68% at unchanged precision. But the comment mix shifts, and critical findings dipped from 35 to 29, which gives us pause. In the “Results” section below, we dig into why and whether it is recoverable.

CodeRabbit is integrating it selectively where its strengths fit, and routing to other models where they win on cost without sacrificing quality or pass rate.

https://youtu.be/LzgPzQud0zA

What we tested

We ran Opus 4.8 through the same harness we use for every model release: 100 open-source pull requests sampled across trivial, minor, and major complexity tiers. We compared two thinking configurations (a default escalating medium/high/x-high by tier, and a lower-thinking variant running low/medium/high) against a baseline running our current production model mix on the same PRs.
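One way to picture the two thinking configurations is as a lookup from complexity tier to effort level. This is a minimal sketch that assumes the levels map onto the trivial, minor, and major tiers in the order listed; the dictionary and function names are hypothetical and not part of our harness.

```python
# Assumed mapping: effort levels escalate with the complexity tier.
THINKING_BY_TIER = {
    "default": {"trivial": "medium", "minor": "high", "major": "x-high"},
    "lower": {"trivial": "low", "minor": "medium", "major": "high"},
}


def thinking_level(config: str, tier: str) -> str:
    """Look up the thinking level for a config name and a PR complexity tier."""
    try:
        return THINKING_BY_TIER[config][tier]
    except KeyError as exc:
        raise ValueError(f"unknown config or tier: {config!r}, {tier!r}") from exc


level_for_major_pr = thinking_level("default", "major")
```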

Two metrics drive the analysis: pass rate (the fraction of PRs on which the model surfaced the equivalent of what a senior human reviewer would flag) and precision (the fraction of comments that were actionable rather than noise). "Actionable" is adjudicated by senior reviewers.
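To make the two definitions concrete, here is a small sketch of how they could be computed from per-PR review records. The `PRResult` fields and the sample values are illustrative assumptions, not data from our evaluation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PRResult:
    matched_senior_flag: bool  # review surfaced what a senior reviewer would flag
    actionable_comments: int   # comments judged actionable by senior reviewers
    total_comments: int        # all comments the model left on the PR


def pass_rate(results: List[PRResult]) -> float:
    """Fraction of PRs where the review matched the senior reviewer's flag."""
    if not results:
        return 0.0
    return sum(r.matched_senior_flag for r in results) / len(results)


def precision(results: List[PRResult]) -> float:
    """Fraction of all comments that were actionable rather than noise."""
    total = sum(r.total_comments for r in results)
    if total == 0:
        return 0.0
    return sum(r.actionable_comments for r in results) / total


# Made-up sample: 2 of 3 PRs pass, 3 of 10 comments are actionable.
sample = [PRResult(True, 2, 5), PRResult(False, 0, 3), PRResult(True, 1, 2)]
sample_pass_rate = pass_rate(sample)
sample_precision = precision(sample)
```

Note that pass rate is counted per PR while precision is counted per comment, so a model can raise one without moving the other.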

Results

The default Opus 4.8 config edges past our baseline on full-system pass rate (+4pp, 72% vs 68%) and sits within noise on actionable pass rate (61% vs 62%). Precision holds at 33.8% on actionable comments and ticks up a point on the full system.

For a model going head to head with a tuned ensemble on a surface it was not specifically optimized for, that is a strong result, and the cross-file reasoning is clearest on senior-tier PRs.

[Figure: Severity distribution bar chart comparing Baseline, Opus 4.8 default, and Opus 4.8 One Thinking, with a data table of findings by severity for each.]

The comment mix is noisier than baseline, however. Major findings drop from 119 to 81, while minor and nitpick findings both roughly double. The model is shifting volume from the middle of the severity range toward the bottom.
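For a sense of scale, the counts quoted in this post can be turned into relative changes. The helper below is a simple sketch; only the major and critical counts are used, since the exact minor and nitpick counts are not stated in the text.

```python
def relative_change(before: int, after: int) -> float:
    """Relative change from a baseline count to a new count."""
    return (after - before) / before


major_change = relative_change(119, 81)    # major findings, baseline vs Opus 4.8
critical_change = relative_change(35, 29)  # critical findings, baseline vs Opus 4.8

print(f"major: {major_change:.1%}, critical: {critical_change:.1%}")
# prints "major: -31.9%, critical: -17.1%"
```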

The one result that gives us pause is critical findings, which fell from 35 to 29. For a code-review tool, missed criticals matter more than any other category of finding. Our working explanation is that Opus 4.8 follows review instructions literally, so conservative prompts ("only report high-severity issues") suppress recall more than they did with prior models. On that reading, the higher-severity bug-finding capacity is real and shows up once the model is allowed to report broadly and we filter downstream rather than constraining at the source.
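The "report broadly, filter downstream" idea can be sketched as a post-processing step: the prompt asks for findings at every severity, and a separate filter decides what reaches the reviewer. The `Finding` type, the ranking, and the threshold below are illustrative assumptions, not CodeRabbit's production pipeline.

```python
from dataclasses import dataclass
from typing import List

# Severity labels as used in this post, ordered from least to most severe.
SEVERITY_RANK = {"nitpick": 0, "minor": 1, "major": 2, "critical": 3}


@dataclass
class Finding:
    severity: str
    message: str


def filter_findings(findings: List[Finding], min_severity: str = "major") -> List[Finding]:
    """Keep findings at or above min_severity; the model itself is not constrained."""
    threshold = SEVERITY_RANK[min_severity]
    return [f for f in findings if SEVERITY_RANK[f.severity] >= threshold]


# Hypothetical model output produced with a broad, non-conservative prompt.
raw = [
    Finding("nitpick", "Prefer a clearer variable name."),
    Finding("critical", "Unvalidated input reaches the SQL query."),
    Finding("minor", "Missing docstring."),
    Finding("major", "Retry loop never backs off."),
]
surfaced = filter_findings(raw)
```

Because the threshold lives outside the prompt, it can be tuned per repository without changing what the model is asked to look for.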

The lower-thinking variant tells a useful secondary story. Cutting reasoning effort drops precision four points and actionable pass rate five points. Thinking level is a first-class configuration decision.

We also found the default config costs more. We measured $0.20 to $0.28 per call against roughly $0.13 for Opus 4.5 and $0.04 to $0.12 for Sonnet 4.5. On code review alone, the model is at parity, which makes the premium hard to justify for review-only use. It earns its premium on long-horizon agentic and code-generation work. That cost-versus-surface tradeoff is exactly why we route it selectively rather than everywhere.
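The per-call figures translate into a cost multiple fairly directly. The arithmetic below uses only the prices quoted above; the function name is hypothetical.

```python
from typing import Tuple

OPUS_48_PER_CALL = (0.20, 0.28)  # measured range, USD per call
OPUS_45_PER_CALL = 0.13          # approximate, USD per call


def cost_multiple(new_range: Tuple[float, float], baseline: float) -> Tuple[float, float]:
    """Low and high cost multiples of a price range relative to a baseline price."""
    low, high = new_range
    return (low / baseline, high / baseline)


low_multiple, high_multiple = cost_multiple(OPUS_48_PER_CALL, OPUS_45_PER_CALL)
print(f"Opus 4.8 costs about {low_multiple:.2f}x to {high_multiple:.2f}x Opus 4.5 per call")
```

On these numbers the default config runs at roughly 1.5 to 2.2 times the per-call cost of Opus 4.5.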

Where it struggled

Performance degrades visibly once context crosses 200k tokens. The model slows and starts to miss references and edge cases it would have caught cleanly at smaller context sizes. This is an observational finding from hands-on use, not a controlled measurement. CodeRabbit's context engine works around this, but teams using Opus 4.8 directly will hit a wall in monorepos and large codebases.
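Teams calling the model directly may want a simple guard before sending very large contexts. The sketch below uses a crude characters-per-token heuristic, which is an assumption; a real tokenizer would give more accurate counts. The 200k threshold comes from the observation above, and the function names are hypothetical.

```python
from typing import Iterable

CONTEXT_SOFT_LIMIT_TOKENS = 200_000  # observed degradation point from this post
CHARS_PER_TOKEN = 4                  # rough heuristic, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    return len(text) // CHARS_PER_TOKEN


def exceeds_soft_limit(chunks: Iterable[str]) -> bool:
    """True if the combined chunks likely cross the soft context limit."""
    return sum(estimate_tokens(c) for c in chunks) > CONTEXT_SOFT_LIMIT_TOKENS


small_context = ["def handler(event):\n    return event\n"] * 10
needs_trimming = exceeds_soft_limit(small_context)
```

When the guard trips, the caller can drop or summarize the least relevant files before sending the request.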

What this means for CodeRabbit users

We are integrating Opus 4.8 selectively. Its strengths (cross-file reasoning, long-horizon agentic quality, planning under a single up-front spec) show up most on senior-tier changes, so that is where you will see it engaged. For trivial and junior-tier PRs, we continue routing to the models that win on cost and pass rate at those tiers. For our agentic features, we expect Opus 4.8 to be the strongest backbone we have integrated.
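The routing policy described here can be sketched as a small dispatch function. The tier names follow the wording of this section, and the model identifiers are placeholders rather than real API model names or CodeRabbit's actual router.

```python
KNOWN_TIERS = {"trivial", "junior", "senior"}

# Placeholder identifiers for illustration only.
OPUS_MODEL = "opus-4.8"
COST_OPTIMIZED_MODEL = "cost-optimized-model"


def choose_model(tier: str, agentic: bool = False) -> str:
    """Pick a model label for a change based on its tier and whether it is agentic work."""
    if tier not in KNOWN_TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    # Agentic features and senior-tier changes go to Opus 4.8.
    if agentic or tier == "senior":
        return OPUS_MODEL
    # Trivial and junior-tier PRs stay on the cheaper models.
    return COST_OPTIMIZED_MODEL


model_for_senior_pr = choose_model("senior")
model_for_trivial_pr = choose_model("trivial")
```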

If you run Opus 4.8 directly, most existing Opus prompts work without modification. A few tune-ups produced measurable differences in our testing. Start at "high" thinking rather than "x-high" and test across tiers. Front-load the full task context for long-horizon work. Add an explicit search-first or delegation instruction to recover depth on research-heavy work. Drop conservative language from review prompts and filter downstream instead. Name the small decisions the model can make on its own.

We will keep evaluating as the model and our harness change. If the picture shifts, we will publish updated numbers.