Claude Sonnet 4.5: Better performance but a paradox

by David Loker

October 03, 2025 · 6 min read

  • Benchmark: What we looked for
  • Scoreboard - Sonnet 4.5 gets closer to Opus 4.1 in performance
  • Style and tone: Sonnet 4.5 is focused on hedging
  • What Sonnet 4.5 is good at
  • Sonnet 4.5: Hits a price vs. performance sweet spot
  • Sonnet 4.5 weaknesses
  • Sonnet 4.5 verdict
  • Closing thoughts
Sonnet 4.5 is Anthropic’s newest Claude model, and in our code review benchmark it feels like a paradox: more capable, more cautious, and at times more frustrating. It catches bugs Sonnet 4 missed, edges closer to Opus 4.1 in coverage, and even surfaces a handful of unexpected critical issues off the beaten path.

Yet it hedges, it questions itself, and it sometimes sounds more like a thoughtful colleague than a decisive reviewer. The data shows real progress: 41.5% of Sonnet 4.5’s comments were Important, versus only 35.3% for Sonnet 4. But the tone and texture of those comments raise deeper questions about what we want in an AI reviewer.

And then there’s the kicker: Sonnet 4.5 gets you close to Opus-level performance at a fraction of the price, making it a pragmatic sweet spot for teams reviewing code at scale.

Sonnet 4.5 thinks aloud and still delivers decisive fixes, but some of its comments are framed as vague “conditional” warnings that can be harder to parse. Let’s dive into our benchmark.

Benchmark: What we looked for

We evaluated Sonnet 4.5, Sonnet 4, and Opus 4.1 across 25 difficult real-world pull requests containing known critical bugs (ranging from concurrency and memory ordering to async race conditions and API misuse). A model “passed” a PR if it produced at least one comment directly on the critical issue.

We measured coverage (S@25), precision (comment PASS rate), and signal-to-noise ratio. For signal-to-noise we focus on Important comments, the comments that matter most (scored roughly as in the sketch after this list). They include:

  • PASS comments that correctly addressed the known critical bug in the PR.

  • Other important comments that did not solve the tracked issue, but still flagged a truly Critical or Major bug elsewhere.
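To make the scoring concrete, here is a minimal sketch of how these three numbers can be computed from labeled comments. The label names and data layout are illustrative, not our internal benchmark harness.

```python
# Minimal scoring sketch; labels and layout are illustrative, not our
# internal benchmark harness.
from dataclasses import dataclass

@dataclass
class Comment:
    # "PASS" (hits the tracked bug), "OTHER_IMPORTANT" (flags a different
    # Critical/Major bug), or "MINOR" (everything else)
    label: str

def score(prs: list[list[Comment]]) -> dict[str, float]:
    comments = [c for pr in prs for c in pr]
    passed_prs = sum(any(c.label == "PASS" for c in pr) for pr in prs)
    important = sum(c.label in ("PASS", "OTHER_IMPORTANT") for c in comments)
    return {
        "coverage_S@25": passed_prs / len(prs),
        "precision": sum(c.label == "PASS" for c in comments) / len(comments),
        "important_share": important / len(comments),
    }
```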

Scoreboard - Sonnet 4.5 gets closer to Opus 4.1 in performance

The results were mixed:

  • Coverage: Sonnet 4.5 closes much of the gap to Opus 4.1 and lands far ahead of Sonnet 4.

  • Precision: Opus 4.1 still produces the cleanest, most reliable actionable comments, but that is to be expected given that it’s a more expensive model.

  • Important share (i.e. percentage of comments flagging a significant issue): With stricter criteria, Sonnet 4.5 lands at just over 41% Important share. That means about 4 in 10 of its comments either solved the key bug or flagged another truly significant issue. Opus 4.1 leads here at 50%, with Sonnet 4 at ~35%.

Style and tone: Sonnet 4.5 is focused on hedging

Sonnet 4.5’s comments patch the code, but in a less confident tone than Opus 4.1’s; they are still more confident than Sonnet 4’s.

Patches present:

  • 87% of Sonnet 4.5’s actionable comments included a code block or diff patch, similar to Sonnet 4 (90%) and Opus 4.1 (91%).

  • The difference is in style: Opus’s diffs read like surgical fixes, while Sonnet 4.5 often couches them in exploratory text. It “suggests” or “considers” changes rather than asserting them.

Hedging language:

  • Sonnet 4.5 hedges in 34% of actionable comments, using words like might, could, and possibly. For example:

    • “Unnecessary allocation: cache is never used. The constructor allocates 4KB of memory that is never utilized … Consider removing the cache_buffer.”

    • “Remove the empty try/except block. … likely a placeholder”

  • Opus 4.1 is steady at ~28%. Sonnet 4 sits slightly lower at ~26%.

  • This hedging creates an “interrogative” tone: Sonnet 4.5 sometimes feels like it’s thinking out loud with you, rather than delivering verdicts.

Confident language:

  • Sonnet 4.5 balances that hedging with higher confidence markers (39%) than Sonnet 4 (18%) or Opus 4.1 (23%). For example:

    • “Critical: Missing self. prefix breaks all API methods. All subsequent methods will raise AttributeError until this is corrected.”

    • “Potential integer overflow. optimization_cycle_count increments unbounded … this will overflow after ~414 days of runtime.”

  • In other words, it swings between caution and certainty more dramatically. (A rough sketch of how such markers can be counted follows this list.)
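To put numbers like 34% and 39% in context, here is a rough sketch of the kind of marker counting involved. The word lists are illustrative; the real classification is more nuanced than matching a handful of words.

```python
import re

# Illustrative marker lists; actual hedging/confidence classification is
# more nuanced than matching a handful of words.
HEDGES = {"might", "could", "possibly", "consider", "likely", "potentially"}
CONFIDENT = {"critical", "will", "must", "breaks", "always", "never"}

def has_marker(comment: str, markers: set[str]) -> bool:
    words = set(re.findall(r"[a-z]+", comment.lower()))
    return bool(words & markers)

def marker_rates(comments: list[str]) -> dict[str, float]:
    n = len(comments)
    return {
        "hedged": sum(has_marker(c, HEDGES) for c in comments) / n,
        "confident": sum(has_marker(c, CONFIDENT) for c in comments) / n,
    }
```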

Signal-to-noise:

  • Sonnet 4.5 improved precision over Sonnet 4, but still produced more “minor” off-target notes than Opus.

However, when you count its true Important comments (PASS comments plus a small number of high-confidence issues outside the tracked bug), it lands at 41.5% Important share. Opus 4.1 is still the gold-standard Anthropic model at ~50%.

What Sonnet 4.5 is good at

Across the PRs we tested Sonnet 4.5 with, we saw some clear areas where it stood out.

  • Concurrency bug-finding: Sonnet 4.5 nailed C++ atomics and condition-variable misuses with clean, actionable diffs (see the sketch after this list).

  • Consistency checks: It reliably flagged distributed state mismatches across services.

  • Extra bug surfacing: It did identify additional Critical issues not originally under evaluation, though fewer than initially expected under a stricter rubric.
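The benchmark PRs here were C++, but the condition-variable misuse pattern is easy to show in any language. Here is a hypothetical Python analog (not one of the benchmark PRs): waiting on a condition with if instead of while, which breaks under spurious wakeups or competing consumers.

```python
import threading

queue: list[int] = []
lock = threading.Lock()
not_empty = threading.Condition(lock)

def producer(item: int) -> None:
    with not_empty:
        queue.append(item)
        not_empty.notify()

def consumer_buggy() -> int:
    with not_empty:
        if not queue:          # BUG: a woken thread never re-checks the
            not_empty.wait()   # predicate; a competing consumer (or a
        return queue.pop(0)    # spurious wakeup) leaves this IndexError-prone

def consumer_fixed() -> int:
    with not_empty:
        while not queue:       # FIX: always re-check the predicate in a loop
            not_empty.wait()
        return queue.pop(0)
```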

Anthropic’s marketing for Sonnet 4.5 emphasizes “hybrid reasoning” and “long-horizon” planning. In practice, that shows up as more willingness to chase down side paths in the code and note real but untracked issues.

Sonnet 4.5: Hits a price vs. performance sweet spot

One of the biggest advantages of Sonnet 4.5 is its price-to-performance ratio. While Opus 4.1 remains Anthropic's flagship model in raw capability, it also comes at a significantly higher cost.

Sonnet 4.5 narrows the gap in coverage and important bug-finding while staying far more cost-efficient to run. For many teams, getting close to Opus-level results at a fraction of the price is what makes Sonnet 4.5 the most pragmatic choice. The rough arithmetic below shows the scale of the difference.
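As a back-of-the-envelope illustration, using Anthropic’s published list prices at the time of writing (USD per million tokens; check current pricing) and a hypothetical review that reads ~30k tokens of diff and context and writes ~2k tokens of comments:

```python
# List prices at the time of writing (USD per million tokens); verify
# current pricing before relying on these numbers.
PRICES = {
    "claude-sonnet-4.5": {"input": 3.00, "output": 15.00},
    "claude-opus-4.1": {"input": 15.00, "output": 75.00},
}

def review_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

for model in PRICES:
    print(model, f"${review_cost(model, 30_000, 2_000):.2f} per review")
# claude-sonnet-4.5 $0.12 per review
# claude-opus-4.1 $0.60 per review  -> roughly 5x the cost per review
```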

Sonnet 4.5 weaknesses

If you use Sonnet 4.5, though, it’s important to be aware of its weaknesses. These include:

  • Deadlock coverage: Like Sonnet 4 and even Opus, it still struggles to trace complex lock ordering.

  • Verbosity and hedging: Many comments run long, heavily caveated, or uncertain. Compare this to GPT-5 Codex, which in our earlier work wrote comments that “read like patches” with crisp directness. For example, from GPT-5 Codex:

    • Lock ordering / deadlock: “Reorder the lock acquisitions to follow a consistent hierarchy. This prevents circular wait deadlocks.”

    • Regex catastrophic backtracking: “Remove the nested quantifier to avoid catastrophic backtracking.” (A sketch of this failure mode follows the list.)

  • Precision gap: At 35% comment-level precision and 41.5% Important share, it’s better than Sonnet 4 but well short of Opus 4.1.
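To make the backtracking comment concrete, here is a hypothetical instance of the nested-quantifier failure mode (illustrative, not taken from a benchmark PR):

```python
import re
import time

bad = re.compile(r"(a+)+$")   # nested quantifier: exponential backtracking
good = re.compile(r"a+$")     # matches the same strings here, in linear time

s = "a" * 24 + "b"            # a near-miss input that forces backtracking

t0 = time.perf_counter()
bad.match(s)                  # each extra 'a' roughly doubles the runtime
print(f"bad:  {time.perf_counter() - t0:.2f}s")

t0 = time.perf_counter()
good.match(s)                 # fails immediately
print(f"good: {time.perf_counter() - t0:.6f}s")
```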

Sonnet 4.5 verdict

Sonnet 4.5 feels less like a teacher writing in red pen and more like a thoughtful colleague at your side: pointing out possible issues, often right, occasionally over-hedged, and sometimes spotting things you didn’t know were there.

That style is a double-edged sword in review. On one hand, developers may appreciate the extra critical issues it flags. On the other, when the task is “please catch this bug,” Opus 4.1 is still sharper.

Closing thoughts

Anthropic positioned Sonnet 4.5 as a step toward agentic reasoning and computer use. In code review, that reasoning shows up in richer, more cautious, and more wide-ranging comments.

For teams:

  • If you value decisive, patch-like feedback, Opus 4.1 (or GPT-5 Codex) still sets the bar.

  • If you want a reviewer that finds critical issues anywhere they lurk, even beyond the tracked bug, Sonnet 4.5 has surprising upside.

  • And if you care about pragmatic price-to-performance, Sonnet 4.5 may be the smartest choice: close to Opus’s accuracy at a fraction of the cost.

Either way, Sonnet 4.5 changes the texture of reviews. It feels more human—not always cleaner, but more inquisitive, more hedged, sometimes more right in the places you weren’t looking.