CodeRabbit, Inc. © 2026

Nemotron 3 Ultra makes the case for fast, open coding models

by
Juan Pablo Flores
Basem Rizk

June 04, 2026

11 min read

  • What we know about Nemotron 3 Ultra
  • What is different this time
  • CodeRabbit Benchmark Performance
  • Where Ultra looks strong for developers
  • What developers should watch out for
  • Verdict

NVIDIA Nemotron 3 Ultra does not feel like another model built primarily for a chat window. The first question is not whether it can win a leaderboard, but whether it can fit into the way developers actually use models now: inside terminals, review pipelines, coding agents, test generators, and workflows where the model has to keep moving through messy context.

NVIDIA is releasing a large open model with roughly 550 billion total parameters and about 55 billion active per token, but the real pitch is speed plus control. If a model is fast enough, a developer can stay in the loop. A system can retry it. A coding harness can keep it working until the task is actually finished.

Ultra is not the model I would frame as "the new best coding assistant." It points toward a world where open models become fast, controllable workers inside developer systems, not just chat interfaces waiting for the next prompt.

For workflows where the model is one part of a larger loop, Nemotron 3 Ultra becomes especially relevant: code review, test generation, repository research, agentic coding, and internal automation where teams care about speed, control, and where the model runs.

Scatter plot showing AI intelligence index versus output speed, with a most attractive quadrant highlighted.

What we know about Nemotron 3 Ultra

Nemotron 3 Ultra is the largest model in NVIDIA's Nemotron 3 family. The family includes Nano, Super, and Ultra, all designed around agentic AI applications. Ultra is the big reasoning engine in that lineup: roughly 550 billion total parameters, with about 55 billion active per token through a sparse mixture-of-experts design.

The cleanest comparison is with Nemotron 3 Super, the previous large model in the family.

| Characteristic | Nemotron 3 Super | Nemotron 3 Ultra |
| --- | --- | --- |
| Role in the family | High-throughput reasoning model for agentic workflows | Largest Nemotron 3 reasoning model for more complex coding, research, and enterprise workflows |
| Total parameters | 120B | 550B |
| Active parameters | 12B active per token | 55B active per token |
| Architecture | Hybrid Mamba-Transformer MoE | Hybrid Mamba-Transformer MoE |
| Expert design | Latent MoE | Latent MoE |
| Context length | Up to 1M tokens | Up to 1M tokens |
| Efficiency features | Multi-token prediction and NVFP4 training/deployment path | Multi-token prediction and NVFP4-oriented deployment path |
| Best fit | High-volume agentic workflows, coding, planning, and tool use | More demanding developer workflows where speed, scale, and stronger reasoning need to sit in the same loop |

In simpler terms: this is not just a bigger dense Transformer. Ultra is built to activate only part of the network per token, keep long context practical, and produce tokens quickly enough that developers can use it interactively instead of treating it like a slow background batch job.
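To make "activate only part of the network per token" concrete, the short sketch below works out the active share from the approximate parameter counts in the comparison above. The dictionary layout and the function name `active_fraction` are illustrative choices, not anything NVIDIA publishes, and the counts are the rounded figures quoted in this post.

```python
# Approximate parameter counts in billions, as quoted in this post.
MODELS = {
    "Nemotron 3 Super": {"total_b": 120, "active_b": 12},
    "Nemotron 3 Ultra": {"total_b": 550, "active_b": 55},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters used for each token in a sparse MoE."""
    return active_b / total_b


for name, params in MODELS.items():
    share = active_fraction(params["total_b"], params["active_b"])
    print(f"{name}: about {share:.0%} of parameters active per token")
```

With these rounded figures, both models use about a tenth of their parameters per token, so Ultra's extra scale comes with a proportionally larger active budget rather than a sparser one.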

Flowchart detailing Nemotron 3 Ultra's fast architecture: long context, Mamba-Transformer, MoE routing, multi-token prediction.

The launch numbers put Ultra in a strong spot. Artificial Analysis reported Nemotron 3 Ultra at 48 on its Intelligence Index, making it the leading US open-weight model in that snapshot, ahead of Gemma 4 31B, Nemotron 3 Super, and gpt-oss-120b. Kimi K2.6 still sits higher at 54, so the claim is not that Ultra owns the entire open frontier. The claim is that it is unusually fast for the intelligence level it reaches.

Artificial Analysis also reported more than 300 output tokens per second on a pre-release DeepInfra endpoint. For developers, that speed is the useful part. In coding, latency changes behavior. If a model is slow, you fire and forget. If it is fast, you stay in the loop, ask follow-ups, run multiple attempts, and let an agentic harness keep pushing.

NVIDIA Nemotron 3 Ultra announcement slide showing a cost comparison graph presented by a speaker.

What is different this time

Nemotron 3 Super already showed that NVIDIA could build a capable open model for agentic workflows. Ultra pushes further in two ways.

First, it is much bigger. Super is around 120B total parameters with roughly 12B active. Ultra moves to roughly 550B total and 55B active. That extra scale shows up in the way NVIDIA and early testers talk about it: not as a small efficient helper model, but as a model that can start taking work from proprietary frontier systems in selected workflows.

Second, Ultra appears to have been trained and evaluated with developer harnesses more directly in mind. NVIDIA mentions that Super turned out to be good in agentic harnesses, while Ultra was built with those harnesses in mind. For coding tools, that changes the requirements. A model that works well in OpenCode, OpenHands, Kilo Code, Continue, or an internal code review loop has to do more than answer questions. It has to follow tool protocols, manage long context, make progress under repeated prompts, and recover when it gets stuck.

Ultra's behavior fits that target. The model is quick, direct, not especially verbose, and unlikely to ask for lots of clarification. That can be a strength in a harness, but a weakness if the task depends on unstated requirements. It benefits from explicit instruction. The best mental model is closer to Codex-style prompting than Claude-style prompting. Spell out the task. Give acceptance criteria. State the expected output format.
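One way to apply that advice is to assemble every prompt from the same three explicit parts. The sketch below is a minimal illustration, not a format NVIDIA or CodeRabbit prescribes: the function name `build_prompt`, the section labels, and the sample task are all hypothetical.

```python
from typing import List


def build_prompt(task: str, acceptance_criteria: List[str], output_format: str) -> str:
    """Assemble an explicit prompt: the task, how success is judged, and the output shape."""
    criteria = "\n".join(f"- {item}" for item in acceptance_criteria)
    return (
        f"Task:\n{task}\n\n"
        f"Acceptance criteria:\n{criteria}\n\n"
        f"Output format:\n{output_format}\n"
    )


# Hypothetical example values, for illustration only.
prompt = build_prompt(
    task="Add input validation to the parse_config function.",
    acceptance_criteria=[
        "Reject empty file paths with a clear error",
        "Include unit tests for the new checks",
    ],
    output_format="A unified diff followed by a one-paragraph summary.",
)
print(prompt)
```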

CodeRabbit Benchmark Performance

CodeRabbit's internal benchmark gives a more grounded view than launch charts alone. The benchmark compares a baseline set of review models against a Nemotron 3 Ultra configuration across 105 evaluation problems, ranging from easier issues to harder review tasks. The evaluation uses post-pipeline final comments after verification, deduplication, and assertive filtering. The judge was gpt-5.1 with medium reasoning, low verbosity, single mode, and three votes.

CodeRabbit benchmark table compares Baseline average and Nemotron 3 Ultra performance metrics.

The top-line result is close:

  • Baseline average, N=3: 60/105 pass actual, or 57 percent
  • Nemotron 3 Ultra average, N=2: 58/105 pass actual, or 56 percent
  • Baseline pass full: 66/105, or 63 percent
  • Nemotron 3 Ultra pass full: 65/105, or 62 percent
  • Baseline precision actual: 34.0 percent
  • Nemotron 3 Ultra precision actual: 33.0 percent

The positive read: on this review workload, Ultra was roughly in the same band as the baseline on pass metrics. It found real issues, survived the review pipeline, and produced useful CodeRabbit-style comments.

The caveat is reliability. The model had a high retry rate. The benchmark summary shows an average of 36.5 retries for the Ultra runs, compared with 0.3 for the baseline. The retry distribution notes that about 66 percent of those retries were scratchpad-only. In practice, the model sometimes stops on its own before producing the required output marker or final structured output. Retrying without changing the prompt often works, which suggests the capability is there, but first-attempt completion is not stable enough to ignore.

The practical finding from the CodeRabbit data is clear: Nemotron 3 Ultra can do the work, but it should be wrapped in validation and retry logic for structured-output tasks.
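A minimal sketch of that kind of wrapper is shown here. It is not CodeRabbit's pipeline: the marker string `END_MARKER`, the function names, and the `call_model` callable (standing in for whatever client sends a prompt and returns text) are assumptions made for illustration.

```python
from typing import Callable, Optional, Tuple

# Hypothetical end-of-output marker the prompt asks the model to emit.
END_MARKER = "<<END_OF_REVIEW>>"


def is_complete(response: str) -> bool:
    """Accept a response only if it ends with the required marker."""
    return response.rstrip().endswith(END_MARKER)


def generate_with_retry(
    call_model: Callable[[str], str],
    prompt: str,
    max_attempts: int = 5,
) -> Tuple[Optional[str], int]:
    """Resend the same prompt until the output validates or attempts run out.

    Returns (response, attempts_used); response is None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        response = call_model(prompt)
        if is_complete(response):
            return response, attempt
    return None, max_attempts


# Stand-in model that stops early twice, then finishes properly.
_replies = iter(["scratchpad only", "scratchpad only", "Found a bug.\n" + END_MARKER])
result, attempts = generate_with_retry(lambda _prompt: next(_replies), "Review this diff")
print(f"validated after {attempts} attempts")
```

The prompt is deliberately left unchanged between attempts, matching the observation that a plain retry often succeeds. Returning the attempt count lets the caller log retry rates alongside quality.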

There is also an interesting latency signal. In the benchmark, the Ultra run shows a mean latency of 7:06 per full review trace, compared with 8:31 for the baseline. That is not an enormous difference in this specific report, but the Ultra runs were carrying a large retry burden and still remained competitive on time. NVIDIA's framing around Ultra repeatedly returns to the same idea: if the model is fast enough, several attempts can still beat one slower, more careful attempt.
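The trade-off behind that framing can be sketched with simple arithmetic. Assuming each attempt succeeds independently with a fixed probability, the expected number of attempts is one over that probability, so expected time to a usable result is time per attempt divided by success rate. The numbers below are made up for illustration and are not taken from the benchmark.

```python
def expected_time_to_success(seconds_per_attempt: float, success_rate: float) -> float:
    """Expected seconds until the first success, assuming independent attempts.

    With success probability p per attempt, the expected attempt count is 1 / p.
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return seconds_per_attempt / success_rate


# Hypothetical models: a fast one that often needs a retry, a slow one that rarely does.
fast = expected_time_to_success(seconds_per_attempt=20.0, success_rate=0.5)
slow = expected_time_to_success(seconds_per_attempt=60.0, success_rate=0.9)
print(f"fast model: {fast:.1f}s expected, slow model: {slow:.1f}s expected")
```

Under these assumed values the faster model still wins on expected time despite failing half its attempts. The independence assumption is a simplification: real retries can fail for correlated reasons.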

The cost story is less clean in the benchmark. The reported total cost for the Ultra run is higher than the baseline in this specific table. That should not be over-generalized, because internal fallback rates, hosted endpoint pricing, and retry behavior can dominate a local experiment. The public NVIDIA and Artificial Analysis story is about cost-to-completion and throughput. The CodeRabbit results say something narrower: on this benchmark, quality was close, speed was competitive, and the reliability control loop needs work.

Where Ultra looks strong for developers

The strongest use case for Nemotron 3 Ultra is not "replace every coding model." It is "run a lot of useful developer work quickly, with explicit instructions and external checks."

It looks promising for:

  • Code review pipelines where comments can be verified, filtered, deduplicated, and retried
  • Integration test generation, especially when the model needs to read broad context
  • Repository research tasks that require scanning many files or documents
  • Agentic workflows where a harness can keep the model moving until the task is complete
  • Everyday coding tasks that benefit from fast iteration more than perfect one-shot reasoning

NVIDIA also shared a useful example: Ultra was used in OpenCode to read several papers and reason across them. That is not a PhD-level coding challenge, but it is exactly the kind of everyday developer task where speed changes the workflow. You can stay in the terminal, watch the model move, and keep steering.

For CodeRabbit-style work, the model also seems especially interesting on easier and medium-difficulty review tasks. These are still valuable reviews: the system needs to catch practical issues, explain them clearly, and produce a lot of review output without waiting on a more expensive frontier model every time.

What developers should watch out for

Ultra needs structure. If you are using it for coding or developer automation, do not treat it as a free-form chat model and hope it infers the workflow. Give it a harness. Give it a checklist. Give it stop conditions. Give it output validation.

Practical guidance:

  • Use explicit prompts with concrete acceptance criteria.
  • For structured output, validate the required markers or schema before accepting the response.
  • Add retry logic for premature stops.
  • Use goal loops or external completion checks so the model keeps working until the task is actually done.
  • Ask for tests explicitly. In early hands-on use, the model did not always generate its own tests.
  • Be specific about design requirements. It can produce better visual artifacts than expected, but design is not its core strength.
  • Prefer it for high-throughput workflows where several attempts are acceptable.
  • Be cautious for workflows where a single malformed output can break production automation.
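The goal-loop item in the list above can be sketched as a small harness that keeps calling the model until an external check passes or a step budget runs out. This is an illustrative outline only: `goal_loop`, `step`, and `check` are hypothetical names, and in a real system `check` might run a test suite or a schema validator.

```python
from typing import Callable, List, Tuple


def goal_loop(
    step: Callable[[str, str], str],
    check: Callable[[str], Tuple[bool, str]],
    task: str,
    max_steps: int = 8,
) -> Tuple[bool, List[str]]:
    """Call the model until an external check says the task is done.

    step(task, feedback) returns the model's next output.
    check(output) returns (done, feedback); the feedback goes into the next step.
    """
    feedback = ""
    history: List[str] = []
    for _ in range(max_steps):
        output = step(task, feedback)
        history.append(output)
        done, feedback = check(output)
        if done:
            return True, history
    return False, history


# Stand-ins: the "model" only adds tests after being told they are missing.
def fake_step(task: str, feedback: str) -> str:
    return "patch + tests" if "tests" in feedback else "patch"


def fake_check(output: str) -> Tuple[bool, str]:
    return ("tests" in output, "" if "tests" in output else "No tests found; add tests.")


done, history = goal_loop(fake_step, fake_check, "Fix the parser bug")
print(done, history)
```

The step budget acts as the stop condition, so a model that never satisfies the check cannot loop forever.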

This model also changes how teams should think about benchmarking. A pure one-shot benchmark may underrate Ultra if the real product loop allows retries. A benchmark that ignores retries may overrate it if the product needs strict first-attempt formatting. The right metric is probably closer to time-to-usable-completion, with quality, retries, latency, and cost all measured together.
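One possible way to record such a metric is sketched below. It assumes attempts run one after another and that each attempt is logged with its latency, cost, and whether the output was usable. The class and field names are hypothetical, not part of any CodeRabbit or NVIDIA tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Attempt:
    latency_s: float   # wall-clock time for this attempt
    cost_usd: float    # cost charged for this attempt
    usable: bool       # did the output pass validation and quality checks?


@dataclass
class Summary:
    completed: bool
    retries: int
    time_to_usable_s: float
    total_cost_usd: float


def summarize(attempts: List[Attempt]) -> Summary:
    """Accumulate time and cost across sequential attempts up to the first usable one."""
    elapsed = 0.0
    cost = 0.0
    for index, attempt in enumerate(attempts):
        elapsed += attempt.latency_s
        cost += attempt.cost_usd
        if attempt.usable:
            return Summary(True, index, elapsed, cost)
    # No usable output: report everything that was spent.
    return Summary(False, max(len(attempts) - 1, 0), elapsed, cost)


# Illustrative log: two early stops, then a usable result.
log = [Attempt(10.0, 0.01, False), Attempt(12.0, 0.01, False), Attempt(11.0, 0.01, True)]
print(summarize(log))
```

Reporting retries, elapsed time, and cost in one record keeps a fast model with many retries and a slow model with few on the same footing.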

Verdict

Nemotron 3 Ultra is one of the most interesting open model releases for developers because it is not only chasing intelligence. It is chasing usable throughput.

The model is big, open, and fast. Public benchmarks put it near the top of US open-weight intelligence while keeping it far ahead of many peers on output speed. CodeRabbit's benchmark adds a more sober picture: Ultra can perform close to a strong review baseline, but it currently needs retries and external validation for structured-output reliability.

The verdict is nuanced. If you want a model that will nail every strict format on the first try, Ultra is not yet the safest default. If you are building an agentic developer system where the harness can validate, retry, and keep pressure on the model until the work is complete, Ultra becomes much more compelling.

For coding teams, the bigger story is not whether Nemotron 3 Ultra replaces a favorite chat model. It is whether open, high-throughput coding agents are starting to feel practical.

Try it on CodeRabbit PR reviews now and let us know your thoughts.