# Nvidia Nemotron 3 Super: 120B Parameters, 12B Active, 1M Context
Nvidia's Nemotron 3 Super is a 120B hybrid Mamba-Transformer MoE model that activates only 12B parameters per token. Open weights, 1M context, 5x faster than its predecessor.
A 120-billion parameter model that only uses 12 billion of them at any given moment. That's the pitch for Nemotron 3 Super, and it's a good one — because the architecture behind that ratio is genuinely different from what the rest of the open-weight field is doing.
Released March 11, Nemotron 3 Super combines three ideas that haven't been merged at this scale before: a Mamba state-space backbone (not Transformer-only), a Mixture-of-Experts routing system called LatentMoE, and multi-token prediction for speculative decoding. The result is a model that runs 2.2x faster than comparably-sized Transformers while matching or beating them on reasoning benchmarks — and it's fully open-weight with a permissive commercial license.
## Architecture: Why Mamba + MoE Matters
Most open-weight models — Llama, Qwen, Mistral — are pure Transformers with standard MoE bolted on. Nemotron 3 Super takes a different path. The bulk of its layers use Mamba, a state-space model architecture that processes tokens with roughly 4x better memory and compute efficiency than Transformer attention layers. Strategic global attention layers are placed throughout the stack specifically for tasks that need long-range reasoning.
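The interleaving described above can be sketched as a simple layer plan. The 12-layer depth and the every-fourth-layer attention cadence below are illustrative assumptions, not Nemotron's published layout:

```python
# Hypothetical hybrid stack: mostly Mamba (state-space) blocks, with a global
# attention block every few layers for long-range reasoning. Counts are made up.

def build_layer_plan(n_layers: int, attn_every: int) -> list[str]:
    """Attention at every `attn_every`-th layer, Mamba everywhere else."""
    return [
        "attention" if (i + 1) % attn_every == 0 else "mamba"
        for i in range(n_layers)
    ]

plan = build_layer_plan(n_layers=12, attn_every=4)
# attention lands at layers 4, 8, 12; the other nine layers are Mamba blocks
```

Because the Mamba blocks dominate the stack, most of the per-token work scales linearly with sequence length; the sparse attention layers are there only where global token-to-token comparison is actually needed.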
The LatentMoE system is the other twist. Instead of routing tokens to experts in the full embedding dimension (expensive), it projects token representations into a compressed latent space, routes them to four active experts, then projects back. Same number of experts, fewer floating-point operations per token. Nvidia claims this delivers better accuracy per FLOP than standard MoE designs — and the benchmarks so far back that up.
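A toy forward pass makes the compress-route-expand shape concrete. The dimensions, the softmax gating, and the single-matrix "experts" below are illustrative assumptions, not Nvidia's actual design:

```python
import numpy as np

# Toy LatentMoE-style routing: project down, route among experts in the
# latent space, mix the top-4 expert outputs, project back up.
rng = np.random.default_rng(0)
d_model, d_latent, n_experts, top_k = 64, 16, 8, 4

W_down = rng.normal(size=(d_model, d_latent)) / np.sqrt(d_model)   # compress to latent space
W_up = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_latent)    # expand back
router = rng.normal(size=(d_latent, n_experts))                    # routing happens in latent space
experts = rng.normal(size=(n_experts, d_latent, d_latent))         # one matrix per "expert"

def latent_moe(x):
    z = x @ W_down                            # route in d_latent, not d_model: fewer FLOPs
    logits = z @ router
    top = np.argsort(logits)[-top_k:]         # four active experts per token
    gates = np.exp(logits[top])
    gates /= gates.sum()                      # normalize gates over the selected experts
    mixed = sum(g * (z @ experts[e]) for g, e in zip(gates, top))
    return mixed @ W_up                       # project back to the model dimension

y = latent_moe(rng.normal(size=d_model))      # y.shape == (64,)
```

The FLOP savings come from the middle three lines: every per-token matrix product involving the router and experts happens at `d_latent` width rather than `d_model`.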
## The Numbers
| Spec | Nemotron 3 Super | Qwen 3.5 122B | Llama 4 Maverick |
|---|---|---|---|
| Total params | 120.6B | 122B | 400B |
| Active params | 12.7B | ~22B | ~17B |
| Context window | 1M tokens | 128K tokens | 1M tokens |
| Architecture | Hybrid Mamba-Transformer MoE | Transformer MoE | Transformer MoE |
| License | Nemotron Open (commercial) | Apache 2.0 | Llama Community |
| API price (per 1M input tokens) | $0.30 | $0.30 | $0.20 |
The throughput gap is the story. Nvidia reports 7.5x higher inference throughput than Qwen 3.5 122B and 2.2x over GPT-OSS-120B at 8K input / 16K output — largely because Mamba layers don't have the quadratic attention cost that scales poorly with context length. At 1M tokens of context, Nemotron 3 Super outperforms both competitors on the RULER benchmark, which tests whether models can actually use their full context window rather than just claiming it.
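The scaling argument behind that result is easy to see in back-of-envelope form. The unit costs below are placeholders; only the growth shapes matter:

```python
# Attention compares every token with every other token (quadratic in sequence
# length); a state-space scan touches each token once (linear). Constant
# factors are deliberately ignored here.

def attention_cost(n: int) -> int:
    return n * n        # pairwise interactions

def ssm_cost(n: int) -> int:
    return n            # single scan over the sequence

for n in (8_000, 128_000, 1_000_000):
    ratio = attention_cost(n) // ssm_cost(n)
    print(f"{n:>9} tokens -> attention/SSM cost ratio {ratio:,}x")
```

The gap between the two curves widens with every added token of context, which is why the reported speedups grow as you push toward the 1M-token window.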
## Training at Scale
The training recipe is public — all of it. Two-phase pretraining on 25 trillion tokens total: 20 trillion diverse tokens for breadth, then 5 trillion high-quality tokens to sharpen benchmark accuracy. On top of that, 10 billion reasoning tokens and 15 million coding problems for post-training. Nvidia used their NVFP4 format (a 4-bit floating-point format optimized for Blackwell GPUs) for efficient training, and they're publishing the full methodology, including all 15 reinforcement learning environments used during alignment.
Publishing the training recipe alongside the weights is uncommon. Meta publishes weights but not full training details. Nvidia is giving you both, plus the datasets — which makes Nemotron 3 Super arguably the most reproducible frontier-class open model released to date.
## Where to Run It
The model is available now on Hugging Face (`nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16`, plus FP8 variants), on Nvidia's own build.nvidia.com, and through NIM. Third-party hosting is already live on Google Cloud Vertex AI, Oracle Cloud, CoreWeave, Together AI, Baseten, Cloudflare, DeepInfra, Fireworks AI, Modal, OpenRouter, and Nebius, which is unusually broad availability for launch day.
At $0.30 per million input tokens and $0.82 per million output tokens, it's priced competitively with Qwen 3.5 and well below what its 120B total parameter count would suggest. The 12.7B active-parameter count is what makes that possible: frontier-class reasoning at mid-tier inference cost.
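At those rates, per-request cost is simple arithmetic. The 8K-in / 16K-out shape below mirrors the throughput benchmark cited earlier and is just an example workload:

```python
# Listed API rates: $0.30 per 1M input tokens, $0.82 per 1M output tokens.
PRICE_IN, PRICE_OUT = 0.30, 0.82

def request_cost(in_tokens: int, out_tokens: int) -> float:
    """USD cost of one request at the listed per-million-token rates."""
    return in_tokens / 1e6 * PRICE_IN + out_tokens / 1e6 * PRICE_OUT

print(f"${request_cost(8_000, 16_000):.4f}")   # → $0.0155 for an 8K-in / 16K-out call
```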
## Built for Agents
Nvidia is positioning Nemotron 3 Super explicitly for agentic AI — the kind of multi-step, tool-using workflows where a model needs to plan, execute, verify, and iterate. Their PinchBench score of 85.6% (best-in-class for agentic tasks) supports this. The combination of 1M context, fast inference, and strong tool-use performance makes it a natural fit for systems like NemoClaw and similar orchestration frameworks where the model is running continuously, not just answering one-off questions.
The 476.7 tokens-per-second median throughput across providers isn't just an academic number — it's the difference between an agent that takes 30 seconds to reason through a multi-step task and one that takes 8. For code review pipelines or automated research workflows, that speed compounds across hundreds of agent loops per day.
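That compounding is easy to make concrete. The step counts and per-step token budgets below are hypothetical; only the throughput figure comes from the text:

```python
# Wall-clock sketch for an agent loop at the reported median throughput.
TOKENS_PER_SEC = 476.7   # median across providers, per the text

def loop_seconds(steps: int, tokens_per_step: int) -> float:
    """Generation time for a multi-step loop, ignoring network and tool latency."""
    return steps * tokens_per_step / TOKENS_PER_SEC

print(f"{loop_seconds(10, 400):.1f} s")    # 10 steps x 400 tokens -> ~8.4 s
print(f"{loop_seconds(10, 1500):.1f} s")   # heavier steps -> ~31.5 s
```

Run a few hundred such loops a day through a review or research pipeline and the difference between those two lines is hours of wall-clock time.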
Nvidia spent $26B over five years to get here. With Nemotron 3 Super, they're not just making GPUs for other people's models anymore — they're competing directly in the open-weight model space, and the Mamba-Transformer hybrid architecture gives them a genuine technical differentiator that nobody else at this scale is shipping.