Current AI development is shifting from conversational chatbots toward agentic systems designed to take actions and execute tasks.
A production-grade agent requires an internal architecture that differs from a typical conversational assistant. It must plan multi-step workflows, call external tools reliably, and maintain focus over massive context windows without drifting from the original goal. While general-purpose models emphasize conversational quality, the emerging challenge is building systems that can reliably operate within the constraints of real-world workflows and infrastructure.
NVIDIA’s release of the Nemotron 3 family reflects a deliberate move toward models designed to support agent-style workflows. By moving away from a one-size-fits-all approach, NVIDIA has introduced a specialized family of models, comprising Nano, Super, and Ultra, designed to serve as high-throughput reasoning engines for autonomous agent clusters in fields such as cybersecurity, software development, and manufacturing.
| Feature | Nemotron 3 (Nano/Super/Ultra) | Frontier Models (Gemini/GPT) |
| --- | --- | --- |
| Primary Strength | Inference throughput and agentic speed: up to 3.3x higher throughput than similarly sized models on generation-heavy tasks. | Reasoning depth and multimodality: higher scores on graduate-level logic and complex “Deep Think” reasoning tasks. |
| Context Handling | Native 1M-token window enabled by Mamba, avoiding the high cost of pure attention at scale. | Massive windows (1.2M - 2M), but often at higher per-token cost or latency. |
| Optimization | Built for multi-agent systems: low-latency, high-volume IT ticket automation and workflow orchestration. | Built for interactive human-AI chat: polished communication and stable behavior for IDE assistants. |
| Deployment | Open weights and controllable: developers can customize, optimize, and deploy on private infrastructure for maximum security. | Closed-source API: high dependability and ease of use, but limited control over underlying infrastructure. |
This newsletter presents a detailed technical analysis of the Nemotron 3 family. It examines the hybrid architecture enabling million-token context windows, analyzes the Latent MoE mechanism for compressing reasoning into hardware-efficient latent spaces, and reviews the training approaches—such as RLVR and SteerLM—used to prioritize factual consistency over surface-level plausibility.
To understand why this new architecture is necessary, we must examine the limitations of the current dominant paradigm, retrieval-augmented generation (RAG). In this traditional workflow, a user poses a query, a retrieval system fetches relevant documents from a vector database, and a large language model (LLM) synthesizes an answer based on that retrieved context.
While effective for simple fact-finding, this pipeline is inherently reactive. It acts like a librarian who can only fetch the book you explicitly asked for, rather than a researcher who knows that to answer your question, they must verify the source, cross-reference citations, and formulate a new search strategy.
While traditional RAG (left) suffers from context overflow and linear execution, Nemotron 3 (right) utilizes a hybrid Mamba-Transformer backbone to maintain a 1-million-token “working memory,” enabling the recursive reasoning loops required for autonomous agents.
This limitation has driven the shift toward agentic systems that operate through iterative control loops rather than linear execution paths. However, this agency imposes a crushing computational burden. Agents require massive working memory and inference speeds that make multi-step thinking loops economically viable. A standard Transformer, with its quadratic attention complexity, struggles on both counts: every additional token of context makes each subsequent reasoning step slower and more expensive.
It is within this friction, between the desperate need for agentic reasoning and the computational limits of the Transformer, that Nemotron 3 operates. This report provides an exhaustive technical analysis of the family, dissecting the hybrid Mamba-Transformer architecture, the novel latent MoE mechanism, and the rigorous RLVR training that aligns these models with truth.
Understanding the significance of Nemotron 3 requires context around the engineering constraints that emerged in 2024. As context windows expanded to 128k and beyond, the attention bottleneck became the primary inhibitor of performance. In a standard Transformer, every token must attend to every other token. If an agent is processing a 100,000-token legal repository, the model must calculate interaction scores for roughly 10 billion (100,000²) token pairs at every attention layer.
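The scale of this bottleneck is easy to quantify with back-of-the-envelope arithmetic. The sketch below is illustrative only; real attention kernels add per-head and per-layer constants that it ignores.

```python
# Rough cost comparison: quadratic attention vs. linear state-space scan.
# Illustrative arithmetic only -- real implementations add per-head and
# per-layer constants that this sketch ignores.

def attention_pair_count(n_tokens: int) -> int:
    """Every token attends to every other token: O(n^2) interaction scores."""
    return n_tokens * n_tokens

def ssm_update_count(n_tokens: int) -> int:
    """A state-space model performs one fixed-size state update per token: O(n)."""
    return n_tokens

n = 100_000  # the 100,000-token legal repository from the text
print(f"attention scores : {attention_pair_count(n):,}")
print(f"SSM state updates: {ssm_update_count(n):,}")
print(f"ratio            : {attention_pair_count(n) // ssm_update_count(n):,}x")
```

At a 100k-token context the quadratic term is already five orders of magnitude larger than the linear one, which is the motivation for the hybrid backbone described next.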
NVIDIA’s solution in Nemotron 3 is a hybrid architecture that integrates three distinct paradigms into a single, cohesive backbone:
State space models (Mamba-2)
Attention mechanisms (Transformer)
Mixture-of-experts (MoE)
The primary workhorse of the Nemotron 3 architecture is the Mamba-2 layer. Mamba belongs to a class of architectures known as state space models (SSMs). Unlike Transformers, which keep the entire history of tokens accessible at all times (requiring massive memory), SSMs compress context into a fixed-size hidden state.
Mathematically, Mamba-2 operates with linear complexity, O(n), in sequence length, rather than the O(n²) cost of full self-attention: each new token triggers a constant-size update to the hidden state instead of a comparison against every previous token.
In the context of an agentic workflow, Mamba acts as the peripheral nervous system or the working memory buffer. It efficiently processes large volumes of raw data, such as logs, document streams, and lengthy chat histories, while maintaining a coherent state representation of prior events with minimal memory overhead. This efficiency is the key enabler for the model’s native support of 1 million token context windows, a feat that would require astronomical compute resources with a pure Transformer architecture.
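The fixed-size-state idea can be sketched with a toy linear recurrence. This is a cartoon of the SSM principle, not Mamba-2’s actual selective-scan parameterization:

```python
# Toy state-space recurrence: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t.
# The point is memory behavior: the state stays the same size no matter how
# long the input stream is, unlike a Transformer KV cache that grows with
# every token. Real Mamba-2 uses a fixed-size matrix state per channel with
# input-dependent (selective) parameters; this scalar version is a cartoon.

def ssm_scan(xs, a=0.9, b=0.1, c=1.0):
    h = 0.0                 # fixed-size hidden state (a single scalar here)
    ys = []
    for x in xs:            # one O(1) update per token => O(n) total work
        h = a * h + b * x   # compress the new input into the running state
        ys.append(c * h)    # readout from the compressed state
    return ys, h

ys, final_state = ssm_scan([1.0] * 10_000)
# However long the stream, the "memory" here is still one number.
print(len(ys), final_state)
```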
If Mamba is so efficient, why not eliminate Transformers? The answer lies in the specific cognitive requirements of recall. While state space models are excellent at tracking the general state and flow of information, they struggle with the precise, high-fidelity recall of specific details buried deep in the context, known as the “Needle in a Haystack” problem.
To address this, Nemotron 3 selectively intersperses Transformer attention layers throughout the network. These layers utilize grouped-query attention (GQA) to minimize their own memory footprint, but their primary function is to perform “all-to-all” information routing. When the model encounters a task requiring precise logical deduction or cross-referencing a variable definition from 50,000 tokens ago, the attention layers activate to look back across the entire sequence with perfect clarity.
The synergy is distinct: Mamba handles flow and context maintenance (approximately 90% of the workload), while Attention handles critical reasoning and precise retrieval (approximately 10% of the workload). This architectural decision mirrors the human brain’s duality of subconscious processing (fast, efficient, always on) vs. conscious focus (intense, expensive, selective).
The third pillar of the architecture is the mixture-of-experts (MoE) design. In a traditional dense model (such as Llama 3 70B), every parameter in the neural network is activated for every token generated. If you have 70 billion parameters, you do 70 billion calculations per token. This is incredibly inefficient, as a token representing the word “the” does not require the same computational power as a token representing a complex mathematical derivative.
Nemotron 3 Nano employs a granular MoE architecture. Instead of a single massive feed-forward network, the model’s knowledge is fractured into 128 smaller, specialized expert networks.
Total parameters: 31.6 billion
Active parameters: ~3.2 billion (per token)
Routing: For each token, a learned router network selects the top 6 most relevant experts to process the information.
Shared experts: In addition to the 6 routed experts, 2 shared experts are always active to maintain foundational knowledge and consistency across the model.
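A minimal sketch of this routing step, assuming the figures above (128 experts, top-6 routing, 2 always-on shared experts) and using random scores as a stand-in for the learned router network:

```python
import random

# Granular MoE routing sketch using the figures from the text:
# 128 routed experts, top-6 selection per token, plus 2 shared experts
# that are always active. Random logits stand in for the learned router.

N_EXPERTS = 128
TOP_K = 6
N_SHARED = 2   # always-on shared experts; whether they sit inside or
               # beside the routed pool of 128 is an assumption here

def route_token(router_logits):
    """Return indices of the TOP_K highest-scoring routed experts."""
    ranked = sorted(range(N_EXPERTS), key=lambda i: router_logits[i], reverse=True)
    return ranked[:TOP_K]

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(N_EXPERTS)]
routed = route_token(logits)
active = len(routed) + N_SHARED
# 6 routed + 2 shared = 8 expert networks touched per token, out of 128,
# which is why only a small fraction of the 31.6B parameters is active.
print(routed, active)
```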
This design allows Nemotron 3 Nano to possess the knowledge capacity of a 30B+ model (capable of storing vast facts about world history, coding syntax, and science) while executing inference at the speed of a tiny 3B model. For edge deployment and real-time agents, this is a transformative development. It means a developer can run a model with frontier-class knowledge on a single consumer GPU (like an RTX 4090) or a high-end laptop, enabling sovereign AI implementations where data never leaves the local device.
The Nemotron 3 Nano model consists of 52 total layers. The pattern is not random; it is engineered to balance the compression of Mamba with the precision of attention.
The layer sequence follows a specific repeating block structure:
Mamba-2 block (x5): Five consecutive layers of Mamba-2 integrated with MoE. This provides a deep stack of efficient sequence modeling to compress the incoming data.
Hybrid block A (x3): Three layers alternating between Mamba-2 and attention (both with MoE). This injects precise attention mechanisms to refine the state representation.
Mamba-2 block (x1): A single Mamba layer to transition the state.
Hybrid block B (x4): Four layers of alternating Mamba/Attention.
This structure implies that the model spends the majority of its depth (the Mamba blocks) efficiently processing the gist of the context, while the clustered Attention blocks serve as checkpoints to realign and sharpen the model’s focus. This design enables the model to achieve 3.3 times higher throughput than comparable dense models, such as Qwen-3, while maintaining superior accuracy on long-context benchmarks.
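Assuming the four blocks above form one macro-block repeated four times (4 × 13 = 52 layers), the arithmetic matches the stated depth. Note that the exact Mamba/attention alternation inside the hybrid blocks is an assumption for illustration:

```python
# Reconstruct the 52-layer sequence from the block structure described above.
# "M" = Mamba-2 layer (with MoE), "A" = attention layer (with MoE).
# The exact M/A alternation inside the hybrid blocks is assumed, not documented.

MACRO_BLOCK = (
    ["M"] * 5          # Mamba-2 block (x5)
    + ["M", "A", "M"]  # Hybrid block A (x3): assumed alternation order
    + ["M"]            # Mamba-2 transition layer (x1)
    + ["M", "A"] * 2   # Hybrid block B (x4): assumed alternation order
)

layers = MACRO_BLOCK * 4   # four repeats -> 13 * 4 = 52 layers
mamba = layers.count("M")
attn = layers.count("A")
print(len(layers), mamba, attn)   # attention layers are the small minority
```

Under this assumed layout, attention accounts for 12 of 52 layers, consistent with the article’s claim that attention handles only a minority of the workload.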
While the Nano model represents the state of the art for edge efficiency, the upcoming Super and Ultra models (slated for 2026) introduce two additional technologies that push the boundaries of server-side AI: Latent MoE and multi-token prediction (MTP). These features address the scalability limits of MoE and the sequential nature of text generation.
In standard MoE models (including Nemotron 3 Nano), the routing of tokens to experts happens in the model’s full hidden dimension (d). If the model’s hidden dimension is large (e.g., 4,096 or 8,192 values per token), moving this data to the chosen experts, which might physically reside on different GPUs in a data center cluster, creates a massive communication bottleneck. The “all-to-all” communication required to dispatch tokens and aggregate results becomes the primary drag on training and inference speed.
Latent MoE solves this by introducing a compression step into the routing process.
Down-projection: Before routing, the high-dimensional token vector (dimension d) is projected down into a compact latent representation, e.g., one-quarter of the original size.
Latent execution: The router selects experts based on this compressed latent representation, and the experts themselves operate within this efficient latent space.
Up-projection: The output from the experts is projected back up to the original dimension d before passing to the next layer.
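The three steps reduce to two projections around the expert computation. A minimal sketch with shrunken dimensions and random stand-ins for the learned projection matrices:

```python
import random

# Latent MoE sketch: down-project, route/execute experts in latent space,
# up-project. Dimensions are shrunk (32 -> 8) for readability; the article's
# figures correspond to something like 4096 -> 1024 (4x compression).
D_MODEL, D_LATENT = 32, 8

random.seed(0)

def rand_mat(rows, cols):
    """Random stand-in for a learned projection matrix."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def expert(z):
    """Toy expert that operates entirely in the latent space (ReLU stand-in)."""
    return [max(0.0, x) for x in z]

W_down = rand_mat(D_LATENT, D_MODEL)   # learned down-projection (stand-in)
W_up   = rand_mat(D_MODEL, D_LATENT)   # learned up-projection (stand-in)

token = [random.gauss(0.0, 1.0) for _ in range(D_MODEL)]
z = matvec(W_down, token)      # 1. down-projection: d -> d/4
z_out = expert(z)              # 2. routing + expert execution in latent space
out = matvec(W_up, z_out)      # 3. up-projection back to d
print(len(z), len(out))        # latent is 4x smaller; output restores d
```

The data actually shipped between GPUs is the d/4-sized latent, which is where the 75% reduction in communication volume comes from.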
This mechanism is effectively a reasoning compression algorithm. By reducing data movement by 75% (4x compression), NVIDIA frees up a massive amount of memory bandwidth. Crucially, they do not pocket these savings solely for speed; instead, they reinvest them to enhance the model’s intelligence. The saved bandwidth enables the architecture to support four times more experts and four times more active experts per token, at the same computational cost as a standard MoE.
This allows the Super and Ultra models to achieve unprecedented levels of expert specialization. Instead of having a generic coding expert, the latent MoE architecture can afford to support highly specific experts, such as a Python pandas dataframe expert, a Rust memory safety expert, or a React Hook expert. This granularity is essential for enterprise agents that must navigate complex, domain-specific nuances without hallucinating.
The second frontier capability is multi-token prediction (MTP). Standard large language models are auto-regressive: they function like a person typing with one finger, predicting one character (or token) at a time, looking at what they have just written, and then predicting the next. This serial process, one full forward pass per generated token, makes latency scale linearly with output length and leaves the GPU bound by memory bandwidth rather than compute.
MTP fundamentally changes the model’s objective function. Instead of predicting just the next token, the model is trained to predict several future tokens at once, e.g., four positions ahead, from the same hidden state.
Impact on reasoning: During training, this forces the model to plan. To successfully predict the 4th word in a sentence before writing the 2nd, the model must have a stronger internal representation of the logic and causality of the statement. Ablation studies show that this training objective improves model accuracy on reasoning benchmarks (such as GSM8K and MMLU-Pro) by approximately 2.4%, primarily because the model has learned to think in larger chunks rather than focusing myopically on the next word.
Impact on inference (speculative decoding): During inference, MTP enables a significant speedup through speculative decoding. The model predicts 4 tokens at once. These are treated as drafts. A lightweight verification step checks if these drafts make sense. Because the model is highly accurate (achieving ~97% acceptance rate for the first two tokens), the system can frequently accept multiple tokens in a single compute cycle. This effectively decouples generation speed from memory bandwidth, allowing the Ultra model to generate text at speeds previously reserved for much smaller models.
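The accept-the-agreeing-prefix step at the heart of this loop can be sketched abstractly; the token strings here stand in for real MTP drafts and verifier outputs:

```python
# Speculative-decoding sketch: the model drafts several tokens at once, a
# verification pass checks them left-to-right, and the longest agreeing
# prefix is accepted in a single step. Toy token lists stand in for real
# MTP drafts and verifier outputs; real systems also emit the verifier's
# token at the first disagreement, which this sketch omits.

def accept_prefix(draft, verified):
    """Accept drafted tokens up to the first disagreement with the verifier."""
    accepted = []
    for d, v in zip(draft, verified):
        if d != v:
            break
        accepted.append(d)
    return accepted

# Draft 4 tokens per step; the verifier agrees on the first 3 here,
# so 3 tokens are emitted in one compute cycle instead of 1.
draft    = ["def", "add", "(", "x"]
verified = ["def", "add", "(", "y"]
accepted = accept_prefix(draft, verified)
print(accepted)
```

The higher the draft acceptance rate (the article cites ~97% for the first two tokens), the closer generation speed gets to one multi-token step per cycle.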
A model is only as good as the data it consumes and the feedback it receives. The Nemotron 3 training pipeline represents a shift from learning to mimic human speech (standard large language model, or LLM, training) to learning to be correct (agentic training).
The foundation of Nemotron 3 Nano is a massive pre-training corpus of 25 trillion tokens. For context, this is significantly larger than the datasets used for Llama 3 (~15T). The dataset includes 3 trillion new unique tokens compared to the previous Nemotron 2 generation.
This dataset is not a uniform blob of internet text. It is a carefully curated mixture designed to foster agentic capabilities:
Common Crawl code: 428 billion tokens of code, crucial for logic and tool use.
Synthetic data: A heavy reliance on synthetic data generated by stronger teacher models (like Nemotron 4 340B). This synthetic data is used to create perfect examples of reasoning chains, ensuring the model learns correct logic rather than just reproducing human errors found on the web.
Long-context phase (LC-phase): The training includes a specific final phase where the model is exposed to high-quality, extremely long documents (up to 512k tokens) to stretch its attention span and validate the Mamba memory mechanism.
The most critical innovation in the post-training phase is the adoption of reinforcement learning from verifiable rewards (RLVR), moving beyond the industry-standard reinforcement learning from human feedback (RLHF).
The flaw of RLHF: RLHF relies on human annotators, or proxy models, to rank pairs of model outputs. This approach works well for creative writing or conversational tasks, where quality is inherently subjective. In contrast, for agents executing code or solving mathematical problems, preference is largely irrelevant because outcomes are objectively correct or incorrect. As a result, RLHF can encourage sycophantic behavior, where models produce polite and plausible responses that align with human preference signals but are ultimately incorrect.
The precision of RLVR: RLVR replaces the “vibe check” of humans with programmatic verification.
Math: The model generates a solution. A Python solver checks if the final number is correct. If yes, +1 reward. If no, -1 reward.
Code: The model writes a function. A sandbox compiles the code and runs unit tests. If it passes, +1 reward.
Tool use: The model generates a SQL query. The training environment executes the query against a mock database. If it returns valid data without syntax errors, +1 reward.
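Toy versions of such verifiers (not NVIDIA’s actual training harness) make the contrast with preference ranking concrete. The `solution` entry-point name is an assumption of this sketch:

```python
# RLVR-style verifiable rewards: each checker returns +1 / -1 based on
# ground truth, with no human preference signal involved. These are toy
# verifiers, not the actual Nemotron training harness.

def math_reward(model_answer: str, ground_truth: float) -> int:
    """+1 if the final number matches ground truth, else -1."""
    try:
        return 1 if abs(float(model_answer) - ground_truth) < 1e-9 else -1
    except ValueError:
        return -1   # an unparseable answer counts as wrong

def code_reward(func_src: str, tests: list) -> int:
    """Compile the generated function and run unit tests in-process.
    (A real pipeline would sandbox this instead of exec'ing directly.)"""
    namespace = {}
    try:
        exec(func_src, namespace)
        fn = namespace["solution"]   # assumed entry-point name for this sketch
        return 1 if all(fn(*args) == out for args, out in tests) else -1
    except Exception:
        return -1

print(math_reward("42", 42.0))                       # correct -> +1
print(code_reward("def solution(a, b):\n    return a + b",
                  [((1, 2), 3), ((5, 5), 10)]))      # passes tests -> +1
print(math_reward("41.9", 42.0))                     # wrong -> -1
```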
By training against ground truth rather than human preference, Nemotron 3 aligns itself with objective reality. This is why the model achieves such startlingly high scores on math benchmarks, such as AIME 2025 (99.2% with tool use); it has been rigorously conditioned to verify its own work.
While RLVR handles correctness, SteerLM handles style and behavior. Traditional alignment bakes in a specific personality (usually a helpful, harmless assistant). If a developer wants a terse, technical output for a CLI tool or a verbose, empathetic output for a customer support bot, they typically have to fine-tune the model or use complex prompting.
SteerLM changes this by training the model to understand “attribute vectors.” During training, data is annotated with scores for attributes like quality, humor, creativity, and verbosity.
At inference time, the developer can simply dial in the desired personality by passing these values as metadata.
Legal agent: Creativity: 0, Verbosity: 8, Formal: 10.
Creative writer: Creativity: 10, Humor: 7, Formal: 2.
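How these attributes reach the model at inference time is deployment-specific. The sketch below serializes them into a system-prompt header; this wire format (and the normalized attribute names) is a hypothetical illustration, not NVIDIA’s documented API:

```python
# SteerLM-style attribute steering sketch. The attribute names follow the
# article (creativity, verbosity, humor, formality); the serialization into
# an <attributes> header is a hypothetical format for illustration only.

def steering_prompt(task: str, **attributes: int) -> str:
    """Prefix a task with dialed-in personality attributes as metadata."""
    attr_str = ", ".join(f"{k}:{v}" for k, v in sorted(attributes.items()))
    return f"<attributes> {attr_str} </attributes>\n{task}"

legal = steering_prompt("Summarize the indemnification clause.",
                        creativity=0, verbosity=8, formality=10)
writer = steering_prompt("Write an opening paragraph about autumn.",
                         creativity=10, humor=7, formality=2)
print(legal.splitlines()[0])
```

Because the attributes travel with each request, one deployed model instance can serve both the terse legal agent and the playful creative writer without any fine-tuning.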
This allows a single deployed instance of Nemotron 3 to serve hundreds of different downstream applications, each with a distinct voice, significantly reducing deployment costs for enterprises.
The empirical data support the theoretical advantages of the hybrid architecture and RLVR training. Nemotron 3 Nano, despite its small size (31.6B total parameters, ~3.2B active), outperforms significantly larger models, challenging the “Scale is All You Need” dogma.
The benchmark results position Nemotron 3 Nano not as a competitor to other small models (such as Llama 3 8B), but rather as a rival to frontier-class models like GPT-4 and Qwen-2.5 72B in specific reasoning domains.
| Benchmark | Category | Nemotron 3 Nano (30B) | Qwen-3 (30B-A3B) | Llama 3.1 70B | GPT-4o (Est.) |
| --- | --- | --- | --- | --- | --- |
| AIME 2025 | Adv. Math (w/ Tools) | 99.2% | ~82.3% | ~95.1% | ~96% |
| GPQA Diamond | PhD-Level Science | 75.0% | 49.0% | 66.7% (Super) | ~53% |
| MMLU-Pro | General Reasoning | 77.5% | 71.6% | 68.9% | ~75% |
| IFBench | Instruction Following | 99.2% | 82.6% | 92.1% | N/A |
| RULER | 1M Context Retrieval | 68.2% | ~40% (at 128k) | Fail > 128k | N/A |
The disparity in AIME 2025 scores is particularly telling. Achieving 99.2% on advanced math problems suggests that Nemotron 3 has essentially solved this domain when allowed to use tools (Python). This validates the RLVR approach: the model doesn’t just guess the number; it writes a Python script to calculate it, executes the script, and reports the result. It is acting as an agent, not a text generator.
The RULER benchmark tests a model’s ability to retrieve specific needles of information from massive haystacks of text at varying context lengths. Most Transformer models experience a catastrophic drop in performance when the context exceeds 128k tokens, due to distraction and attention dilution.
Nemotron 3 Nano, utilizing its Mamba-2 memory backbone, maintains an accuracy of 68.2% even at the extreme length of 1 million tokens. This confirms that the linear-time processing of Mamba does not come at the cost of recall. The model effectively compresses the “middle” of the context while keeping it accessible for the attention layers to query. This capability is game-changing for enterprise RAG, allowing companies to feed entire technical manuals, compliance rulebooks, or code repositories into the prompt without fragmenting them into chunks.
For agentic systems, speed is not just a luxury; it is a functional requirement. An agent that needs to think for 30 seconds before acting is unusable for real-time customer service or interactive coding.
Throughput: On a single NVIDIA H200 GPU, Nemotron 3 Nano delivers 3.3x higher throughput than the comparable Qwen-3 (30B) model.
Edge viability: Because only ~3.2B parameters are active per token, the compute requirements for inference are minimal. While the VRAM requirement is high (to store the 30B weights), the compute latency is low. This makes it feasible to run high-intelligence agents on dual RTX 4090 workstations, enabling local, private agent deployments.
For developers and enterprise architects, deploying Nemotron 3 requires navigating a new ecosystem of tools and configurations. The model is not just a drop-in replacement for Llama; it benefits significantly from specific runtimes and prompting strategies.
The granular MoE architecture creates a unique hardware profile: high memory capacity (VRAM) but low compute intensity.
| Context Window | Precision | Required VRAM | Recommended Setup |
| --- | --- | --- | --- |
| 8k tokens | FP8 (quantized) | ~35 GB | 2x RTX 4090 (24 GB) or 1x A6000 |
| 8k tokens | BF16 (full) | ~65 GB | 1x A100 (80 GB) or 3x RTX 4090 |
| 1M tokens | FP8 | ~120 GB | 2x H100 (80 GB) or 8x RTX 4090 |
The 1-million-token window requires a massive amount of VRAM for the KV cache, even with Mamba’s efficiency. For edge deployments, limiting context to 32k or 128k makes the model accessible on consumer hardware.
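These figures can be sanity-checked with simple arithmetic: weight storage is parameter count times bytes per parameter, and everything above that (KV cache for the attention layers, Mamba state, activations) is overhead, which is what dominates the 1M-token row:

```python
# Rough weight-memory estimate for a 31.6B-parameter model. This covers
# weights only; activation memory, the KV cache for attention layers, and
# Mamba state are extra, which is why the table's figures sit higher.

PARAMS = 31.6e9
BYTES = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}   # bytes per parameter

for fmt, b in BYTES.items():
    gb = PARAMS * b / 1e9
    print(f"{fmt:>5}: ~{gb:.1f} GB for weights alone")
```

The BF16 (~63 GB) and FP8 (~32 GB) estimates line up with the ~65 GB and ~35 GB rows above once a few gigabytes of runtime overhead are added.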
One of the most powerful features of Nemotron 3 is the ability to control its “System 2” thinking process. Similar to OpenAI's o1, the model can generate a chain-of-thought rationale before its final answer. Unlike o1, Nemotron 3 gives developers explicit control over this process via the <think> tag and a token budget.
The following code demonstrates how to interact with the model using an OpenAI-compatible client (served via vLLM), enforcing a strict budget on the number of tokens the model can spend reasoning.
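A hedged sketch of such a request follows. The reasoning_budget field name mirrors the article’s description, and the model id, endpoint, and /think toggle are placeholders; check your vLLM version’s documentation for the exact keys it accepts:

```python
import json

# Request sketch for an OpenAI-compatible vLLM endpoint. "reasoning_budget"
# follows the article's description of the feature; the model id, the /think
# system toggle, and the endpoint URL are placeholders, not official values.

def build_request(prompt: str, reasoning_budget: int) -> dict:
    return {
        "model": "nvidia/nemotron-3-nano",             # placeholder model id
        "messages": [
            {"role": "system", "content": "/think"},   # assumed reasoning toggle
            {"role": "user", "content": prompt},
        ],
        # Cap how many tokens the model may spend inside its <think> block.
        "extra_body": {"reasoning_budget": reasoning_budget},
        "max_tokens": 1024,
        "temperature": 0.6,
    }

req = build_request("Plan the migration of this schema to Postgres 16.", 256)
print(json.dumps(req["extra_body"]))

# To send it with the OpenAI Python client against a local vLLM server:
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
#   resp = client.chat.completions.create(**req)
```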
In a commercial setting, every token incurs a cost. An agent that enters an infinite thinking loop can quickly deplete the budget. The reasoning_budget parameter lets enterprises trade reasoning depth (accuracy) against cost and latency.
Nemotron 3 is fine-tuned to work with structured tool definitions. To maximize tool-calling accuracy, NVIDIA recommends specific sampling parameters that differ from standard chat settings.
Temperature: 0.6 (Lower than the standard 1.0, to reduce hallucination in JSON formatting).
Top_p: 0.95.
Tool parser: When running with vLLM, it is critical to enable the specific tool parser plugin (--tool-call-parser qwen3_coder) to ensure the model’s output is correctly interpreted as an API call rather than plain text.
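Putting the recommended sampling settings together with a structured tool definition, a sketch of a tool-calling request follows; the ticket-lookup tool and model id are illustrative, not part of any official example:

```python
import json

# Tool-calling request sketch using the sampling parameters recommended
# above. The ticket-lookup tool schema and model id are illustrative only.
# Server side, vLLM would be launched with the parser plugin enabled, e.g.:
#   vllm serve <model> --tool-call-parser qwen3_coder

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_ticket_status",
        "description": "Look up an IT ticket by id.",
        "parameters": {
            "type": "object",
            "properties": {"ticket_id": {"type": "string"}},
            "required": ["ticket_id"],
        },
    },
}]

request = {
    "model": "nvidia/nemotron-3-nano",   # placeholder model id
    "messages": [{"role": "user", "content": "What's the status of INC-4312?"}],
    "tools": TOOLS,
    "temperature": 0.6,   # lower than the chat default to keep JSON well-formed
    "top_p": 0.95,
}
print(json.dumps(request["tools"][0]["function"]["name"]))
```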
The release of Nemotron 3 is not an act of charity; it is a strategic maneuver by NVIDIA to consolidate its dominance in the AI hardware market through sovereign AI.
The open model license: The model is released under the NVIDIA Open Model License. While permissive (allowing commercial use and derivative works), it is distinct from the truly open Apache 2.0 license.
Permissions: You can fine-tune it, distill it, and build commercial products on top of it.
Restrictions: It typically contains clauses that prevent the use of the model to improve competitors’ foundation models (e.g., you cannot use Nemotron outputs to train a model that competes with Nemotron).
Goal: NVIDIA wants to be the platform. By giving away the weights, they encourage enterprises and nations to build their AI infrastructure on-premise or in sovereign clouds, rather than relying on API providers like OpenAI or Anthropic. This drives demand for NVIDIA GPUs (H100/Blackwell), which are the only chips capable of running these models at peak efficiency.
The NVFP4 Moat: The decision to train the Super and Ultra models natively in NVFP4 (NVIDIA 4-bit floating point) is a masterstroke of ecosystem lock-in.
The format: NVFP4 uses a sophisticated block-wise scaling method that allows 4-bit numbers to retain the dynamic range of 16-bit numbers.
The hardware: Native acceleration for NVFP4 is a unique feature of the Blackwell architecture.
The implication: While the weights are open, running them on competitor hardware (AMD MI300 or Google TPU) requires upcasting them to BF16 or INT8 formats. This conversion incurs a performance penalty (either in speed or accuracy). Thus, the best version of Nemotron 3 will always run on NVIDIA hardware. This creates a soft lock-in: the software is free, but the hardware required to run it optimally is proprietary.
NVIDIA’s Nemotron 3 represents a significant advancement for open model development and signals a shift toward hybrid LLM architectures, rather than purely Transformer-based designs. By combining Mamba-2’s linear-time efficiency with selective self-attention, Nemotron 3 makes million-token context windows practical for long-horizon, agentic workflows.
Nemotron 3 also strengthens this foundation with Latent MoE efficiency and a structured RLVR training pipeline, enabling open models to compete more closely with proprietary systems in reasoning and math. Features like SteerLM and Thinking Budgets make advanced alignment and controllability more accessible, capabilities that were previously limited to closed labs.
| Feature | Nemotron 3 Nano | Nemotron 3 Super (Est.) | Nemotron 3 Ultra (Est.) |
| --- | --- | --- | --- |
| Release Date | Dec 2025 | H1 2026 | H1 2026 |
| Total Params | 31.6 Billion | ~100 Billion | ~500 Billion |
| Active Params | ~3.2 Billion | ~10 Billion | ~50 Billion |
| Architecture | Hybrid Mamba-Transformer | Hybrid + Latent MoE | Hybrid + Latent MoE |
| Expert Routing | Granular (128 Experts) | Latent (High Granularity) | Latent (High Granularity) |
| Context Window | 1 Million Tokens | 1 Million Tokens | 1 Million Tokens |
| Training Precision | BF16 | NVFP4 | NVFP4 |
| Key Capability | Edge Inference, Tool Use | High-Vol. Swarms, RAG | Deep Reasoning, MTP |
| Deployment | 1x A100 / 2x 4090 | 1-2x H100 | Multi-Node H100 Cluster |
Even with powerful models, agentic systems succeed or fail based on the surrounding system design: how you orchestrate workflows, manage memory, integrate tools, and enforce safety controls. If you’re looking to go beyond prototypes and build autonomous systems that withstand production, Educative offers specialized technical tracks that teach software engineers how to design and develop real-world agentic systems.