We are excited to announce that SGLang supports the latest highly efficient NVIDIA Nemotron 3 Nano model on Day 0!
Nemotron 3 Nano, part of the newly announced open Nemotron 3 family, is a compact MoE language model offering industry-leading compute efficiency and accuracy, enabling developers to build specialized agentic AI systems.
Nemotron 3 Nano is fully open, with open weights, datasets, and training recipes, so developers can easily customize, optimize, and deploy the model on their own infrastructure for maximum privacy and security. The chart below shows that Nemotron 3 Nano sits in the most attractive quadrant of the Artificial Analysis Openness vs. Intelligence Index.

NVIDIA Nemotron 3 Sets a New Standard for Open Source AI
TL;DR
- Architecture:
  - Mixture of Experts (MoE) with hybrid Transformer-Mamba architecture
  - Supports a thinking budget for optimal accuracy with minimal reasoning-token generation
- Accuracy:
  - Leading accuracy on coding, scientific reasoning, math, and instruction following
- Model size: 30B total parameters, 3.6B active
- Context length: 1M tokens
- Model input: Text
- Model output: Text
- Supported GPUs: NVIDIA RTX Pro 6000, DGX Spark, H100, B200
- Get started:
  - Technical report to build custom, optimized models with Nemotron techniques.
Installation and Quick Start
For an easier setup with SGLang, refer to our getting started cookbook, available here or through the NVIDIA Brev launchable.
Run the command below to install dependencies:

```shell
pip install sglang==0.5.6.post2.dev7852+g8102e36b5 --extra-index-url https://sgl-project.github.io/whl/nightly/
```
We can then serve the model:

```shell
# BF16
python3 -m sglang.launch_server --model-path nvidia/NVIDIA-Nemotron-Nano-3-30B-A3B-BF16 --trust-remote-code --reasoning-parser nano_v3 --tool-call-parser qwen3_coder

# Swap out the model name for FP8
python3 -m sglang.launch_server --model-path nvidia/NVIDIA-Nemotron-Nano-3-30B-A3B-FP8 --trust-remote-code --reasoning-parser nano_v3 --tool-call-parser qwen3_coder
```
Once the server is up and running, you can prompt the model using the snippet below:

```python
from openai import OpenAI

# The model name we used when launching the server.
SERVED_MODEL_NAME = "nvidia/NVIDIA-Nemotron-Nano-3-30B-A3B-BF16"
BASE_URL = "http://localhost:30000/v1"
API_KEY = "EMPTY"  # The SGLang server doesn't require an API key by default.

client = OpenAI(base_url=BASE_URL, api_key=API_KEY)
resp = client.chat.completions.create(
    model=SERVED_MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Give me 3 bullet points about SGLang."},
    ],
    temperature=0.6,
    max_tokens=512,
)
print(resp.choices[0].message.reasoning_content, resp.choices[0].message.content)
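The `--reasoning-parser nano_v3` flag is what populates `reasoning_content` separately from `content` in the response above. Conceptually, the parser splits the raw completion at the model's reasoning delimiters; the sketch below illustrates the idea with an assumed `<think>...</think>` delimiter, which is not necessarily the nano_v3 parser's actual token format:

```python
def split_reasoning(raw: str, open_tag: str = "<think>", close_tag: str = "</think>"):
    """Split a raw completion into (reasoning_content, content).

    Conceptual sketch only: the delimiter strings are an assumption,
    not the actual tokens used by SGLang's nano_v3 parser.
    """
    start = raw.find(open_tag)
    end = raw.find(close_tag)
    if start == -1 or end == -1:
        return None, raw  # no reasoning block found
    reasoning = raw[start + len(open_tag):end].strip()
    content = (raw[:start] + raw[end + len(close_tag):]).strip()
    return reasoning, content

reasoning, answer = split_reasoning(
    "<think>The user wants three bullets about SGLang.</think>- SGLang is a fast serving framework ..."
)
print(answer)  # - SGLang is a fast serving framework ...
```

The server does this parsing for you, so client code only ever sees the two fields already separated.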
Nemotron 3 Nano provides the highest efficiency with leading accuracy for building AI agents
Nemotron 3 Nano builds on the hybrid Mamba-Transformer architecture by replacing standard feed-forward network (FFN) layers with MoE layers and most of the attention layers with Mamba-2. This enables higher accuracy while using only a fraction of the active parameters. By leveraging MoE, Nemotron 3 Nano reduces compute demands and satisfies the tight latency constraints required for real-world deployment.
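To make the "fraction of the active parameters" point concrete, here is a minimal, illustrative sketch of top-k expert routing in an MoE layer. The sizes, gating scheme, and plain-NumPy experts are all hypothetical simplifications, not Nemotron's actual implementation:

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route input x through only the top-k experts.

    Illustrative sketch: real MoE layers batch tokens, balance expert
    load, and run on GPU, but the routing idea is the same.
    """
    logits = x @ gate_w                     # (num_experts,) gating scores
    topk = np.argsort(logits)[-k:]          # indices of the k highest-scoring experts
    weights = np.exp(logits[topk])
    weights /= weights.sum()                # softmax over the selected experts only
    # Only k expert FFNs execute; the rest stay idle, which is why
    # active parameters are a small fraction of total parameters.
    return sum(w * experts[i](x) for w, i in zip(weights, topk))

rng = np.random.default_rng(0)
d, num_experts = 8, 16
gate_w = rng.normal(size=(d, num_experts))
# Each "expert" is a tiny linear layer with its own weight matrix.
expert_ws = [rng.normal(size=(d, d)) for _ in range(num_experts)]
experts = [lambda x, w=w: x @ w for w in expert_ws]

y = moe_forward(rng.normal(size=d), gate_w, experts, k=2)
print(y.shape)  # (8,)
```

With k=2 of 16 experts active, only 2/16 of the expert weights participate in each forward pass, mirroring how Nemotron 3 Nano activates 3.6B of its 30B parameters per token.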
Nemotron 3 Nano’s hybrid Mamba-Transformer architecture boosts token throughput by up to 4x, allowing the model to reason more quickly while delivering higher accuracy. Its “thinking budget” feature helps avoid unnecessary computation, reducing overthinking and ensuring lower, more predictable inference costs.
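The thinking-budget idea can be sketched as a decode loop that caps reasoning tokens and then switches to answering. This is a pure-Python simulation with a stub token generator; the real feature is implemented server-side, and the token names here are placeholders:

```python
def generate_with_budget(next_token, budget: int, max_tokens: int = 64):
    """Cap reasoning at `budget` tokens, then emit only answer tokens.

    `next_token` is a stand-in for the model's decode step; the actual
    thinking-budget mechanism lives inside the serving engine.
    """
    reasoning, answer = [], []
    thinking = True
    for _ in range(max_tokens):
        tok = next_token()
        if thinking:
            if tok == "</think>" or len(reasoning) >= budget:
                thinking = False  # budget hit: stop reasoning, start answering
                continue
            reasoning.append(tok)
        else:
            answer.append(tok)
            if tok == "<eos>":
                break
    return reasoning, answer

# Stub "model" that would happily reason for 20 tokens without a budget.
stream = iter(["step"] * 20 + ["answer", "<eos>"])
reasoning, answer = generate_with_budget(lambda: next(stream), budget=10)
print(len(reasoning))  # 10
```

Capping the reasoning phase this way is what makes inference cost bounded and predictable regardless of how long the model "wants" to think.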

Nemotron 3 Nano delivers higher throughput with leading accuracy among open reasoning models
Trained on NVIDIA-curated, high-quality data, Nemotron 3 Nano leads on benchmarks such as SWE-bench Verified, GPQA Diamond, AIME 2025, Arena Hard v2, and IFBench, delivering top-tier accuracy in coding, reasoning, math, and instruction following. This makes it ideal for building AI agents for enterprise use cases including finance, cybersecurity, software development, and retail.

Nemotron 3 Nano provides leading accuracy on various popular academic benchmarks among open small reasoning models
Get Started
- Download Nemotron 3 Nano model weights from Hugging Face - BF16, FP8
- Run with SGLang for inference using this cookbook or through this NVIDIA Brev launchable.
Further Reading
- Share your ideas and vote on what matters to help shape the future of Nemotron.
- Stay up to date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, YouTube, and the Nemotron channel on Discord.
Acknowledgement
We thank all contributors for their efforts in developing and integrating Nemotron 3 Nano into SGLang.
NVIDIA Team: Roi Koren, Max Xu, Netanel Haber, Tomer Bar Natan, Daniel Afrimi, Nirmal Kumar Juluru, Ann Guan, and many more
SGLang Team and community: Baizhou Zhang, Jiajun Li, Ke Bao, Mingyi Lu, Richard Chen