Skip to content

Developer Asset Hub for NVIDIA Nemotron — A one-stop resource for training recipes, usage cookbooks, datasets, and full end-to-end reference examples to build with Nemotron models

License

Notifications You must be signed in to change notification settings

NVIDIA-NeMo/Nemotron

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

28 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

NVIDIA Nemotron Developer Repository

Developer companion repo for working with NVIDIA's Nemotron models: inference, fine-tuning, agents, visual reasoning, deployment.

Python 3.10+ License: Apache 2.0 Contributions Welcome


📂 Repo Layout

nemotron/
│
├── usage-cookbook/        Usage cookbooks (how to deploy, and simple model usage guides)
│
│
└── use-case-examples/     Examples of leveraging Nemotron Models in Agentic Workflows and more 

What is Nemotron?

NVIDIA Nemotron™ is a family of open, high-efficiency models with fully transparent training data, weights, and recipes.

Nemotron models are designed for agentic AI workflows — they excel at coding, math, scientific reasoning, tool calling, instruction following, and visual reasoning (for the VL models).

They are optimized for deployment across a spectrum of compute tiers (edge, single GPU, data center) and support frameworks like NeMo and TensorRT-LLM, vLLM, and SGLang, with NIM microservice options for scalable serving.


More Resources


💡 Feature Requests & Ideas

Have an idea for improving Nemotron models? Create an issue and tag it idea!

Your feedback helps shape the future of Nemotron models!


Training Recipes (Coming Soon)

Full, reproducible training pipelines will be included in the nemotron package at src/nemotron/recipes/.

Each Recipe Includes


Model Specific Usage Cookbooks

Learn how to deploy and use the models through an API.

Model Best For Key Features Trade-offs Resources
NVIDIA-Nemotron-3-Nano High-throughput agentic workflows, reasoning, tool-use, chat • 31.6B total / 3.6B active (MoE)
• Hybrid Mamba-Transformer MoE
• 1M-token context window
• Reasoning ON/OFF + thinking budget
Sparse MoE trades total params for efficiency 📁 Cookbooks
Llama-3.3-Nemotron-Super-49B-v1.5 Production deployments needing strong reasoning with efficiency • 128K context
• Single H200 GPU
• RAG & tool calling
• Optimized via NAS
Balances accuracy & throughput 📁 Cookbooks
NVIDIA-Nemotron-Nano-9B-v2 Resource-constrained environments needing flexible reasoning • 9B params
• Hybrid Mamba-2 architecture
• Controllable reasoning traces
• Unified reasoning/non-reasoning
Smaller model with configurable reasoning 📁 Cookbooks
NVIDIA-Nemotron-Nano-12B-v2-VL Document intelligence and video understanding • 12B VLM
• Video & multi-image reasoning
• Controllable reasoning (/think mode)
• Efficient Video Sampling (EVS)
Vision-language with configurable reasoning 📁 Cookbooks
Llama-3.1-Nemotron-Safety-Guard-8B-v3 Multilingual content moderation with cultural nuance • 9 languages
• 23 safety categories
• Cultural sensitivity
• NeMo Guardrails integration
Focused on safety/moderation tasks 📁 Cookbooks
Nemotron-Parse (link coming soon!) Document parsing for RAG and AI agents • VLM for document parsing
• Table extraction (LaTeX)
• Semantic segmentation
• Spatial grounding (bbox)
Specialized for document structure 📁 Cookbooks

Nemotron Use Case Examples

Below is an outline of the end-to-end use case examples provided in the use-case-examples directory. These scenarios demonstrate practical applications that go beyond basic model inference.

What You'll Find

  • Agentic Workflows
    Orchestration of multi-step AI agents, integrating planning, context management, and external tools/APIs.

  • Retrieval-Augmented Generation (RAG) Systems
    Building pipelines that combine retrieval components (vector databases, search APIs) with Nemotron models for grounded, accurate outputs.

  • Integration with External Tools & APIs
    Examples of Nemotron models powering applications with structured tool calling, function execution, or data enrichment.

  • Production-Ready Application Patterns
    Architectures supporting scalability, monitoring, data pipelines, and real-world deployment considerations.

See the use-case-examples/ subfolders for in-depth, runnable examples illustrating these concepts.

Nemotron Open Datasets

More than just weights, recipes, and libraries: Nemotron is commited to opening data across many domains, training phases, and use cases.

Nemotron Data Catalogue

A comprehensive collection of NVIDIA Nemotron datasets spanning pre-training, post-training, reinforcement learning, multimodal, safety, and domain-specific applications. These openly available datasets power the Nemotron family of models for agentic AI development.


Code

Datasets for training code generation, competitive programming, and software engineering capabilities across multiple programming languages.

Dataset Usage License Model(s) Description
Nemotron-CC-Code-v1 Pre-training NVIDIA Data Agreement Nemotron 3 Nano 427.9B tokens from Common Crawl code pages using Lynx + LLM pipeline
Nemotron-Pretraining-Code-v1 Pre-training NVIDIA Data Agreement Nemotron Nano 2 GitHub-sourced code corpus for Nemotron Nano 2
Nemotron-Pretraining-Code-v2 Pre-training NVIDIA Data Agreement Nemotron 3 Nano Updated GitHub code + synthetic QA with STEM reasoning
Nemotron-Cascade-RL-SWE RL Training CC-BY-4.0 Nemotron 3 SWE code repair from SWE-Bench, SWE-Smith, R2E-Gym
Nemotron-Competitive-Programming-v1 SFT CC-BY-4.0 Nemotron 3 2M+ Python and 1M+ C++ samples across 34K competitive programming questions
OpenCodeReasoning SFT CC-BY-4.0 OpenCode-Nemotron 735K Python samples across 28K competitive programming questions
OpenCodeReasoning-2 SFT CC-BY-4.0 OpenCode-Nemotron 2.5M samples (1.4M Python, 1.1M C++) with code completion and critique
Scoring-Verifiers Evaluation CC-BY-4.0 Benchmark for test case generation and code reward models

Math

Mathematical reasoning datasets ranging from pre-training corpora to advanced problem-solving with chain-of-thought and tool-integrated reasoning. Includes the AIMO-2 competition winning dataset.

Dataset Usage License Model(s) Description
Nemotron-CC-Math-v1 Pre-training NVIDIA Data Agreement Nemotron Nano 2, Nemotron 3 Nano 133B-token math dataset from Common Crawl using Lynx + LLM pipeline
Nemotron-Math-Proofs-v1 SFT CC-BY-4.0 Nemotron 3 Nano Mathematical proofs dataset for Nemotron 3 post-training
Nemotron-Math-v2 SFT CC-BY-4.0 Nemotron 3 347K samples and 7M reasoning trajectories for Deeper Math Reasoning
Nemotron-CrossThink RL Training CC-BY-4.0 Nemotron 3 Multi-domain QA with MCQ and open-ended formats for verifiable rewards
OpenMathReasoning SFT CC-BY-4.0 OpenMath-Nemotron 5.68M samples, 306K problems from AoPS with CoT/TIR (AIMO-2 winner)

Science / STEM

Scientific reasoning datasets covering chemistry, physics, and general STEM domains for training models on scientific question answering and reasoning.

Dataset Usage License Model(s) Description
Nemotron-Science-v1 SFT CC-BY-4.0 Nemotron 3 Nano Synthetic science reasoning (MCQA + chemistry RQA)

General / Web

Large-scale web-crawled and curated datasets for pre-training and post-training, including multilingual data and general instruction-following capabilities.

Dataset Usage License Model(s) Description
Nemotron-CC-v2.1 Pre-training NVIDIA Data Agreement Nemotron 3 Nano 2.5T tokens English web data with synthetic rephrases and translations
Nemotron-CC-v2 Pre-training NVIDIA Data Agreement Nemotron Nano 2 6.6T tokens quality-filtered Common Crawl with multilingual Q&A
Nemotron-Pretraining-Dataset-sample Pre-training (Sample) NVIDIA Data Agreement Sample subset of Nemotron pre-training corpus for experimentation
Llama-Nemotron-Post-Training-Dataset SFT + RL CC-BY-4.0 Llama-Nemotron Ultra/Super/Nano Math, code, reasoning data (2.2M math, 500K code)
Nemotron-Post-Training-Dataset-v1 SFT CC-BY-4.0 Llama-3.3-Nemotron-Super-49B-v1.5 Math, code, STEM, tool calling
Nemotron-Post-Training-Dataset-v2 SFT + RL CC-BY-4.0 Llama-Nemotron Multilingual extension (Spanish, French, German, Italian, Japanese)
Nemotron-3-Nano-RL-Training-Blend RL Training CC-BY-4.0 Nemotron-3-Nano-30B-A3B Curated multi-domain blend for Nemotron 3 Nano
Nemotron-RL-knowledge-web_search-mcqa RL Training ODC-BY-1.0 Nemotron 3 Web search and multiple-choice QA tasks for NeMo Gym

Chat / Instruction Following

Datasets for training conversational AI with strong instruction-following capabilities, structured output generation, and multi-turn dialogue.

Dataset Usage License Model(s) Description
Nemotron-Instruction-Following-Chat-v1 SFT CC-BY-4.0 Nemotron 3 Nano Multi-turn chat and structured output generation
Nemotron-RL-instruction_following RL Training ODC-BY-1.0 Nemotron 3 Verifiable instruction adherence from WildChat-1M + Open-Instruct
Nemotron-RL-instruction_following-structured_outputs RL Training ODC-BY-1.0 Nemotron 3 JSON schema-constrained output formatting tests
Nemotron-Cascade-RL-Instruction-Following RL Training ODC-BY-1.0 Nemotron 3 108K samples for instruction-following RL

Agentic / Tool Use

Datasets for training AI agents with tool calling, multi-step workflows, and agentic reasoning capabilities.

Dataset Usage License Model(s) Description
Nemotron-Agentic-v1 SFT CC-BY-4.0 Nemotron 3 Nano Multi-turn trajectories for conversational tool use and agentic workflows
Nemotron-RL-agent-workplace_assistant RL Training ODC-BY-1.0 Nemotron 3 Workplace assistant agent tasks for NeMo Gym

Alignment / Reward Modeling

Human preference and reward modeling datasets for RLHF, SteerLM training, and model alignment. Powers top-performing reward models on RM-Bench and JudgeBench.

Dataset Usage License Model(s) Description
HelpSteer3 Reward Modeling CC-BY-4.0 Nemotron 3 Nano, Llama-Nemotron Super 49B 40K+ samples; top on RM-Bench/JudgeBench with preference, feedback, edit-quality
HelpSteer2 Reward Modeling CC-BY-4.0 Nemotron-4-340B-Reward, Llama-3.1-Nemotron-70B-Reward 21K samples with 5 attributes
HelpSteer SteerLM Training CC-BY-4.0 Nemotron-4 SteerLM 37K samples (helpfulness, correctness, coherence, complexity, verbosity)
Daring-Anteater SFT/RLHF CC-BY-4.0 Nemotron-4-340B-Instruct Instruction tuning dataset; synthetic subsets + FinQA, wikitablequestions
sft_datablend_v1 SFT CC-BY-4.0 SFT data blend for RLHF pipeline

Vision-Language / Multimodal

High-quality VLM training data for document intelligence, OCR, image reasoning, video QA, and chain-of-thought visual understanding.

Dataset Usage License Model(s) Description
Nemotron-VLM-Dataset-v2 VLM Training CC-BY-4.0 (some CC-BY-SA-4.0) Nemotron VLM 8M samples for OCR, image reasoning, video QA with chain-of-thought
Llama-Nemotron-VLM-Dataset-v1 VLM Training CC-BY-4.0 (some CC-BY-SA-4.0) Llama-3.1-Nemotron-Nano-VL-8B 3M samples for visual question answering and captioning

Physical AI / Robotics

Datasets for embodied reasoning, physical common sense, and robotic manipulation. Powers Cosmos-Reason1 for physical AI applications.

Dataset Usage License Model(s) Description
Cosmos-Reason1-SFT-Dataset SFT CC-BY-4.0 Cosmos-Reason1-7B Video-text pairs for robotics, ego-centric demos, AV reasoning
Cosmos-Reason1-RL-Dataset RL Training CC-BY-4.0 Cosmos-Reason1-7B RL data for physical common sense and embodied reasoning
Cosmos-Reason1-Benchmark Evaluation CC-BY-4.0 Benchmark for embodied reasoning (robotics, HoloAssist, AV)
PhysicalAI-Robotics-Manipulation-Augmented Training CC-BY-4.0 1K Franka Panda demos with Cosmos Transfer1 domain augmentation

Autonomous Vehicles

Multi-sensor driving data and synthetic scenarios for training and validating autonomous vehicle systems.

Dataset Usage License Model(s) Description
PhysicalAI-Autonomous-Vehicles Training NVIDIA AV Dataset License 1,700 hours multi-sensor data from 25 countries, 306K clips
PhysicalAI-Autonomous-Vehicle-Cosmos-Drive-Dreams SDG CC-BY-4.0 Cosmos 81K synthetic videos with LiDAR and HD-map annotations
PhysicalAI-Autonomous-Vehicle-Cosmos-Synthetic SDG CC-BY-4.0 Cosmos Cosmos-generated synthetic driving scenarios
PhysicalAI-Autonomous-Vehicles-NuRec Reconstruction NVIDIA AV Dataset License NuScenes-based reconstruction data

Synthetic Personas / Data Generation

Privacy-safe synthetic personas grounded in real-world demographics for sovereign AI development and synthetic data generation pipelines.

Dataset Usage License Model(s) Description
Nemotron-Personas-USA SDG CC-BY-4.0 NeMo Data Designer 1M US personas grounded in Census demographics
Nemotron-Personas-Japan SDG CC-BY-4.0 NeMo Data Designer 1M Japanese personas aligned with regional statistics
Nemotron-Personas-India SDG CC-BY-4.0 NeMo Data Designer 3M Indian personas for sovereign AI development
Nemotron-Personas SDG CC-BY-4.0 NeMo Data Designer 100K US personas with 22 fields aligned to Census data

Privacy / PII Detection

Synthetic datasets for training named entity recognition models to detect and redact personally identifiable information.

Dataset Usage License Model(s) Description
Nemotron-PII NER Training CC-BY-4.0 GLiNER-PII 100K synthetic records with 55+ PII/PHI entity types

Safety / Content Moderation

Content safety datasets for training guardrail models covering comprehensive risk taxonomies. Powers NemoGuard content safety models.

Dataset Usage License Model(s) Description
Aegis-AI-Content-Safety-Dataset-1.0 Content Moderation CC-BY-4.0 NemoGuard Permissive/Defensive 11K annotated interactions covering 13 risk categories
Aegis-AI-Content-Safety-Dataset-2.0 Content Moderation CC-BY-4.0 Llama-3.1-NemoGuard-8B-ContentSafety Extended safety dataset with 23 violation categories
Nemotron-Content-Safety-Audio-Dataset Audio Safety CC-BY-4.0 1.9K audio files from Aegis 2.0 with accent diversity

RAG / Conversational QA

Training and evaluation data for retrieval-augmented generation and conversational question answering. Powers ChatQA models.

Dataset Usage License Model(s) Description
ChatRAG-Bench Evaluation Other (derived) Benchmark across 10 datasets for document QA and unanswerable detection
ChatQA-Training-Data SFT Other (derived) ChatQA-1.5 Training data for ChatQA models from multiple sources
ChatQA2-Long-SFT-data SFT Other (derived) ChatQA-2 128K long-context training data for ChatQA-2

Biology / Drug Discovery

Protein sequence data for training biological foundation models.

Dataset Usage License Model(s) Description
esm2_uniref_pretraining_data Pre-training CC-BY-4.0 ESM2-nv 188M protein sequences for ESM2

3D / Spatial Intelligence

Testing and synthetic data for 3D reconstruction, video generation, and spatial understanding models.

Dataset Usage License Model(s) Description
Lyra-Testing-Example Evaluation CC-BY-4.0 Lyra Testing examples for Lyra generative 3D reconstruction
PhysicalAI-SpatialIntelligence-Lyra-SDG SDG CC-BY-4.0 Lyra Synthetic data for spatial intelligence models
GEN3C-Testing-Example Evaluation CC-BY-4.0 GEN3C Testing examples for GEN3C video generation
ChronoEdit-Example-Dataset Evaluation CC-BY-4.0 ChronoEdit Temporal reasoning examples for image editing

Contributing

We welcome contributions! Whether it's examples, recipes, or other tools you'd find useful.

Please read our Contributing Guidelines before submitting pull requests.

Documentation


License

Apache 2.0 License - see LICENSE file for details.


NVIDIA Nemotron - Open, transparent, and reproducible.

About

Developer Asset Hub for NVIDIA Nemotron — A one-stop resource for training recipes, usage cookbooks, datasets, and full end-to-end reference examples to build with Nemotron models

Resources

License

Contributing

Stars

Watchers

Forks

Contributors 10