Commit 3e65e9d

doc trl integration
Signed-off-by: cmunley1 <cmunley@nvidia.com>
1 parent 8a9332c · commit 3e65e9d

File tree: 4 files changed (+299, −251 lines)

docs/contribute/rl-framework-integration/index.md

Lines changed: 14 additions & 1 deletion
@@ -8,9 +8,22 @@ These guides cover how to integrate NeMo Gym into a new RL training framework. U
 - Contributing NeMo Gym integration for a training framework that does not have one yet
 
 :::{tip}
-Just want to train models? Use {ref}`NeMo RL <training-nemo-rl-grpo-index>` instead.
+Just want to train models? See existing integrations:
+- {ref}`NeMo RL <training-nemo-rl-grpo-index>` - Multi-step and multi-turn RL training at scale
+- {ref}`TRL (Hugging Face) <training-trl>` - GRPO training with distributed training support
+- {ref}`Unsloth <training-unsloth>` - Fast, memory-efficient training for single-step tasks
 :::
 
+## Existing Integrations
+
+NeMo Gym currently integrates with the following RL training frameworks:
+
+**[NeMo RL](https://github.com/NVIDIA-NeMo/RL)**: NVIDIA's RL training framework, purpose-built for large-scale frontier model training. Provides full support for multi-step and multi-turn environments with production-grade distributed training capabilities.
+
+**[TRL](https://github.com/huggingface/trl)**: Hugging Face's transformer reinforcement learning library. Supports GRPO with single and multi-turn NeMo Gym environments using vLLM generation, multi-environment training, and distributed training via Accelerate and DeepSpeed. See the {ref}`TRL tutorial <training-trl>` for usage examples.
+
+**[Unsloth](https://github.com/unslothai/unsloth)**: Fast, memory-efficient fine-tuning library. Supports optimized GRPO with single and multi-turn NeMo Gym environments including low precision, parameter-efficient fine-tuning, and training in notebook environments. See the {ref}`Unsloth tutorial <training-unsloth>` for getting started.
+
 ## Prerequisites
 
 Before integrating Gym into your training framework, ensure you have:
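The new TRL entry above describes GRPO training with vLLM generation and distributed training via Accelerate and DeepSpeed. For orientation only, a minimal TRL GRPO loop looks roughly like the sketch below; the model id, dataset, and `reward_from_gym_env` function are hypothetical placeholders standing in for what the NeMo Gym integration provides, and the actual wiring is covered in the linked TRL tutorial.

```python
# Minimal TRL GRPO sketch for orientation only -- not the actual NeMo Gym wiring.
# Assumes `trl` and `datasets` are installed; names below are placeholders.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

def reward_from_gym_env(completions, **kwargs):
    """Hypothetical stand-in for a NeMo Gym verifier scoring each completion."""
    return [1.0 if "42" in completion else 0.0 for completion in completions]

train_dataset = Dataset.from_list(
    [{"prompt": "What is 6 * 7? Reply with just the number."}] * 64
)

args = GRPOConfig(
    output_dir="grpo-sketch",
    per_device_train_batch_size=8,
    num_generations=8,   # completions sampled per prompt for the GRPO group
    use_vllm=False,      # the documented integration generates with vLLM
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",   # placeholder model id
    reward_funcs=reward_from_gym_env,
    args=args,
    train_dataset=train_dataset,
)
trainer.train()
```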

docs/index.md

Lines changed: 25 additions & 245 deletions
@@ -101,197 +101,51 @@ Detailed walkthrough of running your first training environment.
 :::{grid-item-card} {octicon}`iterations;1.5em;sd-mr-1` Rollout Collection
 :link: get-started/rollout-collection
 :link-type: doc
-Collect and view rollouts.
+Collect and view rollouts
 +++
 {bdg-secondary}`rollouts` {bdg-secondary}`training-data`
 :::
 
-:::{grid-item-card} {octicon}`play;1.5em;sd-mr-1` First Training Run
-:link: get-started/first-training-run
-:link-type: doc
-Train your first model using collected rollouts.
-+++
-{bdg-secondary}`training` {bdg-secondary}`grpo`
-:::
-
-::::
-
-## Server Components
-
-Configure and customize the three server components of a training environment.
-
-::::{grid} 1 2 2 2
-:gutter: 1 1 1 2
-
-:::{grid-item-card} {octicon}`cpu;1.5em;sd-mr-1` Model Server
-:link: model-server/index
-:link-type: doc
-Configure LLM inference backends: vLLM, OpenAI, Azure.
-+++
-{bdg-secondary}`inference` {bdg-secondary}`vllm` {bdg-secondary}`openai`
-:::
-
-:::{grid-item-card} {octicon}`tools;1.5em;sd-mr-1` Resources Server
-:link: resources-server/index
-:link-type: doc
-Define tasks, tools, and verification logic.
-+++
-{bdg-secondary}`tools` {bdg-secondary}`verification`
-:::
-
-:::{grid-item-card} {octicon}`workflow;1.5em;sd-mr-1` Agent Server
-:link: agent-server/index
-:link-type: doc
-Orchestrate rollout lifecycle and tool calling.
-+++
-{bdg-secondary}`agents` {bdg-secondary}`orchestration`
-:::
-
-:::{grid-item-card} {octicon}`database;1.5em;sd-mr-1` Data
-:link: data/index
-:link-type: doc
-Prepare and validate training datasets.
-+++
-{bdg-secondary}`datasets` {bdg-secondary}`jsonl`
-:::
-
 ::::
 
-## Environment Tutorials
+<!-- This section needs to match the content in docs/tutorials/index.md -->
+## Tutorials
 
-Learn how to build custom training environments for various RL scenarios.
+Hands-on tutorials to build and customize your training environments.
 
 ::::{grid} 1 2 2 2
 :gutter: 1 1 1 2
 
-:::{grid-item-card} {octicon}`plus-circle;1.5em;sd-mr-1` Creating Environments
-:link: environment-tutorials/creating-training-environment
-:link-type: doc
-Build a complete training environment from scratch.
-+++
-{bdg-primary}`beginner` {bdg-secondary}`foundational`
-:::
-
-:::{grid-item-card} {octicon}`iterations;1.5em;sd-mr-1` Multi-Step
-:link: environment-tutorials/multi-step
-:link-type: doc
-Sequential tool calling workflows.
-+++
-{bdg-secondary}`multi-step` {bdg-secondary}`tools`
-:::
-
-:::{grid-item-card} {octicon}`comment-discussion;1.5em;sd-mr-1` Multi-Turn
-:link: environment-tutorials/multi-turn
+:::{grid-item-card} {octicon}`tools;1.5em;sd-mr-1` Build a Resource Server
+:link: tutorials/creating-resource-server
 :link-type: doc
-Conversational training environments.
+Implement or integrate existing tools and define task verification logic.
 +++
-{bdg-secondary}`multi-turn` {bdg-secondary}`dialogue`
+{bdg-primary}`beginner` {bdg-secondary}`30 min` {bdg-secondary}`custom-environments` {bdg-secondary}`tools`
 :::
 
-:::{grid-item-card} {octicon}`law;1.5em;sd-mr-1` LLM-as-a-Judge
-:link: environment-tutorials/llm-as-judge
-:link-type: doc
-LLM-based response verification.
+:::{grid-item-card} {octicon}`workflow;1.5em;sd-mr-1` Offline Training with Rollouts
+:link: offline-training-w-rollouts
+:link-type: ref
+Transform rollouts into training data for {term}`supervised fine-tuning (SFT) <SFT (Supervised Fine-Tuning)>` and {term}`direct preference optimization (DPO) <DPO (Direct Preference Optimization)>`.
 +++
-{bdg-secondary}`verification` {bdg-secondary}`llm-judge`
+{bdg-secondary}`sft` {bdg-secondary}`dpo`
 :::
 
-::::
-
-```{button-ref} environment-tutorials/index
-:ref-type: doc
-:color: secondary
-:class: sd-rounded-pill
-
-View all environment tutorials →
-```
-
-## Training Tutorials
-
-Train models using NeMo Gym with various RL frameworks.
-
-::::{grid} 1 2 2 2
-:gutter: 1 1 1 2
-
-:::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` NeMo RL with GRPO
+:::{grid-item-card} {octicon}`workflow;1.5em;sd-mr-1` GRPO with NeMo RL
 :link: training-nemo-rl-grpo-index
 :link-type: ref
-Multi-node GRPO training for production workloads.
+Learn how to set up NeMo Gym and NeMo RL training environments, run tests, prepare data, and launch single-node and multi-node training runs.
 +++
-{bdg-primary}`recommended` {bdg-secondary}`grpo` {bdg-secondary}`multi-node`
+{bdg-primary}`training` {bdg-secondary}`rl` {bdg-secondary}`grpo` {bdg-secondary}`multi-step`
 :::
 
 :::{grid-item-card} {octicon}`zap;1.5em;sd-mr-1` Unsloth
 :link: training-unsloth
 :link-type: ref
-Fast, memory-efficient fine-tuning on single GPU.
-+++
-{bdg-secondary}`unsloth` {bdg-secondary}`efficient`
-:::
-
-:::{grid-item-card} {octicon}`package;1.5em;sd-mr-1` TRL
-:link: training-tutorials/trl
-:link-type: doc
-HuggingFace TRL integration for PPO and DPO.
-+++
-{bdg-secondary}`trl` {bdg-secondary}`huggingface`
-:::
-
-:::{grid-item-card} {octicon}`server;1.5em;sd-mr-1` VeRL
-:link: training-tutorials/verl
-:link-type: doc
-VeRL framework for research workflows.
-+++
-{bdg-secondary}`verl` {bdg-secondary}`research`
-:::
-
-:::{grid-item-card} {octicon}`gear;1.5em;sd-mr-1` NeMo Customizer
-:link: training-tutorials/nemo-customizer
-:link-type: doc
-Enterprise training with NeMo Customizer.
+Fast, memory-efficient fine-tuning for single-step tasks: math, structured outputs, instruction following, reasoning gym and more.
 +++
-{bdg-secondary}`nemo-customizer` {bdg-secondary}`enterprise`
-:::
-
-:::{grid-item-card} {octicon}`file;1.5em;sd-mr-1` Offline Training
-:link: offline-training-w-rollouts
-:link-type: ref
-SFT and DPO from collected rollouts.
-+++
-{bdg-secondary}`sft` {bdg-secondary}`dpo`
-:::
-
-::::
-
-```{button-ref} training-tutorials/index
-:ref-type: doc
-:color: secondary
-:class: sd-rounded-pill
-
-View all training tutorials →
-```
-
-## Infrastructure
-
-Deploy and scale NeMo Gym for production workloads.
-
-::::{grid} 1 2 2 2
-:gutter: 1 1 1 2
-
-:::{grid-item-card} {octicon}`server;1.5em;sd-mr-1` Deployment Topology
-:link: infrastructure/deployment-topology
-:link-type: doc
-Production deployment patterns and configurations.
-+++
-{bdg-secondary}`deployment` {bdg-secondary}`topology`
-:::
-
-:::{grid-item-card} {octicon}`broadcast;1.5em;sd-mr-1` Distributed Computing with Ray
-:link: infrastructure/ray-distributed
-:link-type: doc
-Scale with Ray clusters for high-throughput rollout collection.
-+++
-{bdg-secondary}`ray` {bdg-secondary}`distributed`
+{bdg-primary}`training` {bdg-secondary}`unsloth` {bdg-secondary}`single-step`
 :::
 
 ::::
@@ -335,8 +189,6 @@ Home <self>
 
 Overview <about/index.md>
 Concepts <about/concepts/index>
-🟡 Architecture <about/architecture>
-🟡 Performance <about/performance>
 Ecosystem <about/ecosystem>
 ```
 
@@ -348,91 +200,19 @@ Ecosystem <about/ecosystem>
 Quickstart <get-started/index>
 Detailed Setup Guide <get-started/detailed-setup.md>
 Rollout Collection <get-started/rollout-collection.md>
-🟡 First Training Run <get-started/first-training-run.md>
-```
-
-```{toctree}
-:caption: Model Server
-:hidden:
-:maxdepth: 1
-
-🟡 Overview <model-server/index>
-🟡 vLLM <model-server/vllm>
-🟡 OpenAI <model-server/openai>
-🟡 Azure OpenAI <model-server/azure-openai>
-🟡 Responses API <model-server/responses-native>
-```
-
-```{toctree}
-:caption: Resources Server
-:hidden:
-:maxdepth: 1
-
-🟡 Overview <resources-server/index>
-🟡 Integrate Python Tools <resources-server/integrate-python-tools>
-🟡 Integrate APIs <resources-server/integrate-apis>
-🟡 Containerize <resources-server/containerize>
-🟡 Profile <resources-server/profile>
-```
-
-```{toctree}
-:caption: Agent Server
-:hidden:
-:maxdepth: 1
-
-🟡 Overview <agent-server/index>
-🟡 Integrate Agents <agent-server/integrate-agents/index>
-```
-
-```{toctree}
-:caption: Data
-:hidden:
-:maxdepth: 1
-
-🟡 Overview <data/index>
-🟡 Prepare and Validate <data/prepare-validate>
-🟡 Download from Hugging Face <data/download-huggingface>
-```
-
-```{toctree}
-:caption: Environment Tutorials
-:hidden:
-:maxdepth: 1
-
-🟡 Overview <environment-tutorials/index>
-🟡 Creating Training Environment <environment-tutorials/creating-training-environment>
-🟡 Multi-Step <environment-tutorials/multi-step>
-🟡 Multi-Turn <environment-tutorials/multi-turn>
-🟡 User Modeling <environment-tutorials/user-modeling>
-🟡 Multi-Node Docker <environment-tutorials/multi-node-docker>
-🟡 LLM as Judge <environment-tutorials/llm-as-judge>
-🟡 RLHF Reward Models <environment-tutorials/rlhf-reward-models>
-```
-
-```{toctree}
-:caption: Training Tutorials
-:hidden:
-:maxdepth: 1
-
-🟡 Overview <training-tutorials/index>
-🟡 Nemotron Nano <training-tutorials/nemotron-nano>
-🟡 Nemotron Super <training-tutorials/nemotron-super>
-NeMo RL GRPO <tutorials/nemo-rl-grpo/index.md>
-Unsloth Training <tutorials/unsloth-training>
-🟡 TRL <training-tutorials/trl>
-🟡 VERL <training-tutorials/verl>
-🟡 NeMo Customizer <training-tutorials/nemo-customizer>
-Offline Training <tutorials/offline-training-w-rollouts>
 ```
 
 ```{toctree}
-:caption: Infrastructure
+:caption: Tutorials
 :hidden:
 :maxdepth: 1
 
-🟡 Overview <infrastructure/index>
-🟡 Deployment Topology <infrastructure/deployment-topology>
-🟡 Ray Distributed <infrastructure/ray-distributed>
+tutorials/index.md
+tutorials/creating-resource-server
+tutorials/offline-training-w-rollouts
+tutorials/nemo-rl-grpo/index.md
+tutorials/trl-training
+tutorials/unsloth-training
 ```
 
 ```{toctree}

docs/tutorials/index.md

Lines changed: 9 additions & 5 deletions
@@ -1,7 +1,3 @@
----
-orphan: true
----
-
 (tutorials-index)=
 
 # NeMo Gym Tutorials
@@ -64,10 +60,18 @@ Learn how to set up NeMo Gym and NeMo RL training environments, run tests, prepa
 {bdg-primary}`training` {bdg-secondary}`rl` {bdg-secondary}`grpo` {bdg-secondary}`multi-step`
 :::
 
+:::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` TRL (Hugging Face)
+:link: training-trl
+:link-type: ref
+Train models using Hugging Face TRL with GRPO in NeMo Gym environments. Supports multi-step tool calling, multi-environment and distributed training.
++++
+{bdg-primary}`training` {bdg-secondary}`trl` {bdg-secondary}`grpo` {bdg-secondary}`multi-step`
+:::
+
 :::{grid-item-card} {octicon}`zap;1.5em;sd-mr-1` Unsloth
 :link: training-unsloth
 :link-type: ref
-Fast, memory-efficient fine-tuning for single-step tasks: math, structured outputs, instruction following, reasoning gym and more.
+Fast, memory-efficient GRPO in NeMo-Gym environments, including multi-step tool calling and multi-environment training.
 +++
 {bdg-primary}`training` {bdg-secondary}`unsloth` {bdg-secondary}`single-step`
 :::
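The updated Unsloth card advertises memory-efficient GRPO with low-precision loading and parameter-efficient fine-tuning. As a rough sketch of that setup, assuming Unsloth's published `FastLanguageModel` API and not the tutorial's exact values (the model id and LoRA hyperparameters are illustrative):

```python
# Illustrative Unsloth setup, not the tutorial's exact configuration.
# Loads a 4-bit base model and attaches LoRA adapters; the resulting model can
# then be handed to TRL's GRPOTrainer with a NeMo Gym-backed reward function.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-3B-Instruct",  # placeholder model id
    max_seq_length=2048,
    load_in_4bit=True,   # low-precision loading to fit a single GPU
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                # LoRA rank (illustrative)
    lora_alpha=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_gradient_checkpointing="unsloth",
)
```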
