12 changes: 7 additions & 5 deletions docs/index.md
content_type: index

# NeMo AutoModel Documentation

PyTorch-native training that scales from 1 GPU to thousands with a single config change. Load any Hugging Face model, point at your data, and start training; no checkpoint conversion and no boilerplate.

**Quick links:** [🤗 HF Compatible](guides/huggingface-api-compatibility.md) | [🚀 Performance](performance-summary.md) | [📐 Scalability](about/key-features.md) | [🎯 SFT & PEFT](guides/llm/finetune.md) | [🎨 Diffusion](guides/diffusion/finetune.md) | [👁️ VLM](guides/vlm/gemma4.md)

::::{grid} 2 2 2 2
New models are added regularly. Pick a model below to start fine-tuning, or see

## Recipes & Guides

Find the right guide for your task: fine-tuning, pretraining, distillation, diffusion, and more.

| I want to... | Choose this when... | Input Data | Model | Guide |
| --------------------------- | ----------------------------------------------------------------------------------- | ------------------------------------------------- | --------- | --------------------------------------------------------- |
| **SFT (full fine-tune)** | You need maximum accuracy and have the GPU budget to update all weights | Instruction / chat dataset | LLM | [Start fine-tuning](guides/llm/finetune.md) |
| **PEFT (LoRA)**             | You want to fine-tune on limited GPU memory; updates <1% of parameters               | Instruction / chat dataset                         | LLM       | [Start LoRA](guides/llm/finetune.md)                       |
| **Tool / function calling** | Your model needs to call APIs or tools with structured arguments | Function-calling dataset (queries + tool schemas) | LLM | [Add tool calling](guides/llm/toolcalling.md) |
| **Fine-tune VLM** | Your task involves both images and text (e.g., visual QA, captioning) | Image + text dataset | VLM | [Fine-tune VLM](guides/omni/gemma3-3n.md) |
| **Fine-tune Gemma 4** | You want to fine-tune Gemma 4 for structured extraction from images (e.g., receipts) | Image + text dataset | VLM | [Fine-tune Gemma 4](guides/vlm/gemma4.md) |
See the [full benchmark results](performance-summary.md) for configuration details.

## Advanced Topics

Parallelism, precision, checkpointing strategies, and experiment tracking.

::::{grid} 1 2 2 3
:gutter: 1 1 1 2
performance-summary.md
Overview <model-coverage/overview.md>
Release Log <model-coverage/latest-models.md>
Large Language Models <model-coverage/llm/index.md>
Vision Language Models <model-coverage/vlm/index.md>
Omni <model-coverage/omni/index.md>
Diffusion <model-coverage/diffusion/index.md>
Embedding Models <model-coverage/embedding/index.md>
Reranking Models <model-coverage/reranker/index.md>
::::

::::{toctree}
53 changes: 53 additions & 0 deletions docs/model-coverage/embedding/index.md
(embedding-models)=

# Embedding Models

## Introduction

Embedding models convert text into dense vector representations for semantic search, dense retrieval, retrieval-augmented generation (RAG), and classification. NeMo AutoModel supports optimized bidirectional Llama bi-encoders and falls back to Hugging Face `AutoModel` for other encoder backbones.

For cross-encoder pairwise scoring, see [Reranking Models](../reranker/index.md).

Embedding models use bi-encoders to produce dense representations for queries and documents independently. They are the standard path for embedding generation and first-stage dense retrieval.
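
The independence property described above can be sketched in a few lines: queries and documents are embedded separately, and relevance reduces to a vector similarity. The embeddings below are toy stand-ins for the output of any embedding model:

```python
import torch
import torch.nn.functional as F

def rank_documents(query_emb: torch.Tensor, doc_embs: torch.Tensor) -> torch.Tensor:
    """Rank documents by cosine similarity to the query.

    query_emb: (dim,) embedding of the query.
    doc_embs:  (num_docs, dim) document embeddings, computed independently
               of the query -- the defining property of a bi-encoder.
    Returns document indices sorted from most to least similar.
    """
    sims = F.cosine_similarity(query_emb.unsqueeze(0), doc_embs, dim=-1)
    return torch.argsort(sims, descending=True)

# Toy embeddings: doc 1 points in the same direction as the query.
query = torch.tensor([1.0, 0.0])
docs = torch.stack([
    torch.tensor([0.0, 1.0]),   # orthogonal to the query
    torch.tensor([2.0, 0.0]),   # same direction
    torch.tensor([-1.0, 0.0]),  # opposite direction
])
ranking = rank_documents(query, docs)  # doc 1 first, doc 2 last
```

Because document embeddings do not depend on the query, they can be precomputed once and indexed, which is what makes bi-encoders suitable for first-stage retrieval.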

### Optimized Backbones (Bidirectional Attention)

| Owner | Model | Architecture | Auto Class | Tasks |
|---|---|---|---|---|
| Meta | [Llama (Bidirectional)](meta/llama-bidirectional.md) | `LlamaBidirectionalModel` | [`NeMoAutoModelBiEncoder`](https://github.com/NVIDIA-NeMo/Automodel/blob/8dc00dcb4a35c2413c52c6e7eb7ac8f1c24836aa/nemo_automodel/_transformers/auto_model.py#L991) | Embedding, Dense Retrieval |
| NVIDIA | [Llama-Embed-Nemotron-8B](nvidia/llama-embed-nemotron-8b.md) | `LlamaBidirectionalModel` | [`NeMoAutoModelBiEncoder`](https://github.com/NVIDIA-NeMo/Automodel/blob/8dc00dcb4a35c2413c52c6e7eb7ac8f1c24836aa/nemo_automodel/_transformers/auto_model.py#L991) | Embedding, Dense Retrieval |

### Hugging Face Auto Backbones

Any Hugging Face model that can be loaded with `AutoModel` can be used as an embedding backbone. This fallback path uses the model's native attention; no bidirectional conversion is applied.

## Example Recipes

| Recipe | Description |
|---|---|
| {download}`llama3_2_1b.yaml <../../../examples/retrieval/bi_encoder/llama3_2_1b.yaml>` | Bi-encoder — Llama 3.2 1B embedding model |
| {download}`llama_embed_nemotron_8b.yaml <../../../examples/retrieval/bi_encoder/llama_embed_nemotron_8b/llama_embed_nemotron_8b.yaml>` | Bi-encoder — Llama-Embed-Nemotron-8B reproduction recipe |

## Supported Workflows

- **Fine-tuning (Bi-Encoder):** Contrastive learning on query-document pairs to produce embedding models
- **LoRA/PEFT:** Parameter-efficient fine-tuning for embedding backbones
- **ONNX Export:** Export trained embedding models for deployment (supported case by case; model-dependent)
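
The contrastive objective listed above is, in its simplest in-batch-negatives form, a cross-entropy over a similarity matrix. The sketch below illustrates the idea only; it is not the NeMo AutoModel implementation, and the temperature value is illustrative:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q: torch.Tensor, d: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE loss with in-batch negatives.

    q: (batch, dim) query embeddings.
    d: (batch, dim) embeddings of each query's positive document.
    For query i, document i is the positive; every other document
    in the batch serves as a negative.
    """
    q = F.normalize(q, dim=-1)
    d = F.normalize(d, dim=-1)
    logits = q @ d.T / temperature    # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0))  # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

torch.manual_seed(0)
q = torch.randn(8, 32)
loss = in_batch_contrastive_loss(q, q.clone())  # perfectly aligned pairs -> near-zero loss
```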

## Dataset

Retrieval fine-tuning requires query-document pairs: each example is a query paired with one positive document and one or more negative documents. Both inline JSONL and corpus ID-based JSON formats are supported. See the [Retrieval Dataset](../../guides/llm/retrieval-dataset.md) guide.

<!--
@akoumpa: uncomment this when finetune guide is published.
## Train Embedding Models

For a complete walkthrough of training configuration, model-specific settings, and launch commands, see the [Embedding and Reranking Fine-Tuning Guide](../../guides/retrieval/finetune.md).
-->

```{toctree}
:hidden:

meta/llama-bidirectional
nvidia/llama-embed-nemotron-8b
```
112 changes: 112 additions & 0 deletions docs/model-coverage/embedding/meta/llama-bidirectional.md
# Llama (Bidirectional) for Embedding

NeMo AutoModel provides a bidirectional variant of [Meta's Llama](https://www.llama.com/) for embedding and dense retrieval tasks. Unlike the standard causal (left-to-right) Llama used for text generation, this variant uses non-causal **bidirectional attention**, so each token can attend to both past and future tokens in the sequence, producing richer representations for semantic similarity and dense retrieval.
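
The difference from causal Llama can be illustrated with the attention masks themselves. This is a conceptual sketch of the two masking schemes, not the library's internal implementation:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """Lower-triangular mask: token i may attend only to tokens j <= i."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def bidirectional_mask(seq_len: int) -> torch.Tensor:
    """Full mask: every token attends to every position, past and future."""
    return torch.ones(seq_len, seq_len, dtype=torch.bool)

c = causal_mask(3)
b = bidirectional_mask(3)
# Under causal attention, token 0 cannot see token 2; bidirectionally it can.
assert not c[0, 2] and b[0, 2]
```

The extra context from future tokens is what makes the bidirectional variant better suited to producing a single representation of a whole passage.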

For the cross-encoder variant, see [Llama (Bidirectional) for Reranking](../../reranker/meta/llama-bidirectional.md).
For the NVIDIA model page, see [Llama-Embed-Nemotron-8B](../nvidia/llama-embed-nemotron-8b.md).

:::{card}
| | |
|---|---|
| **Tasks** | Embedding, Dense Retrieval |
| **Architecture** | `LlamaBidirectionalModel` |
| **Parameters** | 1B – 8B |
| **HF Org** | [meta-llama](https://huggingface.co/meta-llama) |
:::

## Available Models

Any Llama checkpoint can be loaded as a bidirectional backbone. The following configurations are tested:

- **Llama 3.2 1B** — fast iteration, fits on a single GPU
- **Llama 3.1 8B** — higher-quality embeddings for production use

## Embedding Models

The bidirectional bi-encoder path is used for embedding generation and dense retrieval.

| Architecture | Task | Auto Class | Description |
|---|---|---|---|
| `LlamaBidirectionalModel` | Embedding | [`NeMoAutoModelBiEncoder`](https://github.com/NVIDIA-NeMo/Automodel/blob/8dc00dcb4a35c2413c52c6e7eb7ac8f1c24836aa/nemo_automodel/_transformers/auto_model.py#L991) | Bidirectional Llama with pooling for dense embeddings |

## Pooling Strategies

The bi-encoder supports multiple pooling strategies to aggregate token representations into a single embedding vector:

| Strategy | Description |
|---|---|
| `avg` | Average of all token hidden states (default) |
| `cls` | First token hidden state |
| `last` | Last non-padding token hidden state |
| `weighted_avg` | Weighted average of token hidden states |
| `colbert` | No pooling — token-level embeddings (ColBERT-style) |
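
The mask-aware strategies above can be sketched as follows. This is a simplified illustration of the pooling math, assuming right-padded sequences; it is not the library's internal code:

```python
import torch

def pool(hidden: torch.Tensor, mask: torch.Tensor, strategy: str = "avg") -> torch.Tensor:
    """Aggregate (batch, seq, dim) token states into (batch, dim) embeddings.

    mask is (batch, seq) with 1 for real tokens and 0 for padding.
    """
    if strategy == "avg":
        m = mask.unsqueeze(-1).float()
        # Sum only real tokens, then divide by the real-token count.
        return (hidden * m).sum(dim=1) / m.sum(dim=1).clamp(min=1e-9)
    if strategy == "cls":
        return hidden[:, 0]                # first token's hidden state
    if strategy == "last":
        idx = mask.sum(dim=1) - 1          # index of last non-padding token
        return hidden[torch.arange(hidden.size(0)), idx]
    raise ValueError(f"unknown strategy: {strategy}")

h = torch.randn(2, 4, 8)                   # batch=2, seq=4, dim=8
mask = torch.tensor([[1, 1, 1, 0],         # first sequence has one pad token
                     [1, 1, 1, 1]])
emb = pool(h, mask, "avg")                 # padding is excluded from the average
```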

## Example HF Models

| Model | HF ID |
|---|---|
| Llama 3.2 1B | [`meta-llama/Llama-3.2-1B`](https://huggingface.co/meta-llama/Llama-3.2-1B) |
| Llama 3.1 8B | [`meta-llama/Llama-3.1-8B`](https://huggingface.co/meta-llama/Llama-3.1-8B) |

## Example Recipes

| Recipe | Description |
|---|---|
| {download}`llama3_2_1b.yaml <../../../../examples/retrieval/bi_encoder/llama3_2_1b.yaml>` | Bi-encoder — Llama 3.2 1B embedding model |

## Try with NeMo AutoModel

**1. Install NeMo AutoModel**. Refer to the [Installation Guide](../../../guides/installation.md) for details:

```bash
uv pip install nemo-automodel
```

**2. Clone the repo** to get the example recipes:

```bash
git clone https://github.com/NVIDIA-NeMo/Automodel.git
cd Automodel
```

**3. Run the recipe** from inside the repo:

```bash
automodel examples/retrieval/bi_encoder/llama3_2_1b.yaml --nproc-per-node 8
```

:::{dropdown} Run with Docker
**1. Pull the container** and mount a checkpoint directory:

```bash
docker run --gpus all -it --rm \
--shm-size=8g \
-v $(pwd)/checkpoints:/opt/Automodel/checkpoints \
nvcr.io/nvidia/nemo-automodel:26.02.00
```

**2. Navigate** to the AutoModel directory (where the recipes are):

```bash
cd /opt/Automodel
```

**3. Run the recipe**:

```bash
automodel examples/retrieval/bi_encoder/llama3_2_1b.yaml --nproc-per-node 8
```
:::

See the [Installation Guide](../../../guides/installation.md).

<!-- TODO: uncomment when finetune guide is published.
## Fine-Tuning

See the [Embedding and Reranking Fine-Tuning Guide](../../../guides/retrieval/finetune.md) for bi-encoder training instructions, including LoRA and PEFT configuration.
-->

## Hugging Face Model Cards

- [meta-llama/Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B)
- [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B)
75 changes: 75 additions & 0 deletions docs/model-coverage/embedding/nvidia/llama-embed-nemotron-8b.md
# Llama-Embed-Nemotron-8B

[Llama-Embed-Nemotron-8B](https://huggingface.co/nvidia/llama-embed-nemotron-8b) is NVIDIA's text embedding model for retrieval, semantic similarity, classification, and multilingual retrieval workloads. In NeMo AutoModel, it is reproduced with the bidirectional Llama bi-encoder backbone.

For architecture-level details such as bidirectional attention and pooling strategies, see [Llama (Bidirectional)](../meta/llama-bidirectional.md).

:::{card}
| | |
|---|---|
| **Task** | Embedding, Dense Retrieval |
| **Architecture** | `LlamaBidirectionalModel` |
| **Parameters** | 8B |
| **HF Org** | [nvidia](https://huggingface.co/nvidia) |
:::

## Available Models

- **Llama-Embed-Nemotron-8B**

## Architecture

- `LlamaBidirectionalModel`

## Example HF Models

| Model | HF ID |
|---|---|
| Llama-Embed-Nemotron-8B | [`nvidia/llama-embed-nemotron-8b`](https://huggingface.co/nvidia/llama-embed-nemotron-8b) |

## Example Recipes

| Recipe | Description |
|---|---|
| {download}`llama_embed_nemotron_8b.yaml <../../../../examples/retrieval/bi_encoder/llama_embed_nemotron_8b/llama_embed_nemotron_8b.yaml>` | Bi-encoder — reproduction recipe for Llama-Embed-Nemotron-8B |

## Try with NeMo AutoModel

**1. Install NeMo AutoModel**. Refer to the [Installation Guide](../../../guides/installation.md) for details:

```bash
uv pip install nemo-automodel
```

**2. Clone the repo** to get the example recipes:

```bash
git clone https://github.com/NVIDIA-NeMo/Automodel.git
cd Automodel
```

**3. Prepare the dataset** used by the reproduction recipe:

```bash
uv run python examples/retrieval/bi_encoder/llama_embed_nemotron_8b/data_preparation.py \
--download-path ./embed_nemotron_dataset_v1
```

**4. Run the recipe** from inside the repo:

```bash
automodel examples/retrieval/bi_encoder/llama_embed_nemotron_8b/llama_embed_nemotron_8b.yaml --nproc-per-node 8
```

See the [Installation Guide](../../../guides/installation.md).

<!-- TODO: uncomment when finetune guide is published.
## Fine-Tuning

See the [Embedding and Reranking Fine-Tuning Guide](../../../guides/retrieval/finetune.md) for bi-encoder training instructions, including LoRA and PEFT configuration.
-->

## Hugging Face Model Cards

- [nvidia/llama-embed-nemotron-8b](https://huggingface.co/nvidia/llama-embed-nemotron-8b)
- [nvidia/embed-nemotron-dataset-v1](https://huggingface.co/datasets/nvidia/embed-nemotron-dataset-v1)
7 changes: 4 additions & 3 deletions docs/model-coverage/overview.md
# Model Coverage Overview

NeMo AutoModel integrates with Hugging Face `transformers`. Any LLM or VLM that can be instantiated through `transformers` can also be used via NeMo AutoModel, subject to runtime, third-party software dependencies, and feature compatibility.

## Supported Hugging Face Auto Classes

| Auto Class | Task | Status | Notes |
|---|---|---|---|
| `AutoModelForImageTextToText` | Image-Text-to-Text (VLM) | Supported | See [VLM model list](vlm/index.md). |
| `AutoModelForSequenceClassification` | Sequence Classification | WIP | Early support; interfaces may change. |
| Diffusers Pipelines | Diffusion Generation (T2I, T2V) | Supported | See [Diffusion model list](diffusion/index.md). |
| `NeMoAutoModelBiEncoder` | Embedding Models | Supported | See [Embedding model list](embedding/index.md). |
| `NeMoAutoModelCrossEncoder` | Reranking Models | Supported | See [Reranking model list](reranker/index.md). |

## Release Log

The table below tracks when model support and key features were added across NeMo AutoModel releases.
- New models released on the Hugging Face Hub may require the latest `transformers` version, necessitating a package upgrade.
- We are working on a CI pipeline that automatically bumps the supported `transformers` version when a new release is detected, enabling even faster day-0 support.


## Custom Model Registry

NeMo AutoModel includes a custom model registry that allows teams to:

## Having Issues?

If a model from the Hub doesn't work as expected, see the [Troubleshooting Guide](troubleshooting.md) for common issues and solutions.
40 changes: 40 additions & 0 deletions docs/model-coverage/reranker/index.md
(reranking-models)=

# Reranking Models

## Introduction

Reranking models use cross-encoders to score a query-document pair jointly. They are typically used after an embedding model has produced an initial candidate set. NeMo AutoModel supports optimized bidirectional Llama rerankers and falls back to Hugging Face `AutoModelForSequenceClassification` for other architectures.
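
The retrieve-then-rerank flow described above can be sketched as a two-stage pipeline. The scoring functions below are stand-ins: `cross_score` plays the role of a real cross-encoder, which would jointly encode each (query, document) text pair:

```python
import torch
import torch.nn.functional as F

def retrieve_then_rerank(query_emb, doc_embs, cross_score, top_k=3):
    """Two-stage retrieval.

    Stage 1: cheap bi-encoder similarity narrows all docs to top_k candidates.
    Stage 2: an expensive cross-encoder rescores only those candidates.
    cross_score(doc_index) -> float stands in for a real cross-encoder.
    """
    sims = F.cosine_similarity(query_emb.unsqueeze(0), doc_embs, dim=-1)
    candidates = torch.topk(sims, k=top_k).indices.tolist()
    return sorted(candidates, key=cross_score, reverse=True)

torch.manual_seed(0)
query = torch.tensor([1.0, 0.0])
doc_embs = torch.randn(10, 2)
# Hypothetical cross-encoder scores: lower doc index = more relevant.
scores = {i: float(-i) for i in range(10)}
ranking = retrieve_then_rerank(query, doc_embs, lambda i: scores[i])
```

Only `top_k` documents ever reach the cross-encoder, which is why the expensive joint scoring stays affordable at serving time.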

For first-stage dense retrieval, see [Embedding Models](../embedding/index.md).

## Optimized Backbones (Bidirectional Attention)

| Owner | Model | Architecture | Wrapper Class | Tasks |
|---|---|---|---|---|
| Meta | [Llama (Bidirectional)](meta/llama-bidirectional.md) | `LlamaBidirectionalForSequenceClassification` | `NeMoAutoModelCrossEncoder` | Reranking |

## Hugging Face Auto Backbones

Any Hugging Face model loadable using `AutoModelForSequenceClassification` can be used as a reranking backbone. This fallback path uses the model's native attention; no bidirectional conversion is applied.

## Supported Workflows

- **Fine-tuning (Cross-Encoder):** Cross-entropy training on query-document pairs to produce rerankers
- **LoRA/PEFT:** Parameter-efficient fine-tuning for reranking backbones
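
The cross-entropy objective listed above, in its common listwise form, treats one query's positive and negatives as a single classification problem. A sketch under that assumption (not the exact NeMo AutoModel loss):

```python
import torch
import torch.nn.functional as F

def listwise_reranker_loss(scores: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over one query's candidate scores.

    scores: (num_candidates,) cross-encoder relevance scores, where
    index 0 is the positive document and the rest are negatives.
    The loss pushes the positive's score above all negatives' scores.
    """
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([0]))

good = listwise_reranker_loss(torch.tensor([5.0, -1.0, -2.0]))  # positive scored highest
bad = listwise_reranker_loss(torch.tensor([-2.0, 5.0, 4.0]))    # positive scored lowest
```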

## Dataset

Retrieval fine-tuning requires query-document pairs: each example is a query paired with one positive document and one or more negative documents. Both inline JSONL and corpus ID-based JSON formats are supported. See the [Retrieval Dataset](../../guides/llm/retrieval-dataset.md) guide.

<!-- TODO: uncomment when finetune guide is published.
## Train Reranking Models

For a complete walkthrough of training configuration, model-specific settings, and launch commands, see the [Embedding and Reranking Fine-Tuning Guide](../../guides/retrieval/finetune.md).
-->

```{toctree}
:hidden:

meta/llama-bidirectional
```