12 changes: 7 additions & 5 deletions docs/index.md
content_type: index

# NeMo AutoModel Documentation

PyTorch-native training that scales from 1 GPU to thousands with a single config change. Load any Hugging Face model, point at your data, and start training; no checkpoint conversion and no boilerplate.

**Quick links:** [🤗 HF Compatible](guides/huggingface-api-compatibility.md) | [🚀 Performance](performance-summary.md) | [📐 Scalability](about/key-features.md) | [🎯 SFT & PEFT](guides/llm/finetune.md) | [🎨 Diffusion](guides/diffusion/finetune.md) | [👁️ VLM](guides/vlm/gemma4.md)

::::{grid} 2 2 2 2
New models are added regularly. Pick a model below to start fine-tuning, or see

## Recipes & Guides

Find the right guide for your task: fine-tuning, pretraining, distillation, diffusion, and more.

| I want to... | Choose this when... | Input Data | Model | Guide |
| --------------------------- | ----------------------------------------------------------------------------------- | ------------------------------------------------- | --------- | --------------------------------------------------------- |
| **SFT (full fine-tune)** | You need maximum accuracy and have the GPU budget to update all weights | Instruction / chat dataset | LLM | [Start fine-tuning](guides/llm/finetune.md) |
| **PEFT (LoRA)**             | You want to fine-tune on limited GPU memory; updates <1% of parameters               | Instruction / chat dataset                         | LLM       | [Start LoRA](guides/llm/finetune.md)                       |
| **Tool / function calling** | Your model needs to call APIs or tools with structured arguments | Function-calling dataset (queries + tool schemas) | LLM | [Add tool calling](guides/llm/toolcalling.md) |
| **Fine-tune VLM** | Your task involves both images and text (e.g., visual QA, captioning) | Image + text dataset | VLM | [Fine-tune VLM](guides/omni/gemma3-3n.md) |
| **Fine-tune Gemma 4** | You want to fine-tune Gemma 4 for structured extraction from images (e.g., receipts) | Image + text dataset | VLM | [Fine-tune Gemma 4](guides/vlm/gemma4.md) |
See the [full benchmark results](performance-summary.md) for configuration details.

## Advanced Topics

Parallelism, precision, checkpointing strategies, and experiment tracking.

::::{grid} 1 2 2 3
:gutter: 1 1 1 2
performance-summary.md
Overview <model-coverage/overview.md>
Release Log <model-coverage/latest-models.md>
Large Language Models <model-coverage/llm/index.md>
Vision Language Models <model-coverage/vlm/index.md>
Omni <model-coverage/omni/index.md>
Diffusion <model-coverage/diffusion/index.md>
Embedding Models <model-coverage/embedding/index.md>
Reranking Models <model-coverage/reranker/index.md>
::::

::::{toctree}
53 changes: 53 additions & 0 deletions docs/model-coverage/embedding/index.md
(embedding-models)=

# Embedding Models

## Introduction

Embedding models convert text into dense vector representations for semantic search, dense retrieval, retrieval-augmented generation (RAG), and classification. NeMo AutoModel supports optimized bidirectional Llama bi-encoders and falls back to Hugging Face `AutoModel` for other encoder backbones.

For cross-encoder pairwise scoring, see [Reranking Models](../reranker/index.md).

Embedding models use bi-encoders to produce dense representations for queries and documents independently. They are the standard path for embedding generation and first-stage dense retrieval.
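
The independence property described above can be sketched in a few lines: queries and documents are embedded separately, and relevance reduces to a vector similarity. The embeddings below are toy stand-ins for the output of any embedding model:

```python
import torch
import torch.nn.functional as F

def rank_documents(query_emb: torch.Tensor, doc_embs: torch.Tensor) -> torch.Tensor:
    """Rank documents by cosine similarity to the query.

    query_emb: (dim,) embedding of the query.
    doc_embs:  (num_docs, dim) document embeddings, computed independently
               of the query -- the defining property of a bi-encoder.
    Returns document indices sorted from most to least similar.
    """
    sims = F.cosine_similarity(query_emb.unsqueeze(0), doc_embs, dim=-1)
    return torch.argsort(sims, descending=True)

# Toy embeddings: doc 1 points in the same direction as the query.
query = torch.tensor([1.0, 0.0])
docs = torch.stack([
    torch.tensor([0.0, 1.0]),   # orthogonal to the query
    torch.tensor([2.0, 0.0]),   # same direction
    torch.tensor([-1.0, 0.0]),  # opposite direction
])
ranking = rank_documents(query, docs)  # doc 1 first, doc 2 last
```

Because document embeddings do not depend on the query, they can be precomputed once and indexed, which is what makes bi-encoders suitable for first-stage retrieval.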

### Optimized Backbones (Bidirectional Attention)

| Owner | Model | Architecture | Auto Class | Tasks |
|---|---|---|---|---|
| Meta | [Llama (Bidirectional)](meta/llama-bidirectional.md) | `LlamaBidirectionalModel` | [`NeMoAutoModelBiEncoder`](https://github.com/NVIDIA-NeMo/Automodel/blob/8dc00dcb4a35c2413c52c6e7eb7ac8f1c24836aa/nemo_automodel/_transformers/auto_model.py#L991) | Embedding, Dense Retrieval |
| NVIDIA | [Llama-Embed-Nemotron-8B](nvidia/llama-embed-nemotron-8b.md) | `LlamaBidirectionalModel` | [`NeMoAutoModelBiEncoder`](https://github.com/NVIDIA-NeMo/Automodel/blob/8dc00dcb4a35c2413c52c6e7eb7ac8f1c24836aa/nemo_automodel/_transformers/auto_model.py#L991) | Embedding, Dense Retrieval |

### Hugging Face Auto Backbones

Any Hugging Face model that can be loaded with `AutoModel` can be used as an embedding backbone. This fallback path uses the model's native attention; no bidirectional conversion is applied.

## Example Recipes

| Recipe | Description |
|---|---|
| {download}`llama3_2_1b.yaml <../../../examples/retrieval/bi_encoder/llama3_2_1b.yaml>` | Bi-encoder — Llama 3.2 1B embedding model |
| {download}`llama_embed_nemotron_8b.yaml <../../../examples/retrieval/bi_encoder/llama_embed_nemotron_8b/llama_embed_nemotron_8b.yaml>` | Bi-encoder — Llama-Embed-Nemotron-8B reproduction recipe |

## Supported Workflows

- **Fine-tuning (Bi-Encoder):** Contrastive learning on query-document pairs to produce embedding models
- **LoRA/PEFT:** Parameter-efficient fine-tuning for embedding backbones
- **ONNX Export:** Export trained embedding models for deployment (supported case by case; model-dependent)
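
The contrastive objective listed above is, in its simplest in-batch-negatives form, a cross-entropy over a similarity matrix. The sketch below illustrates the idea only; it is not the NeMo AutoModel implementation, and the temperature value is illustrative:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q: torch.Tensor, d: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE loss with in-batch negatives.

    q: (batch, dim) query embeddings.
    d: (batch, dim) embeddings of each query's positive document.
    For query i, document i is the positive; every other document
    in the batch serves as a negative.
    """
    q = F.normalize(q, dim=-1)
    d = F.normalize(d, dim=-1)
    logits = q @ d.T / temperature    # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0))  # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

torch.manual_seed(0)
q = torch.randn(8, 32)
loss = in_batch_contrastive_loss(q, q.clone())  # perfectly aligned pairs -> near-zero loss
```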

## Dataset

Retrieval fine-tuning requires query-document pairs: each example is a query paired with one positive document and one or more negative documents. Both inline JSONL and corpus ID-based JSON formats are supported. See the [Retrieval Dataset](../../guides/llm/retrieval-dataset.md) guide.

<!--
@akoumpa: uncomment this when finetune guide is published.
## Train Embedding Models

For a complete walkthrough of training configuration, model-specific settings, and launch commands, see the [Embedding and Reranking Fine-Tuning Guide](../../guides/retrieval/finetune.md).
-->

```{toctree}
:hidden:

meta/llama-bidirectional
nvidia/llama-embed-nemotron-8b
```
112 changes: 112 additions & 0 deletions docs/model-coverage/embedding/meta/llama-bidirectional.md
# Llama (Bidirectional) for Embedding

NeMo AutoModel provides a bidirectional variant of [Meta's Llama](https://www.llama.com/) for embedding and dense retrieval tasks. Unlike the standard causal (left-to-right) Llama used for text generation, this variant uses non-causal **bidirectional attention**, so each token can attend to both past and future tokens in the sequence, producing richer representations for semantic similarity and dense retrieval.
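
The difference from causal Llama can be illustrated with the attention masks themselves. This is a conceptual sketch of the two masking schemes, not the library's internal implementation:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """Lower-triangular mask: token i may attend only to tokens j <= i."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def bidirectional_mask(seq_len: int) -> torch.Tensor:
    """Full mask: every token attends to every position, past and future."""
    return torch.ones(seq_len, seq_len, dtype=torch.bool)

c = causal_mask(3)
b = bidirectional_mask(3)
# Under causal attention, token 0 cannot see token 2; bidirectionally it can.
assert not c[0, 2] and b[0, 2]
```

The extra context from future tokens is what makes the bidirectional variant better suited to producing a single representation of a whole passage.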

For the cross-encoder variant, see [Llama (Bidirectional) for Reranking](../../reranker/meta/llama-bidirectional.md).
For the NVIDIA model page, see [Llama-Embed-Nemotron-8B](../nvidia/llama-embed-nemotron-8b.md).

:::{card}
| | |
|---|---|
| **Tasks** | Embedding, Dense Retrieval |
| **Architecture** | `LlamaBidirectionalModel` |
| **Parameters** | 1B – 8B |
| **HF Org** | [meta-llama](https://huggingface.co/meta-llama) |
:::

## Available Models

Any Llama checkpoint can be loaded as a bidirectional backbone. The following configurations are tested:

- **Llama 3.2 1B** — fast iteration, fits on a single GPU
- **Llama 3.1 8B** — higher-quality embeddings for production use

## Embedding Models

The bidirectional bi-encoder path is used for embedding generation and dense retrieval.

| Architecture | Task | Auto Class | Description |
|---|---|---|---|
| `LlamaBidirectionalModel` | Embedding | [`NeMoAutoModelBiEncoder`](https://github.com/NVIDIA-NeMo/Automodel/blob/8dc00dcb4a35c2413c52c6e7eb7ac8f1c24836aa/nemo_automodel/_transformers/auto_model.py#L991) | Bidirectional Llama with pooling for dense embeddings |

## Pooling Strategies

The bi-encoder supports multiple pooling strategies to aggregate token representations into a single embedding vector:

| Strategy | Description |
|---|---|
| `avg` | Average of all token hidden states (default) |
| `cls` | First token hidden state |
| `last` | Last non-padding token hidden state |
| `weighted_avg` | Weighted average of token hidden states |
| `colbert` | No pooling — token-level embeddings (ColBERT-style) |
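
The mask-aware strategies above can be sketched as follows. This is a simplified illustration of the pooling math, assuming right-padded sequences; it is not the library's internal code:

```python
import torch

def pool(hidden: torch.Tensor, mask: torch.Tensor, strategy: str = "avg") -> torch.Tensor:
    """Aggregate (batch, seq, dim) token states into (batch, dim) embeddings.

    mask is (batch, seq) with 1 for real tokens and 0 for padding.
    """
    if strategy == "avg":
        m = mask.unsqueeze(-1).float()
        # Sum only real tokens, then divide by the real-token count.
        return (hidden * m).sum(dim=1) / m.sum(dim=1).clamp(min=1e-9)
    if strategy == "cls":
        return hidden[:, 0]                # first token's hidden state
    if strategy == "last":
        idx = mask.sum(dim=1) - 1          # index of last non-padding token
        return hidden[torch.arange(hidden.size(0)), idx]
    raise ValueError(f"unknown strategy: {strategy}")

h = torch.randn(2, 4, 8)                   # batch=2, seq=4, dim=8
mask = torch.tensor([[1, 1, 1, 0],         # first sequence has one pad token
                     [1, 1, 1, 1]])
emb = pool(h, mask, "avg")                 # padding is excluded from the average
```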

## Example HF Models

| Model | HF ID |
|---|---|
| Llama 3.2 1B | [`meta-llama/Llama-3.2-1B`](https://huggingface.co/meta-llama/Llama-3.2-1B) |
| Llama 3.1 8B | [`meta-llama/Llama-3.1-8B`](https://huggingface.co/meta-llama/Llama-3.1-8B) |

## Example Recipes

| Recipe | Description |
|---|---|
| {download}`llama3_2_1b.yaml <../../../../examples/retrieval/bi_encoder/llama3_2_1b.yaml>` | Bi-encoder — Llama 3.2 1B embedding model |

## Try with NeMo AutoModel

**1. Install NeMo AutoModel**. Refer to the [Installation Guide](../../../guides/installation.md) for details:

```bash
uv pip install nemo-automodel
```

**2. Clone the repo** to get the example recipes:

```bash
git clone https://github.com/NVIDIA-NeMo/Automodel.git
cd Automodel
```

**3. Run the recipe** from inside the repo:

```bash
automodel examples/retrieval/bi_encoder/llama3_2_1b.yaml --nproc-per-node 8
```

:::{dropdown} Run with Docker
**1. Pull the container** and mount a checkpoint directory:

```bash
docker run --gpus all -it --rm \
--shm-size=8g \
-v $(pwd)/checkpoints:/opt/Automodel/checkpoints \
nvcr.io/nvidia/nemo-automodel:26.02.00
```

**2. Navigate** to the AutoModel directory (where the recipes are):

```bash
cd /opt/Automodel
```

**3. Run the recipe**:

```bash
automodel examples/retrieval/bi_encoder/llama3_2_1b.yaml --nproc-per-node 8
```
:::

See the [Installation Guide](../../../guides/installation.md).

<!-- TODO: uncomment when finetune guide is published.
## Fine-Tuning

See the [Embedding and Reranking Fine-Tuning Guide](../../../guides/retrieval/finetune.md) for bi-encoder training instructions, including LoRA and PEFT configuration.
-->

## Hugging Face Model Cards

- [meta-llama/Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B)
- [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B)
75 changes: 75 additions & 0 deletions docs/model-coverage/embedding/nvidia/llama-embed-nemotron-8b.md
# Llama-Embed-Nemotron-8B

[Llama-Embed-Nemotron-8B](https://huggingface.co/nvidia/llama-embed-nemotron-8b) is NVIDIA's text embedding model for retrieval, semantic similarity, classification, and multilingual retrieval workloads. In NeMo AutoModel, it is reproduced with the bidirectional Llama bi-encoder backbone.

For architecture-level details such as bidirectional attention and pooling strategies, see [Llama (Bidirectional)](../meta/llama-bidirectional.md).

:::{card}
| | |
|---|---|
| **Task** | Embedding, Dense Retrieval |
| **Architecture** | `LlamaBidirectionalModel` |
| **Parameters** | 8B |
| **HF Org** | [nvidia](https://huggingface.co/nvidia) |
:::

## Available Models

- **Llama-Embed-Nemotron-8B**

## Architecture

- `LlamaBidirectionalModel`

## Example HF Models

| Model | HF ID |
|---|---|
| Llama-Embed-Nemotron-8B | [`nvidia/llama-embed-nemotron-8b`](https://huggingface.co/nvidia/llama-embed-nemotron-8b) |

## Example Recipes

| Recipe | Description |
|---|---|
| {download}`llama_embed_nemotron_8b.yaml <../../../../examples/retrieval/bi_encoder/llama_embed_nemotron_8b/llama_embed_nemotron_8b.yaml>` | Bi-encoder — reproduction recipe for Llama-Embed-Nemotron-8B |

## Try with NeMo AutoModel

**1. Install NeMo AutoModel**. Refer to the [Installation Guide](../../../guides/installation.md) for details:

```bash
uv pip install nemo-automodel
```

**2. Clone the repo** to get the example recipes:

```bash
git clone https://github.com/NVIDIA-NeMo/Automodel.git
cd Automodel
```

**3. Prepare the dataset** used by the reproduction recipe:

```bash
uv run python examples/retrieval/bi_encoder/llama_embed_nemotron_8b/data_preparation.py \
--download-path ./embed_nemotron_dataset_v1
```

**4. Run the recipe** from inside the repo:

```bash
automodel examples/retrieval/bi_encoder/llama_embed_nemotron_8b/llama_embed_nemotron_8b.yaml --nproc-per-node 8
```

See the [Installation Guide](../../../guides/installation.md).

<!-- TODO: uncomment when finetune guide is published.
## Fine-Tuning

See the [Embedding and Reranking Fine-Tuning Guide](../../../guides/retrieval/finetune.md) for bi-encoder training instructions, including LoRA and PEFT configuration.
-->

## Hugging Face Model Cards

- [nvidia/llama-embed-nemotron-8b](https://huggingface.co/nvidia/llama-embed-nemotron-8b)
- [nvidia/embed-nemotron-dataset-v1](https://huggingface.co/datasets/nvidia/embed-nemotron-dataset-v1)
7 changes: 4 additions & 3 deletions docs/model-coverage/overview.md
# Model Coverage Overview

NeMo AutoModel integrates with Hugging Face `transformers`. Any LLM or VLM that can be instantiated through `transformers` can also be used via NeMo AutoModel, subject to runtime, third-party software dependencies, and feature compatibility.

## Supported Hugging Face Auto Classes

| Auto Class | Task | Status | Notes |
|---|---|---|---|
| `AutoModelForImageTextToText` | Image-Text-to-Text (VLM) | Supported | See [VLM model list](vlm/index.md). |
| `AutoModelForSequenceClassification` | Sequence Classification | WIP | Early support; interfaces may change. |
| Diffusers Pipelines | Diffusion Generation (T2I, T2V) | Supported | See [Diffusion model list](diffusion/index.md). |
| `NeMoAutoModelBiEncoder` | Embedding Models | Supported | See [Embedding model list](embedding/index.md). |
| `NeMoAutoModelCrossEncoder` | Reranking Models | Supported | See [Reranking model list](reranker/index.md). |

## Release Log

The table below tracks when model support and key features were added across NeMo AutoModel releases.
- New models released on the Hugging Face Hub may require the latest `transformers` version, necessitating a package upgrade.
- We are working on a CI pipeline that automatically bumps the supported `transformers` version when a new release is detected, enabling even faster day-0 support.


## Custom Model Registry

NeMo AutoModel includes a custom model registry that allows teams to:

## Having Issues?

If a model from the Hub doesn't work as expected, see the [Troubleshooting Guide](troubleshooting.md) for common issues and solutions.
40 changes: 40 additions & 0 deletions docs/model-coverage/reranker/index.md
(reranking-models)=

# Reranking Models

## Introduction

Reranking models use cross-encoders to score a query-document pair jointly. They are typically used after an embedding model has produced an initial candidate set. NeMo AutoModel supports optimized bidirectional Llama rerankers and falls back to Hugging Face `AutoModelForSequenceClassification` for other architectures.
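
The retrieve-then-rerank flow described above can be sketched as a two-stage pipeline. The scoring functions below are stand-ins: `cross_score` plays the role of a real cross-encoder, which would jointly encode each (query, document) text pair:

```python
import torch
import torch.nn.functional as F

def retrieve_then_rerank(query_emb, doc_embs, cross_score, top_k=3):
    """Two-stage retrieval.

    Stage 1: cheap bi-encoder similarity narrows all docs to top_k candidates.
    Stage 2: an expensive cross-encoder rescores only those candidates.
    cross_score(doc_index) -> float stands in for a real cross-encoder.
    """
    sims = F.cosine_similarity(query_emb.unsqueeze(0), doc_embs, dim=-1)
    candidates = torch.topk(sims, k=top_k).indices.tolist()
    return sorted(candidates, key=cross_score, reverse=True)

torch.manual_seed(0)
query = torch.tensor([1.0, 0.0])
doc_embs = torch.randn(10, 2)
# Hypothetical cross-encoder scores: lower doc index = more relevant.
scores = {i: float(-i) for i in range(10)}
ranking = retrieve_then_rerank(query, doc_embs, lambda i: scores[i])
```

Only `top_k` documents ever reach the cross-encoder, which is why the expensive joint scoring stays affordable at serving time.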

For first-stage dense retrieval, see [Embedding Models](../embedding/index.md).

## Optimized Backbones (Bidirectional Attention)

| Owner | Model | Architecture | Wrapper Class | Tasks |
|---|---|---|---|---|
| Meta | [Llama (Bidirectional)](meta/llama-bidirectional.md) | `LlamaBidirectionalForSequenceClassification` | `NeMoAutoModelCrossEncoder` | Reranking |

## Hugging Face Auto Backbones

Any Hugging Face model loadable using `AutoModelForSequenceClassification` can be used as a reranking backbone. This fallback path uses the model's native attention; no bidirectional conversion is applied.

## Supported Workflows

- **Fine-tuning (Cross-Encoder):** Cross-entropy training on query-document pairs to produce rerankers
- **LoRA/PEFT:** Parameter-efficient fine-tuning for reranking backbones
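
The cross-entropy objective listed above, in its common listwise form, treats one query's positive and negatives as a single classification problem. A sketch under that assumption (not the exact NeMo AutoModel loss):

```python
import torch
import torch.nn.functional as F

def listwise_reranker_loss(scores: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over one query's candidate scores.

    scores: (num_candidates,) cross-encoder relevance scores, where
    index 0 is the positive document and the rest are negatives.
    The loss pushes the positive's score above all negatives' scores.
    """
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([0]))

good = listwise_reranker_loss(torch.tensor([5.0, -1.0, -2.0]))  # positive scored highest
bad = listwise_reranker_loss(torch.tensor([-2.0, 5.0, 4.0]))    # positive scored lowest
```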

## Dataset

Retrieval fine-tuning requires query-document pairs: each example is a query paired with one positive document and one or more negative documents. Both inline JSONL and corpus ID-based JSON formats are supported. See the [Retrieval Dataset](../../guides/llm/retrieval-dataset.md) guide.

<!-- TODO: uncomment when finetune guide is published.
## Train Reranking Models

For a complete walkthrough of training configuration, model-specific settings, and launch commands, see the [Embedding and Reranking Fine-Tuning Guide](../../guides/retrieval/finetune.md).
-->

```{toctree}
:hidden:

meta/llama-bidirectional
```