
# Llama (Bidirectional) for Embedding

NeMo AutoModel provides a bidirectional variant of Meta's Llama for embedding and dense retrieval tasks. Unlike the standard causal (left-to-right) Llama used for text generation, this variant uses non-causal bidirectional attention, so each token can attend to both past and future tokens in the sequence, producing richer representations for semantic similarity and dense retrieval.
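The difference between the two attention patterns can be sketched as an attention mask. This is a minimal plain-Python illustration (not NeMo AutoModel code): `mask[i][j] == 1` means token `i` may attend to token `j`.

```python
def causal_mask(seq_len):
    """Standard Llama: token i attends only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

def bidirectional_mask(seq_len):
    """Bidirectional variant: every token attends to every other token."""
    return [[1] * seq_len for _ in range(seq_len)]

# Under causal attention the first token cannot see the last one;
# under bidirectional attention it can.
print(causal_mask(3)[0])         # [1, 0, 0]
print(bidirectional_mask(3)[0])  # [1, 1, 1]
```

Because every position sees the full sequence, the pooled hidden states reflect whole-sentence context, which is why the bidirectional variant is preferred for embeddings over the generation-oriented causal model.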

For the cross-encoder variant, see Llama (Bidirectional) for Reranking. For the NVIDIA model page, see Llama-Embed-Nemotron-8B.

:::{card}

**Tasks**: Embedding, Dense Retrieval
**Architecture**: `LlamaBidirectionalModel`
**Parameters**: 1B – 8B
**HF Org**: meta-llama
:::

## Available Models

Any Llama checkpoint can be loaded as a bidirectional backbone. The following configurations are tested:

- Llama 3.2 1B – fast iteration, fits on a single GPU
- Llama 3.1 8B – higher-quality embeddings for production use

## Embedding Models

The bidirectional bi-encoder path is used for embedding generation and dense retrieval.

| Architecture | Task | Auto Class | Description |
|---|---|---|---|
| `LlamaBidirectionalModel` | Embedding | `NeMoAutoModelBiEncoder` | Bidirectional Llama with pooling for dense embeddings |
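Once each text is mapped to a single embedding vector, dense retrieval reduces to nearest-neighbor search by similarity. The sketch below shows the core idea in plain Python; the `retrieve` helper is hypothetical and not part of the NeMo AutoModel API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_emb, doc_embs, top_k=2):
    """Return indices of the top_k documents ranked by cosine similarity."""
    scored = sorted(enumerate(doc_embs),
                    key=lambda pair: cosine(query_emb, pair[1]),
                    reverse=True)
    return [idx for idx, _ in scored[:top_k]]

query = [1.0, 0.0]
docs = [[0.0, 1.0], [0.9, 0.1], [1.0, 0.05]]
print(retrieve(query, docs))  # [2, 1] — docs 2 and 1 point nearly the same way as the query
```

In practice the document embeddings are precomputed and indexed (e.g. with an approximate nearest-neighbor library); only the query is embedded at search time.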

## Pooling Strategies

The bi-encoder supports multiple pooling strategies to aggregate token representations into a single embedding vector:

| Strategy | Description |
|---|---|
| `avg` | Average of all token hidden states (default) |
| `cls` | First token hidden state |
| `last` | Last non-padding token hidden state |
| `weighted_avg` | Weighted average of token hidden states |
| `colbert` | No pooling; token-level embeddings (ColBERT-style) |
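The simplest strategies can be illustrated with a short plain-Python sketch (a hypothetical helper, not the library's implementation; `weighted_avg` and `colbert` are omitted for brevity). Padding tokens are excluded via the attention mask so they do not dilute the embedding.

```python
def pool(hidden_states, attention_mask, strategy="avg"):
    """Aggregate per-token hidden states (seq_len x dim) into one vector.

    attention_mask[i] is 1 for real tokens and 0 for padding.
    """
    real = [h for h, m in zip(hidden_states, attention_mask) if m == 1]
    dim = len(hidden_states[0])
    if strategy == "avg":
        # Mean over non-padding tokens, per dimension.
        return [sum(h[d] for h in real) / len(real) for d in range(dim)]
    if strategy == "cls":
        # First token's hidden state.
        return hidden_states[0]
    if strategy == "last":
        # Last non-padding token's hidden state.
        return real[-1]
    raise ValueError(f"unknown strategy: {strategy}")

states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 tokens, dim 2
mask = [1, 1, 0]                               # last token is padding
print(pool(states, mask, "avg"))   # [2.0, 3.0]
print(pool(states, mask, "last"))  # [3.0, 4.0]
```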

## Example HF Models

| Model | HF ID |
|---|---|
| Llama 3.2 1B | `meta-llama/Llama-3.2-1B` |
| Llama 3.1 8B | `meta-llama/Llama-3.1-8B` |

## Example Recipes

| Recipe | Description |
|---|---|
| {download}`llama3_2_1b.yaml <../../../../examples/retrieval/bi_encoder/llama3_2_1b.yaml>` | Bi-encoder: Llama 3.2 1B embedding model |

## Try with NeMo AutoModel

1. Install NeMo AutoModel. Refer to the Installation Guide for details:

   ```bash
   uv pip install nemo-automodel
   ```

2. Clone the repo to get the example recipes:

   ```bash
   git clone https://github.com/NVIDIA-NeMo/Automodel.git
   cd Automodel
   ```

3. Run the recipe from inside the repo:

   ```bash
   automodel examples/retrieval/bi_encoder/llama3_2_1b.yaml --nproc-per-node 8
   ```

:::{dropdown} Run with Docker

1. Pull the container and mount a checkpoint directory:

   ```bash
   docker run --gpus all -it --rm \
     --shm-size=8g \
     -v $(pwd)/checkpoints:/opt/Automodel/checkpoints \
     nvcr.io/nvidia/nemo-automodel:26.02.00
   ```

2. Navigate to the AutoModel directory (where the recipes are):

   ```bash
   cd /opt/Automodel
   ```

3. Run the recipe:

   ```bash
   automodel examples/retrieval/bi_encoder/llama3_2_1b.yaml --nproc-per-node 8
   ```

:::

See the Installation Guide.

## Hugging Face Model Cards