4 changes: 3 additions & 1 deletion src/docs.json
@@ -1157,6 +1157,7 @@
]
},
"oss/python/langchain/retrieval",
"oss/python/langchain/choosing-embeddings",
"oss/python/langchain/long-term-memory"
]
},
@@ -1286,7 +1287,8 @@
"oss/python/integrations/splitters/index",
"oss/python/integrations/embeddings/index",
"oss/python/integrations/vectorstores/index",
"oss/python/integrations/document_loaders/index"
"oss/python/integrations/document_loaders/index",
"oss/python/integrations/document_transformers/index"
]
}
]
129 changes: 129 additions & 0 deletions src/oss/langchain/choosing-embeddings.mdx
I'm glad to leave these changes be, but I get lots of questions about choosing embedding models personally, so I think this kind of page can be very helpful.

@@ -0,0 +1,129 @@
---
title: Choose an embedding model
sidebarTitle: Choose an embedding model
description: Practical guidance for selecting a text embedding model for retrieval-augmented applications.
---

Embeddings are the foundation of most retrieval pipelines: documents are encoded as vectors, queries are embedded and compared against those vectors, and retrieval quality is bounded by the quality of that comparison. There is no single "best" embedding model; the right choice depends on your data, your latency and cost budget, and where the embeddings will run.

## Four common deployment patterns

In practice, most teams converge on one of four patterns:

1. Hosted, flagship: OpenAI `text-embedding-3-large`, Cohere `embed-english-v3.0`, Google `gemini-embedding-001`, Voyage `voyage-3`. One API call, best-in-class quality out of the box, no local infrastructure. The trade-offs are per-call cost and a data-egress dependency.
2. Local, open-source: `BAAI/bge-*`, `mixedbread-ai/mxbai-embed-*`, `Qwen/Qwen3-Embedding-*`, `nomic-ai/modernbert-embed-*`, `sentence-transformers/all-*`. Download once, run anywhere. No per-call cost, data never leaves your environment. Likely slower on CPU than a hosted API at small scale; competitive or faster with a GPU.
3. Local, open-source, specialist: a fine-tuned model targeting your specific domain, language, or task. Starting from a strong open base (e.g. `BAAI/bge-m3`) and fine-tuning on even a few thousand in-domain query/document pairs often beats hosted flagships on retrieval accuracy for that domain.
4. Self-hosted at production scale: the same open models (base or fine-tuned) served via [Text Embeddings Inference (TEI)](/oss/integrations/embeddings/text_embeddings_inference) or Ollama. Gives you the economics of local inference with the horizontal scaling and API ergonomics of a hosted provider.

LangChain treats all four the same: you instantiate an `Embeddings` subclass and hand it to your vector store or retriever. Patterns (2) and (3) use `HuggingFaceEmbeddings`; pattern (4) uses `HuggingFaceEndpointEmbeddings` or `OllamaEmbeddings`.

## Factors to weigh

### Quality

Start from the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard). MTEB benchmarks embedding models across retrieval, clustering, classification, and reranking tasks, and is the de facto industry reference. Filter by your language(s) and by task (retrieval is the most common for RAG).

Leaderboard numbers don't always transfer, so run a small evaluation on your own data before committing. LangSmith has tooling for this; see the [evaluation guides](/langsmith/evaluation-concepts).

### Cost

Hosted embeddings typically price in the range of a few cents to ~$0.15 per million tokens. For a corpus embedded once and queried thousands of times a day, cost is often dominated by the query side.

Local inference has zero per-call cost but requires CPU (slow) or GPU (capital or cloud cost). The crossover is workload-dependent: low-volume personal projects are essentially free on CPU; for mid-volume production, a single GPU serving a local model via TEI often beats hosted on unit economics.
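The crossover point is easy to estimate with back-of-envelope arithmetic. The sketch below is illustrative only: the per-token price and GPU hourly rate are assumptions, not quotes, so substitute your provider's actual numbers.

:::python
```python
# Back-of-envelope hosted-vs-GPU cost comparison.
# Illustrative prices (assumptions, not quotes): adjust to your providers.
HOSTED_PRICE_PER_M_TOKENS = 0.13  # USD per million tokens, flagship-class hosted model
GPU_HOURLY = 0.50                 # USD per hour for a modest cloud GPU


def hosted_monthly_cost(tokens_per_day: float) -> float:
    """Hosted API cost for 30 days of query traffic."""
    return tokens_per_day * 30 / 1_000_000 * HOSTED_PRICE_PER_M_TOKENS


def gpu_monthly_cost(hours_per_day: float = 24.0) -> float:
    """Cost of an always-on GPU serving a local model (e.g. via TEI)."""
    return hours_per_day * 30 * GPU_HOURLY


# At ~50M query tokens/day, hosted lands near $195/month under these
# assumptions, while a single always-on GPU is ~$360/month; hosted wins
# here, but the GPU cost is flat as volume grows.
print(hosted_monthly_cost(50_000_000))
print(gpu_monthly_cost())
```
:::

Repeating this calculation with your real traffic and prices usually makes the hosted-versus-self-hosted decision obvious.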

### Latency

Hosted embedding APIs add roughly 50-200ms of network latency per request. Local models on CPU take 10-100ms for a short query with a small model (`all-MiniLM-L6-v2`-class), and 50-500ms for larger models. On GPU, local inference is typically faster than a round-trip to a hosted API.

For batch indexing, latency per request matters less than throughput. TEI and multi-process local inference batch aggressively. Consider e.g. `encode_kwargs={"batch_size": 64}` or higher on `HuggingFaceEmbeddings` when running on GPU.

### Dimensionality

Embedding dimension affects vector store storage and query compute. Typical sizes:

- 384 (small Sentence Transformers models, `all-MiniLM-L6-v2`)
- 768 (mid-size ST models, `all-mpnet-base-v2`, `bge-base`)
- 1024 (`bge-large`, Cohere v3, Voyage, Qwen3-Embedding-0.6B)
- 1536 (OpenAI `text-embedding-3-small`)
- 3072+ (OpenAI `text-embedding-3-large`, Qwen3-Embedding-8B)

Larger vectors are usually more accurate but consume more storage and query compute. Several modern models (OpenAI `text-embedding-3-*`, `mixedbread-ai/mxbai-embed-large-v1`, Matryoshka-trained ST models, Qwen3-Embedding) support **truncation**: slice the vector to a smaller dimension with graceful quality degradation. Useful for fitting more vectors into a smaller index.
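The mechanics of Matryoshka-style truncation are simple: keep the first `dim` components and re-normalize, after which cosine similarity works as usual. A minimal stdlib sketch (the toy 8-d vectors are made up for illustration; real embeddings are hundreds to thousands of dimensions):

:::python
```python
import math


def truncate_and_normalize(vec: list[float], dim: int) -> list[float]:
    """Slice a Matryoshka-trained embedding to its first `dim` components
    and re-normalize to unit length so cosine similarity stays meaningful."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]


def cosine(a: list[float], b: list[float]) -> float:
    # For unit vectors, the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


# Toy 8-d "embeddings", truncated to 4 dims: half the storage per vector.
doc = truncate_and_normalize([0.9, 0.1, 0.3, 0.2, 0.05, 0.04, 0.01, 0.02], 4)
query = truncate_and_normalize([0.8, 0.2, 0.25, 0.1, 0.1, 0.0, 0.0, 0.05], 4)
print(cosine(doc, query))
```
:::

Note this only degrades gracefully for models trained with Matryoshka-style objectives; naively truncating an arbitrary model's vectors can hurt quality badly.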

### Context length

Most classic embedding models cap out at 512 tokens (`all-mpnet-base-v2`, classic BGE). Newer models support longer contexts:

- `nomic-ai/modernbert-embed-base`: 8192 tokens
- `Alibaba-NLP/gte-multilingual-base`: 8192 tokens
- `BAAI/bge-m3`: 8192 tokens
- OpenAI `text-embedding-3-*`: 8191 tokens

If your chunks are long (full-page technical docs, legal paragraphs), prefer long-context models. For short chunks the 512-token limit is rarely binding.

### Multilingual support

For multilingual retrieval, pick a model trained on your languages. Strong defaults:

- Open: `BAAI/bge-m3`, `intfloat/multilingual-e5-*`, `Alibaba-NLP/gte-multilingual-*`, `Qwen/Qwen3-Embedding-*` (via `HuggingFaceEmbeddings`)
- Hosted: Cohere `embed-multilingual-v3.0`, OpenAI `text-embedding-3-*`

### Query and document prompts

Several modern open models (E5, BGE, Qwen3-Embedding, GTE) are trained with different text prefixes for queries versus documents. Using the wrong prefix at query time is a common quality regression. When using `HuggingFaceEmbeddings`, pass prompts explicitly:

:::python
```python
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="intfloat/e5-large-v2",
    encode_kwargs={"prompt": "passage: "},
    query_encode_kwargs={"prompt": "query: "},
)
```
:::
:::js
The `query_encode_kwargs` API shown above is Python-specific. In JavaScript, hosted providers (OpenAI, Cohere, Google) handle query and document prompts internally; for local inference, prepend the prefix manually to your queries and documents before embedding.
:::

Check each model's card on Hugging Face for the recommended prompt strings.

### Licensing

Most popular open embedding models are permissively licensed (Apache 2.0, MIT). A few recent specialist models require a commercial license for production use. Check each model's license before shipping.

## Beyond single-vector dense embeddings

A single dense vector per chunk is the default, but not the only option.

### Rerankers

Vector search returns the top-k most similar chunks by embedding distance, a cheap approximation of relevance. A reranker scores each `(query, chunk)` pair directly using a cross-encoder, producing more accurate ordering at the cost of one extra inference per chunk. Retrieving the top-20 via embeddings and reranking down to the top-5 is one of the highest-impact quality improvements you can make. See [Cross Encoder Reranker](/oss/integrations/document_transformers/cross_encoder_reranker) for the local open-source path, or [Cohere Rerank](/oss/integrations/retrievers/cohere-reranker) for the hosted equivalent.
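The retrieve-then-rerank shape is independent of which reranker you use. The sketch below shows the pattern with a deliberately naive stand-in scorer (plain token overlap, a made-up placeholder); in practice you would swap in a real cross-encoder's score function:

:::python
```python
def rerank(query: str, chunks: list[str], score_fn, top_n: int = 5) -> list[str]:
    """Score each (query, chunk) pair and keep the top_n highest-scoring chunks."""
    return sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)[:top_n]


def overlap_score(query: str, chunk: str) -> float:
    # Stand-in scorer: fraction of query tokens appearing in the chunk.
    # A real cross-encoder would replace this with a learned relevance score.
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / len(q)


candidates = [  # imagine these are the top-20 returned by vector search
    "Task decomposition breaks a goal into smaller steps.",
    "Embeddings map text to vectors.",
    "Agents use task decomposition to plan.",
]
print(rerank("what is task decomposition", candidates, overlap_score, top_n=2))
```
:::

The key design point: the expensive per-pair scoring runs only on the small candidate set the cheap vector search already produced.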

### Sparse and hybrid retrieval

Dense embeddings don't handle exact-match queries (product codes, named entities, code identifiers) as well as keyword-based indexes. Hybrid retrieval combines a dense index with BM25 or a sparse neural index (SPLADE, `BAAI/bge-m3`'s sparse output) to cover both cases.
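One common way to combine the two result lists is reciprocal rank fusion (RRF), which needs only the rank positions, not comparable scores. A minimal sketch (the document IDs are made up; `k=60` is the conventional smoothing constant from the RRF literature):

:::python
```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists (e.g. dense + BM25) by summing 1/(k + rank)
    for each document across all lists, then sorting by the fused score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


dense = ["doc_a", "doc_b", "doc_c"]   # semantic matches from the vector index
sparse = ["doc_c", "doc_d", "doc_a"]  # keyword matches (e.g. an exact SKU hit)
print(reciprocal_rank_fusion([dense, sparse]))
```
:::

Documents ranked well by both indexes (here `doc_a` and `doc_c`) rise to the top, which is exactly the behavior you want for queries mixing semantic intent with exact identifiers.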

### Late-interaction and multi-vector

ColBERT-style models produce a vector per token rather than per chunk, then score queries against documents via late interaction. This is typically more accurate than single-vector dense retrieval on complex queries, at the cost of higher storage and more complex indexing. Current open models in this space include `jinaai/jina-colbert-v2`, `answerdotai/answerai-colbert-small-v1`, and newer late-interaction variants such as `lightonai/DenseOn`. LangChain's built-in retrievers target single-vector embeddings; late interaction typically requires a specialist index (Vespa, Qdrant's multi-vector support, or PyLate).
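The late-interaction scoring rule itself (often called MaxSim) is compact: for each query token vector, take the maximum dot product over all document token vectors, then sum across query tokens. A toy sketch with made-up 2-d token vectors (real ColBERT-style models use ~128-d per token):

:::python
```python
def maxsim(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """ColBERT-style late interaction: for each query token vector, take its
    max dot product over the document's token vectors, then sum the maxima."""
    return sum(
        max(sum(q * d for q, d in zip(qv, dv)) for dv in doc_vecs)
        for qv in query_vecs
    )


query = [[1.0, 0.0], [0.0, 1.0]]              # two query token vectors
doc_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]  # three document token vectors
doc_b = [[0.3, 0.3], [0.1, 0.2]]

print(maxsim(query, doc_a), maxsim(query, doc_b))  # doc_a scores higher
```
:::

The storage cost is visible in the sketch: each document keeps one vector per token rather than one per chunk, which is why late interaction needs a specialist index.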

## Starting points

If you just want a working starting point:

- Quick prototype, hosted: `OpenAIEmbeddings(model="text-embedding-3-small")`
- Quick prototype, local, no API key: `HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2", encode_kwargs={"normalize_embeddings": True})`
- Production, hosted, quality-first: `VoyageAIEmbeddings(model="voyage-3")` or `OpenAIEmbeddings(model="text-embedding-3-large")`
- Production, open, quality-first: `HuggingFaceEmbeddings(model_name="BAAI/bge-m3", encode_kwargs={"normalize_embeddings": True})` served via TEI
- Multilingual, open: `HuggingFaceEmbeddings(model_name="intfloat/multilingual-e5-large")` with query and document prompts configured

Measure retrieval quality on your own data, then iterate.

## Related

- [Embedding model integrations](/oss/integrations/embeddings)
- [Sentence Transformers on Hugging Face](/oss/integrations/embeddings/sentence_transformers)
- [Text Embeddings Inference](/oss/integrations/embeddings/text_embeddings_inference)
- [Cross Encoder Reranker](/oss/integrations/document_transformers/cross_encoder_reranker)
- [Retrieval concept guide](/oss/langchain/retrieval)
33 changes: 33 additions & 0 deletions src/oss/langchain/rag.mdx
@@ -9,6 +9,7 @@ import EmbeddingsTabsPy from '/snippets/embeddings-tabs-py.mdx';
import EmbeddingsTabsJS from '/snippets/embeddings-tabs-js.mdx';
import VectorstoreTabsPy from '/snippets/vectorstore-tabs-py.mdx';
import VectorstoreTabsJS from '/snippets/vectorstore-tabs-js.mdx';
import RerankerTabsPy from '/snippets/reranker-tabs-py.mdx';

## Overview

@@ -908,6 +909,38 @@ const agent = createAgent({
</Accordion>


## Improve retrieval with reranking

Vector search returns the top-k chunks by embedding similarity, which is a cheap approximation of relevance. A reranker is a second model (a cross-encoder or ranking API) that scores each `(query, chunk)` pair directly for more accurate ordering. The standard recipe is to retrieve a larger `k` from the vector store (e.g. 20) and then rerank down to the handful of documents you actually pass to the model. In practice, this is one of the highest-impact quality improvements you can make to a RAG pipeline, and with an open-source cross-encoder it runs locally on CPU for free.

:::python
Select a reranker:

<RerankerTabsPy />

Wrap the base retriever with `ContextualCompressionRetriever`:

```python
from langchain_classic.retrievers.contextual_compression import ContextualCompressionRetriever

base_retriever = vector_store.as_retriever(search_kwargs={"k": 20})

compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever,
)

reranked_docs = compression_retriever.invoke("What is task decomposition?")
```

Use `compression_retriever` anywhere you previously used `vector_store.similarity_search`, e.g. in the RAG agent's retrieval tool, or in the RAG chain's `before_model` middleware.
:::
:::js
Rerankers are available in JavaScript via provider-specific integrations (see [Cohere Rerank](/oss/integrations/document_compressors/cohere_rerank) and [Mixedbread AI](/oss/integrations/document_compressors/mixedbread_ai)). The pattern is the same: wrap your base retriever with a document compressor.
:::

See the [Cross Encoder Reranker guide](/oss/integrations/document_transformers/cross_encoder_reranker) for more on local reranking with Hugging Face models.

## Security: indirect prompt injection

<Warning>