4 changes: 3 additions & 1 deletion src/docs.json
@@ -1157,6 +1157,7 @@
]
},
"oss/python/langchain/retrieval",
"oss/python/langchain/choosing-embeddings",
"oss/python/langchain/long-term-memory"
]
},
@@ -1286,7 +1287,8 @@
"oss/python/integrations/splitters/index",
"oss/python/integrations/embeddings/index",
"oss/python/integrations/vectorstores/index",
"oss/python/integrations/document_loaders/index"
"oss/python/integrations/document_loaders/index",
"oss/python/integrations/document_transformers/index"
]
}
]
129 changes: 129 additions & 0 deletions src/oss/langchain/choosing-embeddings.mdx
I'm glad to leave these changes be, but I get lots of questions about choosing embedding models personally, so I think this kind of page can be very helpful.

@@ -0,0 +1,129 @@
---
title: Choose an embedding model
sidebarTitle: Choose an embedding model
description: Practical guidance for selecting a text embedding model for retrieval-augmented applications.
---

Embeddings are the foundation of most retrieval pipelines: documents are encoded as vectors, queries are embedded and compared against those vectors, and retrieval quality is bounded by the quality of that comparison. There is no single "best" embedding model; the right choice depends on your data, your latency and cost budget, and where the embeddings will run.

## Four common deployment patterns

In practice, most teams converge on one of four patterns:

1. Hosted, flagship: OpenAI `text-embedding-3-large`, Cohere `embed-english-v3.0`, Google `gemini-embedding-001`, Voyage `voyage-3`. One API call, best-in-class quality out of the box, no local infrastructure. The trade-offs are per-call cost and a data-egress dependency.
2. Local, open-source: `BAAI/bge-*`, `mixedbread-ai/mxbai-embed-*`, `Qwen/Qwen3-Embedding-*`, `nomic-ai/modernbert-embed-*`, `sentence-transformers/all-*`. Download once, run anywhere. No per-call cost, data never leaves your environment. Likely slower on CPU than a hosted API at small scale; competitive or faster with a GPU.
3. Local, open-source, specialist: a fine-tuned model targeting your specific domain, language, or task. Starting from a strong open base (e.g. `BAAI/bge-m3`) and fine-tuning on even a few thousand in-domain query/document pairs often beats hosted flagships on retrieval accuracy for that domain.
4. Self-hosted at production scale: the same open models (base or fine-tuned) served via [Text Embeddings Inference (TEI)](/oss/integrations/embeddings/text_embeddings_inference) or Ollama. Gives you the economics of local inference with the horizontal scaling and API ergonomics of a hosted provider.

LangChain treats all four the same: you instantiate an `Embeddings` subclass and hand it to your vector store or retriever. Patterns (2) and (3) use `HuggingFaceEmbeddings`; pattern (4) uses `HuggingFaceEndpointEmbeddings` or `OllamaEmbeddings`.

## Factors to weigh

### Quality

Start from the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard). MTEB benchmarks embedding models across retrieval, clustering, classification, and reranking tasks, and is the de facto industry reference. Filter by your language(s) and by task (retrieval is the most common for RAG).

Leaderboard numbers don't always transfer, so run a small evaluation on your own data before committing. LangSmith has tooling for this; see the [evaluation guides](/langsmith/evaluation-concepts).

### Cost

Hosted embeddings typically price in the range of a few cents to ~$0.15 per million tokens. For a corpus embedded once and queried thousands of times a day, cost is often dominated by the query side.

Local inference has zero per-call cost but requires CPU (slow) or GPU (capital or cloud cost). The crossover is workload-dependent: low-volume personal projects are essentially free on CPU; for mid-volume production, a single GPU serving a local model via TEI often beats hosted on unit economics.
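The crossover point is easy to estimate with back-of-envelope arithmetic. The sketch below is illustrative only: the per-token price and GPU hourly rate are assumptions, not quotes, so substitute your provider's actual numbers.

:::python
```python
# Back-of-envelope hosted-vs-GPU cost comparison.
# Illustrative prices (assumptions, not quotes): adjust to your providers.
HOSTED_PRICE_PER_M_TOKENS = 0.13  # USD per million tokens, flagship-class hosted model
GPU_HOURLY = 0.50                 # USD per hour for a modest cloud GPU


def hosted_monthly_cost(tokens_per_day: float) -> float:
    """Hosted API cost for 30 days of query traffic."""
    return tokens_per_day * 30 / 1_000_000 * HOSTED_PRICE_PER_M_TOKENS


def gpu_monthly_cost(hours_per_day: float = 24.0) -> float:
    """Cost of an always-on GPU serving a local model (e.g. via TEI)."""
    return hours_per_day * 30 * GPU_HOURLY


# At ~50M query tokens/day, hosted lands near $195/month under these
# assumptions, while a single always-on GPU is ~$360/month; hosted wins
# here, but the GPU cost is flat as volume grows.
print(hosted_monthly_cost(50_000_000))
print(gpu_monthly_cost())
```
:::

Repeating this calculation with your real traffic and prices usually makes the hosted-versus-self-hosted decision obvious.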

### Latency

Hosted embedding APIs add roughly 50-200ms of network latency per request. Local models on CPU take 10-100ms for a short query with a small model (`all-MiniLM-L6-v2`-class), and 50-500ms for larger models. On GPU, local inference is typically faster than a round-trip to a hosted API.

For batch indexing, latency per request matters less than throughput. TEI and multi-process local inference batch aggressively. Consider e.g. `encode_kwargs={"batch_size": 64}` or higher on `HuggingFaceEmbeddings` when running on GPU.

### Dimensionality

Embedding dimension affects vector store storage and query compute. Typical sizes:

- 384 (small Sentence Transformers models, `all-MiniLM-L6-v2`)
- 768 (mid-size ST models, `all-mpnet-base-v2`, `bge-base`)
- 1024 (`bge-large`, Cohere v3, Voyage, Qwen3-Embedding-0.6B)
- 1536 (OpenAI `text-embedding-3-small`)
- 3072+ (OpenAI `text-embedding-3-large`, Qwen3-Embedding-8B)

Larger vectors are usually more accurate but consume more storage and query compute. Several modern models (OpenAI `text-embedding-3-*`, `mixedbread-ai/mxbai-embed-large-v1`, Matryoshka-trained ST models, Qwen3-Embedding) support **truncation**: slice the vector to a smaller dimension with graceful quality degradation. Useful for fitting more vectors into a smaller index.
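The mechanics of Matryoshka-style truncation are simple: keep the first `dim` components and re-normalize, after which cosine similarity works as usual. A minimal stdlib sketch (the toy 8-d vectors are made up for illustration; real embeddings are hundreds to thousands of dimensions):

:::python
```python
import math


def truncate_and_normalize(vec: list[float], dim: int) -> list[float]:
    """Slice a Matryoshka-trained embedding to its first `dim` components
    and re-normalize to unit length so cosine similarity stays meaningful."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]


def cosine(a: list[float], b: list[float]) -> float:
    # For unit vectors, the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


# Toy 8-d "embeddings", truncated to 4 dims: half the storage per vector.
doc = truncate_and_normalize([0.9, 0.1, 0.3, 0.2, 0.05, 0.04, 0.01, 0.02], 4)
query = truncate_and_normalize([0.8, 0.2, 0.25, 0.1, 0.1, 0.0, 0.0, 0.05], 4)
print(cosine(doc, query))
```
:::

Note this only degrades gracefully for models trained with Matryoshka-style objectives; naively truncating an arbitrary model's vectors can hurt quality badly.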

### Context length

Most classic embedding models cap out at 512 tokens (`all-mpnet-base-v2`, classic BGE). Newer models support longer contexts:

- `nomic-ai/modernbert-embed-base`: 8192 tokens
- `Alibaba-NLP/gte-multilingual-base`: 8192 tokens
- `BAAI/bge-m3`: 8192 tokens
- OpenAI `text-embedding-3-*`: 8191 tokens

If your chunks are long (full-page technical docs, legal paragraphs), prefer long-context models. For short chunks the 512-token limit is rarely binding.

### Multilingual support

For multilingual retrieval, pick a model trained on your languages. Strong defaults:

- Open: `BAAI/bge-m3`, `intfloat/multilingual-e5-*`, `Alibaba-NLP/gte-multilingual-*`, `Qwen/Qwen3-Embedding-*` (via `HuggingFaceEmbeddings`)
- Hosted: Cohere `embed-multilingual-v3.0`, OpenAI `text-embedding-3-*`

### Query and document prompts

Several modern open models (E5, BGE, Qwen3-Embedding, GTE) are trained with different text prefixes for queries versus documents. Using the wrong prefix at query time is a common quality regression. When using `HuggingFaceEmbeddings`, pass prompts explicitly:

:::python
```python
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="intfloat/e5-large-v2",
    encode_kwargs={"prompt": "passage: "},
    query_encode_kwargs={"prompt": "query: "},
)
```
:::
:::js
The `query_encode_kwargs` API shown above is Python-specific. In JavaScript, hosted providers (OpenAI, Cohere, Google) handle query and document prompts internally; for local inference, prepend the prefix manually to your queries and documents before embedding.
:::

Check each model's card on Hugging Face for the recommended prompt strings.

### Licensing

Most popular open embedding models are permissively licensed (Apache 2.0, MIT). A few recent specialist models require a commercial license for production use. Check each model's license before shipping.

## Beyond single-vector dense embeddings

A single dense vector per chunk is the default, but not the only option.

### Rerankers

Vector search returns the top-k most similar chunks by embedding distance, a cheap approximation of relevance. A reranker scores each `(query, chunk)` pair directly using a cross-encoder, producing more accurate ordering at the cost of one extra inference per chunk. Retrieving the top-20 via embeddings and reranking down to the top-5 is one of the highest-impact quality improvements you can make. See [Cross Encoder Reranker](/oss/integrations/document_transformers/cross_encoder_reranker) for the local open-source path, or [Cohere Rerank](/oss/integrations/retrievers/cohere-reranker) for the hosted equivalent.
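The retrieve-then-rerank shape is independent of which reranker you use. The sketch below shows the pattern with a deliberately naive stand-in scorer (plain token overlap, a made-up placeholder); in practice you would swap in a real cross-encoder's score function:

:::python
```python
def rerank(query: str, chunks: list[str], score_fn, top_n: int = 5) -> list[str]:
    """Score each (query, chunk) pair and keep the top_n highest-scoring chunks."""
    return sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)[:top_n]


def overlap_score(query: str, chunk: str) -> float:
    # Stand-in scorer: fraction of query tokens appearing in the chunk.
    # A real cross-encoder would replace this with a learned relevance score.
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / len(q)


candidates = [  # imagine these are the top-20 returned by vector search
    "Task decomposition breaks a goal into smaller steps.",
    "Embeddings map text to vectors.",
    "Agents use task decomposition to plan.",
]
print(rerank("what is task decomposition", candidates, overlap_score, top_n=2))
```
:::

The key design point: the expensive per-pair scoring runs only on the small candidate set the cheap vector search already produced.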

### Sparse and hybrid retrieval

Dense embeddings don't handle exact-match queries (product codes, named entities, code identifiers) as well as keyword-based indexes. Hybrid retrieval combines a dense index with BM25 or a sparse neural index (SPLADE, `BAAI/bge-m3`'s sparse output) to cover both cases.
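One common way to combine the two result lists is reciprocal rank fusion (RRF), which needs only the rank positions, not comparable scores. A minimal sketch (the document IDs are made up; `k=60` is the conventional smoothing constant from the RRF literature):

:::python
```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists (e.g. dense + BM25) by summing 1/(k + rank)
    for each document across all lists, then sorting by the fused score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


dense = ["doc_a", "doc_b", "doc_c"]   # semantic matches from the vector index
sparse = ["doc_c", "doc_d", "doc_a"]  # keyword matches (e.g. an exact SKU hit)
print(reciprocal_rank_fusion([dense, sparse]))
```
:::

Documents ranked well by both indexes (here `doc_a` and `doc_c`) rise to the top, which is exactly the behavior you want for queries mixing semantic intent with exact identifiers.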

### Late-interaction and multi-vector

ColBERT-style models produce a vector per token rather than per chunk, then score queries against documents via late interaction. This is typically more accurate than single-vector dense retrieval on complex queries, at the cost of higher storage and more complex indexing. Current open models in this space include `jinaai/jina-colbert-v2`, `answerdotai/answerai-colbert-small-v1`, and newer late-interaction variants such as `lightonai/DenseOn`. LangChain's built-in retrievers target single-vector embeddings; late interaction typically requires a specialist index (Vespa, Qdrant's multi-vector support, or PyLate).
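The late-interaction scoring rule itself (often called MaxSim) is compact: for each query token vector, take the maximum dot product over all document token vectors, then sum across query tokens. A toy sketch with made-up 2-d token vectors (real ColBERT-style models use ~128-d per token):

:::python
```python
def maxsim(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """ColBERT-style late interaction: for each query token vector, take its
    max dot product over the document's token vectors, then sum the maxima."""
    return sum(
        max(sum(q * d for q, d in zip(qv, dv)) for dv in doc_vecs)
        for qv in query_vecs
    )


query = [[1.0, 0.0], [0.0, 1.0]]              # two query token vectors
doc_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]  # three document token vectors
doc_b = [[0.3, 0.3], [0.1, 0.2]]

print(maxsim(query, doc_a), maxsim(query, doc_b))  # doc_a scores higher
```
:::

The storage cost is visible in the sketch: each document keeps one vector per token rather than one per chunk, which is why late interaction needs a specialist index.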

## Starting points

If you just want a working starting point:

- Quick prototype, hosted: `OpenAIEmbeddings(model="text-embedding-3-small")`
- Quick prototype, local, no API key: `HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2", encode_kwargs={"normalize_embeddings": True})`
- Production, hosted, quality-first: `VoyageAIEmbeddings(model="voyage-3")` or `OpenAIEmbeddings(model="text-embedding-3-large")`
- Production, open, quality-first: `HuggingFaceEmbeddings(model_name="BAAI/bge-m3", encode_kwargs={"normalize_embeddings": True})` served via TEI
- Multilingual, open: `HuggingFaceEmbeddings(model_name="intfloat/multilingual-e5-large")` with query and document prompts configured

Measure retrieval quality on your own data, then iterate.

## Related

- [Embedding model integrations](/oss/integrations/embeddings)
- [Sentence Transformers on Hugging Face](/oss/integrations/embeddings/sentence_transformers)
- [Text Embeddings Inference](/oss/integrations/embeddings/text_embeddings_inference)
- [Cross Encoder Reranker](/oss/integrations/document_transformers/cross_encoder_reranker)
- [Retrieval concept guide](/oss/langchain/retrieval)
33 changes: 33 additions & 0 deletions src/oss/langchain/rag.mdx
@@ -9,6 +9,7 @@ import EmbeddingsTabsPy from '/snippets/embeddings-tabs-py.mdx';
import EmbeddingsTabsJS from '/snippets/embeddings-tabs-js.mdx';
import VectorstoreTabsPy from '/snippets/vectorstore-tabs-py.mdx';
import VectorstoreTabsJS from '/snippets/vectorstore-tabs-js.mdx';
import RerankerTabsPy from '/snippets/reranker-tabs-py.mdx';

## Overview

@@ -908,6 +909,38 @@ const agent = createAgent({
</Accordion>


## Improve retrieval with reranking

Vector search returns the top-k chunks by embedding similarity, which is a cheap approximation of relevance. A reranker is a second model (a cross-encoder or ranking API) that scores each `(query, chunk)` pair directly for more accurate ordering. The standard recipe is to retrieve a larger `k` from the vector store (e.g. 20) and then rerank down to the handful of documents you actually pass to the model. In practice, this is one of the highest-impact quality improvements you can make to a RAG pipeline, and with an open-source cross-encoder it runs locally on CPU for free.

:::python
Select a reranker:

<RerankerTabsPy />

Wrap the base retriever with `ContextualCompressionRetriever`:

```python
from langchain_classic.retrievers.contextual_compression import ContextualCompressionRetriever

base_retriever = vector_store.as_retriever(search_kwargs={"k": 20})

compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever,
)

reranked_docs = compression_retriever.invoke("What is task decomposition?")
```

Use `compression_retriever` anywhere you previously used `vector_store.similarity_search`, e.g. in the RAG agent's retrieval tool, or in the RAG chain's `before_model` middleware.
:::
:::js
Rerankers are available in JavaScript via provider-specific integrations (see [Cohere Rerank](/oss/integrations/document_compressors/cohere_rerank) and [Mixedbread AI](/oss/integrations/document_compressors/mixedbread_ai)). The pattern is the same: wrap your base retriever with a document compressor.
:::

See the [Cross Encoder Reranker guide](/oss/integrations/document_transformers/cross_encoder_reranker) for more on local reranking with Hugging Face models.

## Security: indirect prompt injection

<Warning>