# Surface rerankers and modernize Hugging Face integration pages #3673
Tom Aarsen (tomaarsen) wants to merge 2 commits into `langchain-ai:main` from `tomaarsen:docs/hf-docs-and-rerankers`.

---
title: Choose an embedding model
sidebarTitle: Choose an embedding model
description: Practical guidance for selecting a text embedding model for retrieval-augmented applications.
---

Embeddings are the foundation of most retrieval pipelines: documents are stored as vectors, queries are compared against those vectors, and retrieval quality is bounded by the quality of that comparison. There is no single "best" embedding model; the right choice depends on your data, your latency and cost budget, and where the embeddings will run.

## Four common deployment patterns

In practice, most teams converge on one of four patterns:

1. **Hosted flagship**: OpenAI `text-embedding-3-large`, Cohere `embed-english-v3`, Google `gemini-embedding-001`, Voyage `voyage-3`. One API call, best-in-class quality out of the box, no local infrastructure. The trade-offs are per-call cost and a data-egress dependency.
2. **Local open source**: `BAAI/bge-*`, `mixedbread-ai/mxbai-embed-*`, `Qwen/Qwen3-Embedding-*`, `nomic-ai/modernbert-embed-*`, `sentence-transformers/all-*`. Download once, run anywhere. No per-call cost, and data never leaves your environment. Likely slower on CPU than a hosted API at small scale; competitive or faster with a GPU.
3. **Local open-source specialist**: a fine-tuned model targeting your specific domain, language, or task. Starting from a strong open base (e.g. `BAAI/bge-m3`) and fine-tuning on even a few thousand in-domain query/document pairs often beats hosted flagships on retrieval accuracy for that domain.
4. **Self-hosted at production scale**: the same open models (base or fine-tuned) served via [Text Embeddings Inference (TEI)](/oss/integrations/embeddings/text_embeddings_inference) or Ollama. This gives you the economics of local inference with the horizontal scaling and API ergonomics of a hosted provider.

LangChain treats all four the same: you instantiate an `Embeddings` subclass and hand it to your vector store or retriever. Patterns (2) and (3) use `HuggingFaceEmbeddings`; pattern (4) uses `HuggingFaceEndpointEmbeddings` or `OllamaEmbeddings`.

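That shared interface is small enough to sketch. The `FakeEmbeddings` class below is a hypothetical stand-in, not a real provider, but it shows the two methods (`embed_documents`, `embed_query`) that every pattern above ultimately implements behind the same contract:

```python
from typing import List


class FakeEmbeddings:
    """Toy stand-in for an `Embeddings` implementation (illustrative only)."""

    def __init__(self, dim: int = 4):
        self.dim = dim

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Real implementations batch-call a model or a hosted API here.
        return [self._embed(t) for t in texts]

    def embed_query(self, text: str) -> List[float]:
        return self._embed(text)

    def _embed(self, text: str) -> List[float]:
        # Deterministic toy vector; unit-normalized like most real models.
        vals = [float((sum(ord(c) for c in text) + i) % 97 + 1) for i in range(self.dim)]
        norm = sum(v * v for v in vals) ** 0.5
        return [v / norm for v in vals]


emb = FakeEmbeddings()
doc_vectors = emb.embed_documents(["first chunk", "second chunk"])
query_vector = emb.embed_query("a question")
```

Because vector stores and retrievers only depend on these two methods, swapping a hosted provider for a local model is a one-line change at construction time.
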
## Factors to weigh

### Quality

Start from the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard). MTEB benchmarks embedding models across retrieval, clustering, classification, and reranking tasks, and is the de-facto industry reference. Filter by your language(s) and by task (retrieval is the most common for RAG).

Leaderboard numbers don't always transfer, so run a small evaluation on your own data before committing. LangSmith has tooling for this; see the [evaluation guides](/langsmith/evaluation-concepts).

### Cost

Hosted embeddings typically price in the range of a few cents to ~$0.15 per million tokens. For a corpus embedded once and queried thousands of times a day, cost is often dominated by the query side.

Local inference has zero per-call cost but requires CPU (slow) or GPU (capital or cloud cost). The crossover is workload-dependent: low-volume personal projects are essentially free on CPU; for mid-volume production, a single GPU serving a local model via TEI often beats hosted on unit economics.

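One way to locate that crossover for your own workload is simple arithmetic. The prices and volumes below are illustrative assumptions, not any provider's actual rates:

```python
# Back-of-the-envelope cost comparison. All numbers are assumptions
# for illustration, not quotes from any price sheet.
HOSTED_PRICE_PER_M_TOKENS = 0.10   # dollars per million tokens (assumed)
GPU_HOURLY = 0.60                  # cloud GPU rental, dollars/hour (assumed)

corpus_tokens = 500_000_000        # one-off indexing of a 500M-token corpus
queries_per_day = 50_000
tokens_per_query = 30

index_cost_hosted = corpus_tokens / 1e6 * HOSTED_PRICE_PER_M_TOKENS
daily_query_cost_hosted = queries_per_day * tokens_per_query / 1e6 * HOSTED_PRICE_PER_M_TOKENS
yearly_hosted = index_cost_hosted + 365 * daily_query_cost_hosted

# A single always-on GPU serving a local model via TEI:
yearly_gpu = 24 * 365 * GPU_HOURLY

print(f"hosted: ${yearly_hosted:,.0f}/yr  gpu: ${yearly_gpu:,.0f}/yr")
```

With these illustrative numbers the hosted option wins comfortably; scale `queries_per_day` up by a couple of orders of magnitude and the always-on GPU comes out ahead, which is the workload-dependent crossover described above.
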
### Latency

Hosted embedding APIs add roughly 50-200ms of network latency per request. Local models on CPU take 10-100ms for a short query with a small model (`all-MiniLM-L6-v2`-class), and 50-500ms for larger models. On GPU, local inference is typically faster than a round-trip to a hosted API.

For batch indexing, per-request latency matters less than throughput. TEI and multi-process local inference batch aggressively. Consider setting `encode_kwargs={"batch_size": 64}` or higher on `HuggingFaceEmbeddings` when running on GPU.

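For backends without a built-in batch knob, the same idea can be applied client-side. This generic helper (the names `embed_in_batches` and `embed_fn` are ours, not a LangChain API) chunks a corpus into fixed-size batches so each forward pass or API call handles many texts at once:

```python
from typing import Callable, List


def embed_in_batches(
    texts: List[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 64,
) -> List[List[float]]:
    """Embed `texts` in fixed-size batches; `embed_fn` is any batch embedder."""
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_fn(batch))  # one forward pass / API call per batch
    return vectors


# Toy embed function for demonstration: vector = [length of the text].
texts = [f"chunk {i}" for i in range(150)]
out = embed_in_batches(texts, lambda b: [[float(len(t))] for t in b], batch_size=64)
```

150 texts with `batch_size=64` means three underlying calls instead of 150, which is where the throughput win comes from.
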
### Dimensionality

Embedding dimension affects vector store storage and query compute. Typical sizes:

- 384 (small Sentence Transformers models, `all-MiniLM-L6-v2`)
- 768 (mid-size ST models, `all-mpnet-base-v2`, `bge-base`)
- 1024 (`bge-large`, Cohere v3, Voyage)
- 1536 (OpenAI `text-embedding-3-small`, Qwen3-Embedding-0.6B)
- 3072+ (OpenAI `text-embedding-3-large`, Qwen3-Embedding-4B/8B)

Larger vectors are usually more accurate but consume more storage and query compute. Several modern models (OpenAI `text-embedding-3-*`, `mixedbread-ai/mxbai-embed-large-v1`, Matryoshka-trained ST models, Qwen3-Embedding) support **truncation**: slice the vector to a smaller dimension with graceful quality degradation. Useful for fitting more vectors into a smaller index.

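Truncation itself is just a slice followed by re-normalization, as sketched below; the graceful quality degradation only holds for models trained for it (Matryoshka-style):

```python
import math
from typing import List


def truncate(vec: List[float], dim: int) -> List[float]:
    """Keep the first `dim` components, then re-normalize to unit length
    so cosine similarity still behaves sensibly."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]


full = [0.5, 0.5, 0.5, 0.5]   # unit-norm 4-d toy vector
small = truncate(full, 2)     # 2-d, re-normalized
```

Storing the truncated vectors halves (or better) index size; quality on a Matryoshka-trained model degrades smoothly rather than collapsing.
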
### Context length

Most classic embedding models cap out at 512 tokens (`all-mpnet-base-v2`, classic BGE). Newer models support longer contexts:

- `nomic-ai/modernbert-embed-base`: 8192 tokens
- `Alibaba-NLP/gte-multilingual-base`: 8192 tokens
- `BAAI/bge-m3`: 8192 tokens
- OpenAI `text-embedding-3-*`: 8191 tokens

If your chunks are long (full-page technical docs, legal paragraphs), prefer long-context models. For short chunks the 512-token limit is rarely binding.

### Multilingual support

For multilingual retrieval, pick a model trained on your languages. Strong defaults:

- Open: `BAAI/bge-m3`, `intfloat/multilingual-e5-*`, `Alibaba-NLP/gte-multilingual-*`, `Qwen/Qwen3-Embedding-*` (via `HuggingFaceEmbeddings`)
- Hosted: Cohere `embed-multilingual-v3`, OpenAI `text-embedding-3-*`

### Query and document prompts

Several modern open models (E5, BGE, Qwen3-Embedding, GTE) are trained with different text prefixes for queries versus documents. Using the wrong prefix at query time is a common quality regression. When using `HuggingFaceEmbeddings`, pass prompts explicitly:

:::python
```python
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="intfloat/e5-large-v2",
    encode_kwargs={"prompt": "passage: "},
    query_encode_kwargs={"prompt": "query: "},
)
```
:::
:::js
The `query_encode_kwargs` API shown above is Python-specific. In JavaScript, hosted providers (OpenAI, Cohere, Google) handle query and document prompts internally; for local inference, prepend the prefix manually to your queries and documents before embedding.
:::

Check each model's card on Hugging Face for the recommended prompt strings.

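When the stack has no prompt support (the manual path mentioned for JavaScript above), a thin wrapper suffices. `PrefixedEmbedder` and `embed_fn` are illustrative names, not a library API; the `"query: "`/`"passage: "` strings are the prefixes the E5 model cards document:

```python
from typing import Callable, List


class PrefixedEmbedder:
    """Prepend the model's documented query/document prefixes before embedding.
    Illustrative sketch, not a LangChain class."""

    def __init__(self, embed_fn: Callable[[List[str]], List[List[float]]],
                 query_prefix: str = "query: ", doc_prefix: str = "passage: "):
        self.embed_fn = embed_fn
        self.query_prefix = query_prefix
        self.doc_prefix = doc_prefix

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return self.embed_fn([self.doc_prefix + t for t in texts])

    def embed_query(self, text: str) -> List[float]:
        return self.embed_fn([self.query_prefix + text])[0]


# Stub embedder that records exactly what text reaches the model.
seen: List[str] = []

def recording_embed(batch: List[str]) -> List[List[float]]:
    seen.extend(batch)
    return [[0.0]] * len(batch)

emb = PrefixedEmbedder(recording_embed)
emb.embed_documents(["doc one"])
emb.embed_query("what is RAG?")
```

The recording stub makes the failure mode visible: if both paths sent unprefixed text, the model would silently score queries against documents in the wrong "mode".
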
### Licensing

Most popular open embedding models are permissively licensed (Apache 2.0, MIT). A few recent specialist models require a commercial license for production use. Check each model's license before shipping.

## Beyond single-vector dense embeddings

A single dense vector per chunk is the default, but not the only option.

### Rerankers

Vector search returns the top-k most similar chunks by embedding distance, a cheap approximation of relevance. A reranker scores each `(query, chunk)` pair directly using a cross-encoder, producing a more accurate ordering at the cost of one extra inference per chunk. Retrieving the top 20 via embeddings and then reranking down to the top 5 is one of the highest-impact quality improvements you can make. See [Cross Encoder Reranker](/oss/integrations/document_transformers/cross_encoder_reranker) for the local open-source path, or [Cohere Rerank](/oss/integrations/retrievers/cohere-reranker) for the hosted equivalent.

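The retrieve-then-rerank step itself is a small amount of glue. In the sketch below the cross-encoder is replaced by a keyword-overlap stub (`overlap_score` is purely illustrative); in practice `score_fn` would call a real cross-encoder:

```python
from typing import Callable, List


def rerank(query: str, candidates: List[str],
           score_fn: Callable[[str, str], float], top_n: int = 5) -> List[str]:
    """Re-order the retriever's candidates by direct (query, chunk) scores."""
    scored = [(score_fn(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]


def overlap_score(query: str, chunk: str) -> float:
    # Toy relevance: fraction of query words appearing in the chunk.
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / (len(q) or 1)


candidates = ["embedding models compared", "pasta recipes",
              "how embedding dimensions affect retrieval"]
best = rerank("embedding retrieval quality", candidates, overlap_score, top_n=2)
```

The vector store supplies `candidates` (the top 20 by embedding distance); the reranker only ever sees that short list, which is why the extra per-chunk inference stays affordable.
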
### Sparse and hybrid retrieval

Dense embeddings don't handle exact-match queries (product codes, named entities, code identifiers) as well as keyword-based indexes. Hybrid retrieval combines a dense index with BM25 or a sparse neural index (SPLADE, `BAAI/bge-m3`'s sparse output) to cover both cases.

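One common way to fuse the two result lists is Reciprocal Rank Fusion (RRF), which needs only the ranks, not the raw scores. A minimal sketch, using the conventional `k = 60` smoothing constant:

```python
from collections import defaultdict
from typing import Dict, List


def rrf(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Reciprocal Rank Fusion: each list contributes 1/(k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


dense = ["d1", "d3", "d2"]    # ranking from the embedding index
sparse = ["d2", "d1", "d4"]   # ranking from BM25
fused = rrf([dense, sparse])
```

Documents ranked highly by both indexes (`d1`, `d2` here) float to the top, while documents seen by only one index still survive into the fused list.
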
### Late-interaction and multi-vector

ColBERT-style models produce a vector per token rather than per chunk, then score queries against documents via late interaction. This is typically more accurate than single-vector dense retrieval on complex queries, at the cost of higher storage and more complex indexing. Current open models in this space include `jinaai/jina-colbert-v2`, `answerdotai/answerai-colbert-small-v1`, and newer late-interaction variants such as `lightonai/LateOn`. LangChain's built-in retrievers target single-vector embeddings; late interaction typically requires a specialist index (Vespa, Qdrant's multi-vector support, or PyLate).

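The late-interaction score itself (ColBERT's MaxSim) is easy to state: each query token vector contributes the similarity of its best-matching document token vector. A toy sketch with hand-written 2-d vectors:

```python
from typing import List

Vec = List[float]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs: List[Vec], doc_vecs: List[Vec]) -> float:
    """ColBERT-style MaxSim: sum over query tokens of the best doc-token match."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


query = [[1.0, 0.0], [0.0, 1.0]]    # two query token vectors
doc_a = [[1.0, 0.0], [0.7, 0.7]]    # tokens that match both query tokens
doc_b = [[0.0, -1.0], [-1.0, 0.0]]  # tokens that match neither

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
```

Because every document token vector must be stored and compared, the storage and indexing overhead mentioned above scales with token count rather than chunk count.
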
## Starting points

If you just want a working starting point:

- Quick prototype, hosted: `OpenAIEmbeddings(model="text-embedding-3-small")`
- Quick prototype, local, no API key: `HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2", encode_kwargs={"normalize_embeddings": True})`
- Production, hosted, quality-first: `VoyageAIEmbeddings(model="voyage-3")` or `OpenAIEmbeddings(model="text-embedding-3-large")`
- Production, open, quality-first: `HuggingFaceEmbeddings(model_name="BAAI/bge-m3", encode_kwargs={"normalize_embeddings": True})` served via TEI
- Multilingual, open: `HuggingFaceEmbeddings(model_name="intfloat/multilingual-e5-large")` with query and document prompts configured

Measure retrieval quality on your own data, then iterate.

|
|
||
| ## Related | ||
|
|
||
| - [Embedding model integrations](/oss/integrations/embeddings) | ||
| - [Sentence Transformers on Hugging Face](/oss/integrations/embeddings/sentence_transformers) | ||
| - [Text Embeddings Inference](/oss/integrations/embeddings/text_embeddings_inference) | ||
| - [Cross Encoder Reranker](/oss/integrations/document_transformers/cross_encoder_reranker) | ||
| - [Retrieval concept guide](/oss/langchain/retrieval) |
I'm glad to leave these changes be, but I get lots of questions about choosing embedding models personally, so I think this kind of page can be very helpful.