This project demonstrates the complete FastEmbed + Qdrant retrieval stack through a modular, well-organized structure. Each FastEmbed method has its own dedicated folder with detailed documentation and interactive demos. It is based on the comprehensive Qdrant FastEmbed documentation.
Below is a practical map of embedding types (dense, sparse, multi-vector), advanced retrieval (miniCOIL, SPLADE, ColBERT), and reranking, all using the FastEmbed + Qdrant tooling.
Overview: Encode each text as one vector. Great first-stage retriever: fast, compact, easy to scale. FastEmbed focuses on speed (quantized ONNX models, CPU-first).
Key concepts (quick bullets)
- One vector per item → cosine/dot similarity for k-NN search.
- Throughput: ONNX Runtime + quantization → strong CPU performance.
- Use cases: semantic search; first pass for multi-stage pipelines.
Details: Dense embeddings excel when wording varies (semantic matches) but they can miss exact keywords/IDs. In Qdrant you store these as standard vectors and query via similarity; later, you can fuse with sparse signals (RRF/DBSF) to keep both "meaning" and "must-have words."
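A minimal sketch of this first-stage dense path, using the BAAI/bge-small-en-v1.5 model from this repo's demos (the sample texts and cosine helper are illustrative):

```python
import numpy as np
from fastembed import TextEmbedding

# 384-dim dense model used throughout this repo's demos.
model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")

docs = ["Qdrant is a vector database.", "FastEmbed creates embeddings on CPU."]
vectors = list(model.embed(docs))  # embed() yields numpy arrays, one per text

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity = dot product of L2-normalized vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(vectors[0].shape)                # (384,)
print(cosine(vectors[0], vectors[1]))  # semantic similarity of the two docs
```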
Overview: Encode text into a very high-dimensional sparse vector (most entries zero). Scores behave like learned keyword search: interpretable and great for exact terms, IDs, formulas, and acronyms. Qdrant stores them natively (indices + values).
Key concepts (quick bullets)
- Vocab-indexed space (e.g., BERT WordPiece IDs).
- Interpretable: each non-zero weight maps to a token/term.
- Hybrid-friendly: combine with dense results to cover both semantics and exact matches.
Details:
- SPLADE (via FastEmbed): learns sparse vectors that often outperform BM25 and remain interpretable. FastEmbed exposes models like prithivida/Splade_PP_en_v1 (an Apache-2.0 variant), returning (indices, values) pairs you can store directly in Qdrant. Typical corpora end up with tens to low hundreds of non-zeros per item, despite a ~30k vocabulary.
- miniCOIL (via FastEmbed/Qdrant): a lightweight sparse neural retriever; think "BM25 that understands meaning." It builds a bag-of-stems weighted by BM25, but each term gets a small semantic embedding so matches are context-aware (e.g., "vector" in medicine vs. graphics). In Qdrant you enable the IDF modifier and supply the corpus average document length; inference and upload can be handled transparently via the client.
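For a concrete feel of the sparse output, here is a small sketch with FastEmbed's SPLADE model (the sample text is illustrative):

```python
from fastembed import SparseTextEmbedding
from qdrant_client import models

sparse_model = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")

# embed() yields SparseEmbedding objects carrying .indices / .values numpy arrays.
emb = next(sparse_model.embed(["vector search with exact keyword guarantees"]))
print(len(emb.indices))  # typically tens of non-zeros out of a ~30k vocab space

# Convert to Qdrant's native sparse format for upsert or querying.
qdrant_vec = models.SparseVector(
    indices=emb.indices.tolist(),
    values=emb.values.tolist(),
)
```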
Overview: Models like ColBERT output one vector per token (a matrix per text). At query time, they compare query tokens against document tokens using MaxSim and aggregate to a relevance score, capturing fine-grained matches (names, entities, snippets).
Key concepts (quick bullets)
- Token-level vectors (e.g., 96–128 dims each).
- MaxSim late-interaction scoring at query time.
- Trade-off: much better matching granularity, but more memory/compute; often used as a reranker for top-K.
Details: Qdrant supports multivector collections and MAX_SIM comparators, so you can store ColBERT outputs as matrices and score with late interaction natively. The Qdrant docs recommend using ColBERT primarily as a reranker for 100–500 dense/sparse candidates in production for speed.
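To make MaxSim concrete, here is a plain numpy sketch of the late-interaction score (toy random matrices stand in for real token embeddings; vectors are assumed L2-normalized so the dot product acts as cosine):

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    # query_tokens: (num_q, dim); doc_tokens: (num_d, dim), both L2-normalized.
    sim = query_tokens @ doc_tokens.T    # pairwise token similarities (num_q, num_d)
    return float(sim.max(axis=1).sum())  # best doc token per query token, summed

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 128))   # e.g., 5 query tokens, dim=128 (like colbertv2.0)
d = rng.normal(size=(40, 128))  # e.g., 40 document tokens
q /= np.linalg.norm(q, axis=1, keepdims=True)
d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim_score(q, d))
```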
Overview: Sparse neural retrieval that augments BM25 with term semantics. Best when exact keyword presence is required, but you want context-aware ranking (e.g., "vector control" in public health vs graphics).
Important concepts
- BM25-based scoring × semantic similarity between matched terms.
- Needs IDF modifier and avg document length (BM25 ingredients) when creating the Qdrant collection.
Details: The HF model card notes that miniCOIL creates small (4-dim) meaning embeddings per stem, combined into a sparse bag-of-words and weighted by BM25, so ranking becomes both term-anchored and disambiguation-aware. Qdrant's example shows end-to-end ingestion and querying.
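A hedged sketch of the Qdrant side, following the linked miniCOIL example: the IDF modifier is set on the sparse vector config, and the client's local-inference Document objects run FastEmbed under the hood (the avg_len option, collection name, and sample texts are assumptions based on that example):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# IDF is a BM25 ingredient: Qdrant must compute it server-side for miniCOIL scoring.
client.create_collection(
    "docs_minicoil",
    sparse_vectors_config={
        "minicoil": models.SparseVectorParams(modifier=models.Modifier.IDF)
    },
)

# Local inference: the client embeds via FastEmbed transparently.
# avg_len = average document length of the corpus (another BM25 ingredient).
client.upsert(
    "docs_minicoil",
    points=[models.PointStruct(
        id=1,
        vector={"minicoil": models.Document(
            text="Vector control strategies in public health",
            model="Qdrant/minicoil-v1",
            options={"avg_len": 7.0},  # assumed option name, per the Qdrant example
        )},
        payload={"text": "Vector control strategies in public health"},
    )],
)

hits = client.query_points(
    "docs_minicoil",
    query=models.Document(text="vector control", model="Qdrant/minicoil-v1"),
    using="minicoil",
    limit=5,
)
```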
Overview: Learns sparse vocab-space vectors that are efficient, interpretable, and often beat BM25. Excellent for large-scale retrieval, log analysis, tech docs, and anything that benefits from seeing important terms directly.
Important concepts
- Expansion in vocab space: the model assigns weights to vocab entries a text implies (not just explicit words).
- Compact at query time: only tens of non-zeros per query on average; works well with ANN-like sparse indexing.
Details: FastEmbed exposes SPLADE++ as Apache-licensed models; the tutorial shows listing models, embedding, and getting (indices, values) for Qdrant. This makes SPLADE trivial to adopt alongside dense vectors for hybrid search.
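Since every non-zero dimension maps to a vocabulary entry, inspecting the heaviest dimensions shows what the model considers important, including expansion terms the text never contains. A small sketch (decoding IDs to strings would need the model's tokenizer and is omitted here):

```python
from fastembed import SparseTextEmbedding

model = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")
emb = next(model.embed(["transformer models for code search"]))

# Each non-zero dimension is a WordPiece vocab ID; its weight is a learned importance.
top = sorted(
    zip(emb.indices.tolist(), emb.values.tolist()),
    key=lambda pair: pair[1],
    reverse=True,
)[:10]
for token_id, weight in top:
    print(token_id, round(weight, 3))
```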
Overview: Late-interaction retriever using token embeddings and MaxSim aggregation. Dramatically improves matching granularity (entities, numbers, code identifiers), and is well-supported in Qdrant via MultiVectorConfig.
Important concepts
- Independent encoding (documents and queries separately).
- MaxSim: for each query token, take the max similarity over document tokens; aggregate.
- Use as reranker for efficiency on large corpora.
Details: FastEmbed provides LateInteractionTextEmbedding with ready ColBERT models (e.g., colbert-ir/colbertv2.0 with dim=128). Qdrant lets you configure MAX_SIM as the multivector comparator and store the matrices directly.
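A sketch of storing and querying ColBERT matrices in Qdrant (collection name and sample texts are illustrative; note that queries use query_embed, since ColBERT encodes queries differently from documents):

```python
from qdrant_client import QdrantClient, models
from fastembed import LateInteractionTextEmbedding

late = LateInteractionTextEmbedding(model_name="colbert-ir/colbertv2.0")
client = QdrantClient(":memory:")

client.create_collection(
    "docs_colbert",
    vectors_config=models.VectorParams(
        size=128,  # colbertv2.0 token dimension
        distance=models.Distance.COSINE,
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM
        ),
    ),
)

# Documents: embed() yields one (num_tokens, 128) matrix per text.
doc_matrix = next(late.embed(["Qdrant scores multivectors with MaxSim natively."]))
client.upsert("docs_colbert", points=[models.PointStruct(id=1, vector=doc_matrix.tolist())])

# Queries: query_embed() applies ColBERT's query-side encoding.
query_matrix = next(late.query_embed("native late interaction support"))
hits = client.query_points("docs_colbert", query=query_matrix.tolist(), limit=5)
```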
Why rerank? First-stage retrieval (dense/sparse) is cheap and recall-oriented. Reranking spends more compute on a small candidate set (e.g., top-100) to sharpen precision. Three common options in this stack:
- Cross-encoder rerankers (e.g., Jina Reranker v2 in FastEmbed)
  - Take [query, document] together and output a relevance score (0–1).
  - Very accurate but the most expensive per pair; ideal as a final pass on a small K. FastEmbed documents a full example (see the sketch after this list).
- ColBERT as a reranker (late interaction)
  - Faster than cross-encoders at inference (no joint encoding), still fine-grained via MaxSim.
  - Qdrant docs explicitly recommend using ColBERT mainly for reranking (100–500 candidates).
- Rank-fusion reranking (no model)
  - Fuse dense + sparse (and/or late-interaction) result lists with algorithms like RRF (and DBSF).
  - Built into the Qdrant Query API: simple, robust, and cheap; great when scores aren't comparable.
Rule of thumb: Retrieve broadly with dense + sparse, fuse (RRF/DBSF), then rerank top-K with a cross-encoder or ColBERT depending on your latency budget.
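A sketch of that rule of thumb end to end, assuming a hypothetical collection docs_hybrid with named vectors "dense" and "splade" and a text payload field; the fusion call uses Qdrant's Query API, and the reranker model name follows FastEmbed's cross-encoder docs:

```python
from fastembed import TextEmbedding, SparseTextEmbedding
from fastembed.rerank.cross_encoder import TextCrossEncoder
from qdrant_client import QdrantClient, models

query = "how do I combine dense and sparse search?"
dense_vec = next(TextEmbedding("BAAI/bge-small-en-v1.5").embed([query])).tolist()
sparse_emb = next(SparseTextEmbedding("prithivida/Splade_PP_en_v1").embed([query]))

client = QdrantClient(url="http://localhost:6333")

# Retrieve broadly with both retrievers, then fuse server-side with RRF.
fused = client.query_points(
    collection_name="docs_hybrid",  # hypothetical hybrid collection
    prefetch=[
        models.Prefetch(query=dense_vec, using="dense", limit=200),
        models.Prefetch(
            query=models.SparseVector(
                indices=sparse_emb.indices.tolist(),
                values=sparse_emb.values.tolist(),
            ),
            using="splade",
            limit=200,
        ),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),  # or models.Fusion.DBSF
    limit=50,
    with_payload=True,
)

# Final precision pass: a cross-encoder scores [query, document] pairs jointly.
reranker = TextCrossEncoder(model_name="jinaai/jina-reranker-v2-base-multilingual")
texts = [point.payload["text"] for point in fused.points]  # assumes a "text" payload
scores = list(reranker.rerank(query, texts))
top10 = sorted(zip(texts, scores), key=lambda pair: pair[1], reverse=True)[:10]
```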
- General semantic search with exact-term guarantees: Dense (FastEmbed) + SPLADE or miniCOIL → RRF/DBSF fusion → optional cross-encoder for the top-50.
- Entity/ID-heavy corpora (APIs, legal, code): first stage SPLADE or miniCOIL (keyword-anchored); second stage ColBERT rerank for fine-grained token matches.
- Latency-sensitive, CPU-only RAG: Dense (FastEmbed ONNX) + SPLADE fused with RRF; only add a cross-encoder/ColBERT rerank when absolutely needed.
Illustrative only; the focus is on how the pieces fit together. See the linked docs for full examples.
from qdrant_client import QdrantClient, models
from fastembed import TextEmbedding, SparseTextEmbedding, LateInteractionTextEmbedding
# 1) Models
dense = TextEmbedding(model_name="BAAI/bge-small-en-v1.5") # example
sparse = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1") # SPLADE
late = LateInteractionTextEmbedding(model_name="colbert-ir/colbertv2.0")
# 2) Collections
client = QdrantClient(":memory:")
# Dense (single vector)
client.create_collection(
"docs_dense",
vectors_config=models.VectorParams(size=384, distance=models.Distance.COSINE),  # bge-small-en-v1.5 is 384-dim
)
# Sparse (SPLADE)
client.create_collection(
"docs_sparse",
sparse_vectors_config={"splade": models.SparseVectorParams()},
)
# Multi-vector (ColBERT with MAX_SIM)
client.create_collection(
"docs_colbert",
vectors_config=models.VectorParams(
size=128, distance=models.Distance.COSINE,  # colbertv2.0 token dimension
multivector_config=models.MultiVectorConfig(
comparator=models.MultiVectorComparator.MAX_SIM
),
),
)
# 3) Retrieval + fusion + rerank (pseudo)
dense_query = next(dense.embed(["query"])).tolist()  # embed() yields numpy arrays
dense_hits = client.query_points("docs_dense", query=dense_query, limit=200)
sparse_vec = next(sparse.embed(["query"]))
sparse_query = models.SparseVector(indices=sparse_vec.indices.tolist(), values=sparse_vec.values.tolist())
sparse_hits = client.query_points("docs_sparse", query=sparse_query, using="splade", limit=200)
# Fuse (RRF/DBSF; supported in Qdrant Query API)
# ... then rerank top-K with a cross-encoder or ColBERT token-level scoring ...
See the linked docs for API capabilities and recommended configuration: multivector/MAX_SIM (ColBERT), sparse vectors, hybrid fusion (RRF/DBSF), and cross-encoder rerankers.
FastEmbed is a lightweight and fast library for generating text embeddings and sparse representations. It's designed to be faster and lighter than other embedding libraries like Transformers and Sentence-Transformers, and is supported and maintained by Qdrant.
- Modular Structure: Each method in its own folder with documentation
- Multiple Embedding Types: Dense, sparse, and multi-vector embeddings
- Advanced Retrieval: miniCOIL, SPLADE, and ColBERT support
- Reranking: Post-processing to improve search results
- Qdrant Integration: Seamless vector database integration
- Interactive Demos: Each method has its own demo with explanations
Before running the examples, make sure you have:
- Python 3.7+ installed
- Qdrant server running locally on http://localhost:6333
git clone <repository-url>
cd qdrant-fastembed-quickstart
Install the dependencies:
pip install -r requirements.txt
Start Qdrant with Docker:
docker run -p 6333:6333 qdrant/qdrant
Or create a docker-compose.yml file:
version: '3.8'
services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"
    volumes:
      - ./qdrant_storage:/qdrant/storage
Then run:
docker-compose up -d
For other setups, follow the official installation guide.
# Copy the example environment file
cp .env.example .env
# Edit .env file with your Qdrant configuration
# QDRANT_URL=http://localhost:6333
# QDRANT_API_KEY=your-api-key-here
Note: If you don't create a .env file, the demos will use default values (localhost:6333).
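For reference, a minimal sketch of how such a configuration is typically read (python-dotenv assumed; the actual demos may differ):

```python
import os
from dotenv import load_dotenv  # python-dotenv, assumed available

load_dotenv()  # silently does nothing if no .env file exists
QDRANT_URL = os.getenv("QDRANT_URL", "http://localhost:6333")
QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")  # None for a local, unsecured instance
```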
Run the main menu to access all demos:
python main.py
This launches an interactive menu that runs each demo in its dedicated folder.
You can also run individual demos directly:
# Basic embeddings
python 01_basic_embeddings/demo.py
# miniCOIL sparse retrieval
python 02_minicoil/demo.py
# SPLADE sparse embeddings
python 03_splade/demo.py
# ColBERT multi-vector search
python 04_colbert/demo.py
# Reranking
python 05_reranking/demo.py
# Qdrant integration
python 06_qdrant_integration/demo.py
# Method comparison
python 08_comparison/demo.py
- 01 - Basic Text Embeddings - Standard semantic embeddings
- 02 - miniCOIL Sparse Retrieval - Keyword-aware semantic search
- 03 - SPLADE Sparse Embeddings - Learned sparse representations
- 04 - ColBERT Multi-Vector Search - Fine-grained token matching
- 05 - Reranking - Post-processing to improve results
- 06 - Qdrant Integration - Vector database integration
- 07 - Model Context Protocol Integration - MCP integration
- 08 - Method Comparison - Side-by-side comparison
Each demo folder contains:
- demo.py - Interactive demonstration
- README.md - Detailed documentation and explanations
FastEmbed Comprehensive Demo
============================================================
1. Basic Text Embeddings (Dense)
2. miniCOIL Sparse Retrieval
3. SPLADE Sparse Embeddings
4. ColBERT Multi-Vector Search
5. Reranking
6. Qdrant Integration Demo
7. Compare All Methods
8. Exit
============================================================
Enter your choice (1-8): 1
Basic Text Embeddings Demo
----------------------------------------
Loading FastEmbed model (BAAI/bge-small-en-v1.5)...
Model loaded successfully!
Generating embeddings...
Generated 3 embeddings
Vector dimensions: 384
Cosine similarity between first two documents: 0.6717
Sample embedding (first 5 dimensions):
[-0.09479033 0.01007713 -0.03085082 0.02376419 0.00238941]
| Method | Type | Use Case |
|---|---|---|
| Dense (BGE) | Dense | General semantic search |
| miniCOIL | Sparse | Keyword + semantic hybrid |
| SPLADE | Sparse | Lexical + semantic hybrid |
| ColBERT | Multi | Fine-grained token matching |
| Reranking | Post-proc | Improve initial results |
qdrant-fastembed-quickstart/
├── main.py                  # Main menu interface
├── basic_example.py         # Original basic example
├── requirements.txt         # Python dependencies
├── README.md                # This main documentation
├── .gitignore               # Git ignore rules for security and cleanliness
├── .env.example             # Environment configuration template
├── docker-compose.yml       # Qdrant setup
├── 01_basic_embeddings/     # Dense embeddings demo
│   ├── demo.py              # Interactive demonstration
│   └── README.md            # Detailed documentation
├── 02_minicoil/             # miniCOIL sparse retrieval
│   ├── demo.py              # Interactive demonstration
│   └── README.md            # Detailed documentation
├── 03_splade/               # SPLADE sparse embeddings
│   ├── demo.py              # Interactive demonstration
│   └── README.md            # Detailed documentation
├── 04_colbert/              # ColBERT multi-vector search
│   ├── demo.py              # Interactive demonstration
│   └── README.md            # Detailed documentation
├── 05_reranking/            # Reranking demo
│   ├── demo.py              # Interactive demonstration
│   └── README.md            # Detailed documentation
├── 06_qdrant_integration/   # Qdrant integration
│   ├── demo.py              # Interactive demonstration
│   └── README.md            # Detailed documentation
├── 07_mcp_server/           # Model Context Protocol integration
│   └── README.md            # Detailed documentation
└── 08_comparison/           # Method comparison
    ├── demo.py              # Interactive demonstration
    └── README.md            # Detailed documentation
The main menu (main.py) provides easy access to all FastEmbed demonstrations through a clean, organized interface.
Each FastEmbed method has its own dedicated folder containing:
- Interactive Demo: demo.py with hands-on examples
- Detailed Documentation: README.md with comprehensive explanations
- Focused Learning: each demo focuses on one specific method
- 01 - Basic Embeddings: Standard dense text embeddings using BAAI/bge-small-en-v1.5
- 02 - miniCOIL: Sparse neural retrieval combining BM25 with semantic understanding
- 03 - SPLADE: Sparse lexical and dense embeddings with learned weights
- 04 - ColBERT: Multi-vector search with fine-grained token matching
- 05 - Reranking: Post-processing to improve search result quality
- 06 - Qdrant Integration: Vector database integration examples
- 07 - MCP Server Integration: Model Context Protocol (MCP) server integration
- 08 - Comparison: Side-by-side method comparison and selection guide
- Start with 01 - Basic Embeddings to understand fundamentals
- Explore 02 - miniCOIL and 03 - SPLADE for keyword-aware search
- Try 04 - ColBERT for fine-grained matching
- Learn about 05 - Reranking for improving results
- Understand 06 - Qdrant Integration
- Use 08 - Comparison to choose the right method for your use case
- Dense Embeddings: General purpose semantic search
- miniCOIL: When exact keyword matches matter but context is important
- SPLADE: When you need both lexical and semantic matching
- ColBERT: When fine-grained token-level matching is crucial
- Reranking: As a second stage to improve any initial retrieval
- Explore different embedding models available in FastEmbed
- Integrate with Qdrant vector database for storage and search
- Build semantic search applications with hybrid approaches
- Experiment with different text preprocessing techniques
- Set up a Qdrant instance for full functionality testing
- FastEmbed Documentation - Getting started with FastEmbed
- FastEmbed SPLADE Guide - Working with SPLADE sparse embeddings
- FastEmbed miniCOIL Guide - Working with miniCOIL sparse retrieval
- FastEmbed ColBERT Guide - Working with ColBERT multi-vector search
- FastEmbed Reranking Guide - Reranking with FastEmbed
- Hybrid Search with FastEmbed - Setup hybrid search
- Qdrant Vectors Documentation - Understanding vector storage
- Qdrant Sparse Vectors - Sparse vector concepts
- Qdrant Hybrid Queries - Combining dense and sparse search
- What is a Sparse Vector? - Sparse vector deep dive
- ColBERT: Efficient and Effective Passage Search - ColBERT research paper
- miniCOIL: on the Road to Usable Sparse Neural Retrieval - miniCOIL background
- SPLADE with FastEmbed Example - SPLADE implementation example
- Qdrant/minicoil-v1 on Hugging Face - miniCOIL model card
- Sparse Vectors Benchmark - Performance comparisons
- FastEmbed GitHub Repository - Source code and issues
- Qdrant Documentation - Complete Qdrant documentation