Sub-Issues (Implementation Plan)
Feature branch: `feat/rag-ingestion` → all sub-PRs target this branch
| # | Issue | Scope | Depends On |
|---|---|---|---|
| 1 | #1264 Core Types & Interfaces | pkg/vectorstore/ types + VectorStoreBackend interface | — |
| 2 | #1265 Text Chunking Module | Chunking strategies (auto, static) + file parsers (.txt, .md, .json, .csv, .html) | #1264 |
| 3 | #1266 Vector Store Backends | Milvus + in-memory HNSW implementations | #1264 |
| 4 | #1267 File Storage Service | Local filesystem file management | #1264 |
| 5 | #1268 Vector Store Management API | CRUD endpoints: POST/GET/LIST/DELETE /v1/vector_stores | #1264, #1266 |
| 6 | #1269 File Upload API | Upload endpoints: POST/GET/LIST/DELETE /v1/files | #1267 |
| 7 | #1270 Async Ingestion Pipeline | Attach file → chunk → embed → store (async worker pool) | #1264-#1267 |
| 8 | #1271 Vector Store Search API | POST /v1/vector_stores/{id}/search | #1266, #1268 |
| 9 | #1272 RAG Plugin Integration | New "vectorstore" RAG backend type in extproc pipeline | #1271 |
| 10 | #1273 Conversation Auto-Capture | Response filter captures Q+A pairs for future RAG | #1268, #1270 |
| 11 | #1274 Config, Init & Integration Tests | Config section, startup init, E2E tests | All above |
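To make sub-issue #1264 more concrete, here is a minimal sketch of what the core types and backend interface could look like; all names and signatures here are illustrative, not the final API:

```go
// Illustrative sketch only; the actual types land in #1264 under pkg/vectorstore/.
package vectorstore

import "context"

// Chunk is one embedded slice of an uploaded file.
type Chunk struct {
	ID        string
	FileID    string
	Text      string
	Embedding []float32
	Metadata  map[string]string
}

// SearchResult pairs a chunk with its similarity score.
type SearchResult struct {
	Chunk Chunk
	Score float32
}

// VectorStoreBackend abstracts Milvus, the in-memory HNSW index, and any
// future backends (e.g. Redis) behind one interface.
type VectorStoreBackend interface {
	CreateCollection(ctx context.Context, name string, dim int) error
	DeleteCollection(ctx context.Context, name string) error
	Insert(ctx context.Context, collection string, chunks []Chunk) error
	Search(ctx context.Context, collection string, query []float32, topK int) ([]SearchResult, error)
	Close() error
}
```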
Future Enhancements (deferred from initial implementation)
- PDF/DOCX file format support (needs external Go libraries)
- Redis vector store backend (third backend option)
- Batch file upload (`POST /v1/vector_stores/{id}/file_batches` - up to 500 files)
- CLI commands (`vllm-sr rag ingest`, `vllm-sr rag list`, `vllm-sr rag delete`)
- Dashboard UI for document management (upload, browse, search)
- Semantic chunking (split by meaning using embedding similarity)
- Hybrid search (vector similarity + BM25 keyword matching)
- Cross-encoder reranking for better retrieval quality
- Document update/re-embedding (detect changes, re-process)
- File deduplication (prevent duplicate uploads)
- Configurable capture mode (query_only, response_only, full_pair)
- Distributed file storage (S3/MinIO for multi-node deployments)
- Token-based chunking (needs tokenizer on Go side)
- Conversation capture filtering (exclude topics, PII scrubbing)
- Vector store expiration (auto-expire based on ExpiresAfter policy)
- File content search (search across metadata, not just embeddings)
Summary
The RAG retrieval plugin (#1152) is merged and functional, but there's no native way to populate the vector database with documents using vSR's own embedding models. This creates a gap where users must rely on external tools (LlamaIndex, OpenAI, etc.) to ingest documents before vSR can retrieve from them.
This issue proposes adding a document ingestion pipeline that completes the RAG stack using vSR's own infrastructure.
Problem
Current state:
Documents → [??? NO SOLUTION ???] → Vector DB → RAG Retrieval (#1152) → LLM
Users must:
- Use external tools (LlamaIndex, NVIDIA NeMo, etc.) to chunk and embed documents
- Or use OpenAI's hosted file_search (data leaves their network)
- Or manually populate Milvus/Redis
This undercuts vSR's key differentiator: its own embedding models (mmBERT, multimodal) that keep data local.
Proposed Solution
Add native document ingestion that leverages vSR's existing infrastructure, following the OpenAI Vector Stores API pattern:
Documents → Upload API → Chunk → Embed (mmBERT) → Store (Milvus/Redis)
- the Embed step uses the existing Candle binding
- the Store step uses the existing vector backends
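A minimal sketch of that chunk → embed → store flow, assuming hypothetical Chunker and Embedder interfaces on top of the core types sketched in the plan above (the real pipeline in #1270 runs this asynchronously):

```go
package vectorstore

import (
	"context"
	"fmt"
)

// Chunker and Embedder are illustrative interfaces; Chunk and
// VectorStoreBackend come from the earlier sketch.
type Chunker interface {
	Split(text string) ([]string, error)
}

type Embedder interface {
	Embed(ctx context.Context, text string) ([]float32, error)
}

// IngestFile walks one file through the chunk → embed → store flow.
func IngestFile(ctx context.Context, raw []byte, fileID, collection string,
	chunker Chunker, embedder Embedder, store VectorStoreBackend) error {

	// 1. Chunk: split the raw document into text pieces.
	texts, err := chunker.Split(string(raw))
	if err != nil {
		return fmt.Errorf("chunking %s: %w", fileID, err)
	}

	// 2. Embed: use the local embedding model (e.g. mmBERT via the Candle binding).
	chunks := make([]Chunk, 0, len(texts))
	for i, t := range texts {
		vec, err := embedder.Embed(ctx, t)
		if err != nil {
			return fmt.Errorf("embedding chunk %d of %s: %w", i, fileID, err)
		}
		chunks = append(chunks, Chunk{
			ID:        fmt.Sprintf("%s-%d", fileID, i),
			FileID:    fileID,
			Text:      t,
			Embedding: vec,
		})
	}

	// 3. Store: write all chunks into the target vector store collection.
	return store.Insert(ctx, collection, chunks)
}
```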
API Design (OpenAI-Compatible)
Vector Store Management:
- `POST /v1/vector_stores` - Create vector store
- `GET /v1/vector_stores/{id}` - Get vector store details
- `DELETE /v1/vector_stores/{id}` - Delete vector store
File Operations:
- `POST /v1/files` - Upload file (purpose: "assistants")
- `POST /v1/vector_stores/{id}/files` - Attach file → auto chunk/embed/index
- `GET /v1/vector_stores/{id}/files` - List files with processing status
Search:
- `POST /v1/vector_stores/{id}/search` - Query with filters
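For illustration, a client-side sketch of the search endpoint. The request/response field names are assumptions modeled loosely on the OpenAI Vector Stores pattern, not the final vSR schema from #1271:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Illustrative request/response shapes for POST /v1/vector_stores/{id}/search.
type SearchRequest struct {
	Query         string `json:"query"`
	MaxNumResults int    `json:"max_num_results,omitempty"`
}

type SearchHit struct {
	FileID  string  `json:"file_id"`
	Score   float64 `json:"score"`
	Content string  `json:"content"`
}

type SearchResponse struct {
	Data []SearchHit `json:"data"`
}

// searchVectorStore posts a query to a running vSR instance; baseURL and the
// response shape are assumptions for illustration.
func searchVectorStore(baseURL, storeID, query string) (*SearchResponse, error) {
	body, err := json.Marshal(SearchRequest{Query: query, MaxNumResults: 5})
	if err != nil {
		return nil, err
	}
	resp, err := http.Post(
		fmt.Sprintf("%s/v1/vector_stores/%s/search", baseURL, storeID),
		"application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var out SearchResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return &out, nil
}
```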
Conversation Auto-Capture
Additionally, a response filter will auto-capture Q+A pairs from live traffic:
LLM Response → Response Filter → Embed Q → Store Q+A in vector store
This allows vSR to learn from real usage and provide better context for future queries.
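A minimal sketch of that capture step, reusing the hypothetical Embedder and VectorStoreBackend interfaces from the sketches above; the real filter (#1273) hooks into the extproc response path:

```go
package vectorstore

import (
	"context"
	"fmt"
	"time"
)

// CaptureQA embeds the user question and stores the question+answer pair so
// that similar future queries can retrieve it as context. Illustrative only.
func CaptureQA(ctx context.Context, question, answer, collection string,
	embedder Embedder, store VectorStoreBackend) error {

	// Embed the question: later retrieval matches on question similarity.
	vec, err := embedder.Embed(ctx, question)
	if err != nil {
		return fmt.Errorf("embedding captured question: %w", err)
	}

	// Store the pair; the answer is the payload text, the question rides in metadata.
	chunk := Chunk{
		ID:        fmt.Sprintf("qa-%d", time.Now().UnixNano()),
		Text:      answer,
		Embedding: vec,
		Metadata: map[string]string{
			"question": question,
			"source":   "conversation_capture",
		},
	}
	return store.Insert(ctx, collection, []Chunk{chunk})
}
```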
vSR Differentiator
While following OpenAI's API pattern, vSR adds:
- Local embeddings using vSR's own models (mmBERT, Qwen3, etc.) - no API costs, data stays local
- Multiple backends - Milvus, in-memory HNSW (Redis planned)
- Integration with routing - RAG context used in routing decisions
- Training-free usability - populate knowledge base, no model training needed
Use Cases
1. Enterprise Knowledge Base
Company uploads HR policies, product docs, FAQs
→ vSR chunks and embeds using mmBERT (local, no API cost)
→ Queries retrieve relevant context
→ Route to LLM with company-specific knowledge
2. Learning From Support Traffic
Customer asks question → LLM answers → vSR captures Q+A
Later, similar question → vSR retrieves past Q+A as context
→ More consistent, informed answers over time
3. Domain-Specific RAG with Routing
Medical org uploads clinical guidelines → collection "medical-kb"
Legal dept uploads contracts → collection "legal-kb"
Query classified as medical → retrieve from medical-kb → route to medical-LLM
Query classified as legal → retrieve from legal-kb → route to legal-LLM
Related Issues/PRs
| Reference | Relationship |
|---|---|
| #1152 | RAG retrieval plugin (merged) - this completes it |
| #1194 | Agentic Memory - similar storage patterns |
| #155 | RAG-Optimized Routing - broader vision |
| #1255 | RAG CLI support - would need ingestion commands |
| #806 | Context Engineering - related context management |
Acceptance Criteria
- Documents can be uploaded via OpenAI-compatible API
- Documents are chunked with configurable strategies
- Chunks are embedded using vSR's models (mmBERT, etc.)
- Embeddings are stored in Milvus or in-memory HNSW
- Stored documents can be retrieved by the RAG plugin (#1152)
- Documents can be listed and deleted
- Multi-collection support for isolation
- Conversation auto-capture stores Q+A pairs for future RAG
- Async processing for file ingestion (like OpenAI)
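For the async-processing item, a minimal worker-pool sketch built on the hypothetical interfaces from the earlier sketches; worker count, queueing, and per-file status tracking (in_progress → completed/failed, owned by #1267) are simplified away:

```go
package vectorstore

import (
	"context"
	"log"
	"sync"
)

// IngestJob is one attached file waiting to be chunked, embedded, and stored.
type IngestJob struct {
	FileID     string
	Collection string
	Raw        []byte
}

// StartIngestWorkers launches n workers that drain the job queue and run the
// chunk → embed → store pipeline. The returned WaitGroup lets callers wait
// for shutdown after cancelling ctx or closing the jobs channel.
func StartIngestWorkers(ctx context.Context, n int, jobs <-chan IngestJob,
	chunker Chunker, embedder Embedder, store VectorStoreBackend) *sync.WaitGroup {

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case <-ctx.Done():
					return
				case job, ok := <-jobs:
					if !ok {
						return
					}
					if err := IngestFile(ctx, job.Raw, job.FileID, job.Collection,
						chunker, embedder, store); err != nil {
						log.Printf("ingestion failed for file %s: %v", job.FileID, err)
					}
				}
			}
		}()
	}
	return &wg
}
```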
Future Enhancement: Quality Feedback Loop & Offline Improvement
The conversation auto-capture feature accumulates Q+A pairs over time, but without quality signals it gets wider (more topics) without getting smarter (more accurate). Captured bad answers can reinforce mistakes.
The Problem
No quality signal exists to distinguish good captured answers from bad ones. The knowledge base grows but doesn't improve.
Proposed Solution: Offline Quality Pipeline
A periodic offline process that curates the captured knowledge base:
- LLM-as-judge — Score each stored Q+A for accuracy/helpfulness, prune low-scoring entries
- Cluster + consolidate — Group similar Q+As by topic, merge duplicates into one high-quality "golden" answer per cluster
- Benchmark evaluation — Test known questions against the knowledge base, measure retrieval quality, remove entries that hurt more than help
- Embedding fine-tuning — Use curated Q+As to fine-tune the embedding model for better future retrieval
This creates a cycle: capture conversations (online) → curate and improve (offline) → better RAG (online).
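As a sketch of the simplest step in that cycle (LLM-as-judge pruning), assuming hypothetical Judge and QAStore interfaces rather than any existing vSR API:

```go
package vectorstore

import "context"

// Judge scores a stored answer's accuracy/helpfulness on a 0 to 1 scale.
// Hypothetical interface; any LLM client could implement it.
type Judge interface {
	Score(ctx context.Context, question, answer string) (float64, error)
}

// QAStore lists and deletes captured Q+A entries; also hypothetical.
type QAStore interface {
	List(ctx context.Context, collection string) ([]Chunk, error)
	Delete(ctx context.Context, collection string, ids []string) error
}

// PruneLowQuality removes captured answers the judge scores below threshold.
func PruneLowQuality(ctx context.Context, collection string, threshold float64,
	judge Judge, store QAStore) error {

	entries, err := store.List(ctx, collection)
	if err != nil {
		return err
	}
	var toDelete []string
	for _, e := range entries {
		score, err := judge.Score(ctx, e.Metadata["question"], e.Text)
		if err != nil {
			continue // skip entries the judge could not score this round
		}
		if score < threshold {
			toDelete = append(toDelete, e.ID)
		}
	}
	if len(toDelete) == 0 {
		return nil
	}
	return store.Delete(ctx, collection, toDelete)
}
```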
Connection to AvengersPro
The AvengersPro approach is relevant here:
- AvengersPro clusters prompts using embeddings, then evaluates model performance per cluster using benchmarks
- Applied here: cluster captured conversations → evaluate answer quality per cluster → keep only high-quality answers → the knowledge base actually improves over time
- This bridges the gap between "training-free" (just capture and retrieve) and "training-aware" (curate based on quality metrics)
Offline Improvement Approaches
| Approach | Input | Output | Complexity |
|---|---|---|---|
| LLM-as-judge scoring | Stored Q+As | Quality scores, pruned bad entries | Low |
| Cluster deduplication | Similar Q+As | One golden answer per topic | Medium |
| Benchmark eval | Known Q+A pairs | Retrieval quality metrics | Medium |
| Embedding fine-tuning | Curated Q+As | Better embedding model | High |
Opportunities This Feature Unlocks
The in-router RAG + conversation capture infrastructure opens up additional capabilities beyond the initial implementation. These are not in scope for the current plan, but worth noting as future potential:
"Your Router Remembers" — Institutional Knowledge
A support agent answers a tricky question through the router. Weeks later, a different user asks something similar. vSR surfaces the previous answer as context — institutional knowledge that doesn't walk out the door when someone leaves. No one had to write a doc or update a FAQ.
"Zero-Effort Knowledge Base"
No one has to write docs or upload files. Just use the system. Over time, it builds its own knowledge base from real conversations. Day 1: empty. Day 90: hundreds of real Q+A pairs, auto-curated.
"Context Without the Context Window"
LLMs forget anything that falls outside their context window (e.g. 128K tokens). vSR's vector store doesn't. A conversation from 6 months ago can inform today's answer — retrieved on demand via embedding similarity.
"Data Stays Local"
Enterprise customers who can't send data to OpenAI for embedding/storage can use vSR's local embedding models (mmBERT, Qwen3). Documents are chunked, embedded, and stored entirely on-premises. OpenAI-compatible API, but no data leaves the network.
"No Training Required"
Upload docs and go. No ML expertise needed, no model training, no GPU clusters. The embedding model is pre-trained — just populate the knowledge base and retrieval works immediately.
Future: Per-Tenant Memory
Multi-tenant vector stores would give each customer/team their own isolated memory. Tenant A's conversations never leak into Tenant B's context. Each tenant's knowledge base grows independently from their own traffic.