Sub-Issues (Implementation Plan)
Feature branch: `feat/rag-ingestion` → all sub-PRs target this branch
| # | Issue | Scope | Depends On |
|---|---|---|---|
| 1 | #1264 Core Types & Interfaces | pkg/vectorstore/ types + VectorStoreBackend interface | — |
| 2 | #1265 Text Chunking Module | Chunking strategies (auto, static) + file parsers (.txt, .md, .json, .csv, .html) | #1264 |
| 3 | #1266 Vector Store Backends | Milvus + in-memory HNSW implementations | #1264 |
| 4 | #1267 File Storage Service | Local filesystem file management | #1264 |
| 5 | #1268 Vector Store Management API | CRUD endpoints: POST/GET/LIST/DELETE /v1/vector_stores | #1264, #1266 |
| 6 | #1269 File Upload API | Upload endpoints: POST/GET/LIST/DELETE /v1/files | #1267 |
| 7 | #1270 Async Ingestion Pipeline | Attach file → chunk → embed → store (async worker pool) | #1264-#1267 |
| 8 | #1271 Vector Store Search API | POST /v1/vector_stores/{id}/search | #1266, #1268 |
| 9 | #1272 RAG Plugin Integration | New "vectorstore" RAG backend type in extproc pipeline | #1271 |
| 10 | #1273 Conversation Auto-Capture | Response filter captures Q+A pairs for future RAG | #1268, #1270 |
| 11 | #1274 Config, Init & Integration Tests | Config section, startup init, E2E tests | All above |
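To make sub-issue #1264 more concrete, here is a minimal sketch of what the core types and backend interface could look like; all names and signatures here are illustrative, not the final API:

```go
// Illustrative sketch only; the actual types land in #1264 under pkg/vectorstore/.
package vectorstore

import "context"

// Chunk is one embedded slice of an uploaded file.
type Chunk struct {
	ID        string
	FileID    string
	Text      string
	Embedding []float32
	Metadata  map[string]string
}

// SearchResult pairs a chunk with its similarity score.
type SearchResult struct {
	Chunk Chunk
	Score float32
}

// VectorStoreBackend abstracts Milvus, the in-memory HNSW index, and any
// future backends (e.g. Redis) behind one interface.
type VectorStoreBackend interface {
	CreateCollection(ctx context.Context, name string, dim int) error
	DeleteCollection(ctx context.Context, name string) error
	Insert(ctx context.Context, collection string, chunks []Chunk) error
	Search(ctx context.Context, collection string, query []float32, topK int) ([]SearchResult, error)
	Close() error
}
```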
Future Enhancements (deferred from initial implementation)
- PDF/DOCX file format support (needs external Go libraries)
- Redis vector store backend (third backend option)
- Batch file upload (`POST /v1/vector_stores/{id}/file_batches` - up to 500 files)
- CLI commands (`vllm-sr rag ingest`, `vllm-sr rag list`, `vllm-sr rag delete`)
- Dashboard UI for document management (upload, browse, search)
- Semantic chunking (split by meaning using embedding similarity)
- Hybrid search (vector similarity + BM25 keyword matching)
- Cross-encoder reranking for better retrieval quality
- Document update/re-embedding (detect changes, re-process)
- File deduplication (prevent duplicate uploads)
- Configurable capture mode (query_only, response_only, full_pair)
- Distributed file storage (S3/MinIO for multi-node deployments)
- Token-based chunking (needs tokenizer on Go side)
- Conversation capture filtering (exclude topics, PII scrubbing)
- Vector store expiration (auto-expire based on ExpiresAfter policy)
- File content search (search across metadata, not just embeddings)
Summary
The RAG retrieval plugin (#1152) is merged and functional, but there's no native way to populate the vector database with documents using vSR's own embedding models. This creates a gap where users must rely on external tools (LlamaIndex, OpenAI, etc.) to ingest documents before vSR can retrieve from them.
This issue proposes adding a document ingestion pipeline that completes the RAG stack using vSR's own infrastructure.
Problem
Current state:
Documents → [??? NO SOLUTION ???] → Vector DB → RAG Retrieval (#1152) → LLM
Users must:
- Use external tools (LlamaIndex, NVIDIA NeMo, etc.) to chunk and embed documents
- Or use OpenAI's hosted file_search (data leaves their network)
- Or manually populate Milvus/Redis
This undercuts vSR's key differentiator: its own embedding models (mmBERT, multimodal) that keep data local.
Proposed Solution
Add native document ingestion that leverages vSR's existing infrastructure, following the OpenAI Vector Stores API pattern:
Documents → Upload API → Chunk → Embed (mmBERT) → Store (Milvus/Redis)
- the Embed step uses the existing Candle binding
- the Store step uses the existing vector backends
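A minimal sketch of that chunk → embed → store flow, assuming hypothetical Chunker and Embedder interfaces on top of the core types sketched in the plan above (the real pipeline in #1270 runs this asynchronously):

```go
package vectorstore

import (
	"context"
	"fmt"
)

// Chunker and Embedder are illustrative interfaces; Chunk and
// VectorStoreBackend come from the earlier sketch.
type Chunker interface {
	Split(text string) ([]string, error)
}

type Embedder interface {
	Embed(ctx context.Context, text string) ([]float32, error)
}

// IngestFile walks one file through the chunk → embed → store flow.
func IngestFile(ctx context.Context, raw []byte, fileID, collection string,
	chunker Chunker, embedder Embedder, store VectorStoreBackend) error {

	// 1. Chunk: split the raw document into text pieces.
	texts, err := chunker.Split(string(raw))
	if err != nil {
		return fmt.Errorf("chunking %s: %w", fileID, err)
	}

	// 2. Embed: use the local embedding model (e.g. mmBERT via the Candle binding).
	chunks := make([]Chunk, 0, len(texts))
	for i, t := range texts {
		vec, err := embedder.Embed(ctx, t)
		if err != nil {
			return fmt.Errorf("embedding chunk %d of %s: %w", i, fileID, err)
		}
		chunks = append(chunks, Chunk{
			ID:        fmt.Sprintf("%s-%d", fileID, i),
			FileID:    fileID,
			Text:      t,
			Embedding: vec,
		})
	}

	// 3. Store: write all chunks into the target vector store collection.
	return store.Insert(ctx, collection, chunks)
}
```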
API Design (OpenAI-Compatible)
Vector Store Management:
- `POST /v1/vector_stores` - Create vector store
- `GET /v1/vector_stores/{id}` - Get vector store details
- `DELETE /v1/vector_stores/{id}` - Delete vector store
File Operations:
- `POST /v1/files` - Upload file (purpose: "assistants")
- `POST /v1/vector_stores/{id}/files` - Attach file → auto chunk/embed/index
- `GET /v1/vector_stores/{id}/files` - List files with processing status
Search:
- `POST /v1/vector_stores/{id}/search` - Query with filters
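For illustration, a client-side sketch of the search endpoint. The request/response field names are assumptions modeled loosely on the OpenAI Vector Stores pattern, not the final vSR schema from #1271:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Illustrative request/response shapes for POST /v1/vector_stores/{id}/search.
type SearchRequest struct {
	Query         string `json:"query"`
	MaxNumResults int    `json:"max_num_results,omitempty"`
}

type SearchHit struct {
	FileID  string  `json:"file_id"`
	Score   float64 `json:"score"`
	Content string  `json:"content"`
}

type SearchResponse struct {
	Data []SearchHit `json:"data"`
}

// searchVectorStore posts a query to a running vSR instance; baseURL and the
// response shape are assumptions for illustration.
func searchVectorStore(baseURL, storeID, query string) (*SearchResponse, error) {
	body, err := json.Marshal(SearchRequest{Query: query, MaxNumResults: 5})
	if err != nil {
		return nil, err
	}
	resp, err := http.Post(
		fmt.Sprintf("%s/v1/vector_stores/%s/search", baseURL, storeID),
		"application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var out SearchResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return &out, nil
}
```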
Conversation Auto-Capture
Additionally, a response filter will auto-capture Q+A pairs from live traffic:
LLM Response → Response Filter → Embed Q → Store Q+A in vector store
This allows vSR to learn from real usage and provide better context for future queries.
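A minimal sketch of that capture step, reusing the hypothetical Embedder and VectorStoreBackend interfaces from the sketches above; the real filter (#1273) hooks into the extproc response path:

```go
package vectorstore

import (
	"context"
	"fmt"
	"time"
)

// CaptureQA embeds the user question and stores the question+answer pair so
// that similar future queries can retrieve it as context. Illustrative only.
func CaptureQA(ctx context.Context, question, answer, collection string,
	embedder Embedder, store VectorStoreBackend) error {

	// Embed the question: later retrieval matches on question similarity.
	vec, err := embedder.Embed(ctx, question)
	if err != nil {
		return fmt.Errorf("embedding captured question: %w", err)
	}

	// Store the pair; the answer is the payload text, the question rides in metadata.
	chunk := Chunk{
		ID:        fmt.Sprintf("qa-%d", time.Now().UnixNano()),
		Text:      answer,
		Embedding: vec,
		Metadata: map[string]string{
			"question": question,
			"source":   "conversation_capture",
		},
	}
	return store.Insert(ctx, collection, []Chunk{chunk})
}
```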
vSR Differentiator
While following OpenAI's API pattern, vSR adds:
- Local embeddings using vSR's own models (mmBERT, Qwen3, etc.) - no API costs, data stays local
- Multiple backends - Milvus, in-memory HNSW (Redis planned)
- Integration with routing - RAG context used in routing decisions
- Training-free usability - populate knowledge base, no model training needed
Use Cases
1. Enterprise Knowledge Base
Company uploads HR policies, product docs, FAQs
→ vSR chunks and embeds using mmBERT (local, no API cost)
→ Queries retrieve relevant context
→ Route to LLM with company-specific knowledge
2. Learning From Support Traffic
Customer asks question → LLM answers → vSR captures Q+A
Later, similar question → vSR retrieves past Q+A as context
→ More consistent, informed answers over time
3. Domain-Specific RAG with Routing
Medical org uploads clinical guidelines → collection "medical-kb"
Legal dept uploads contracts → collection "legal-kb"
Query classified as medical → retrieve from medical-kb → route to medical-LLM
Query classified as legal → retrieve from legal-kb → route to legal-LLM
Related Issues/PRs
| Reference | Relationship |
|---|---|
| #1152 | RAG retrieval plugin (merged) - this completes it |
| #1194 | Agentic Memory - similar storage patterns |
| #155 | RAG-Optimized Routing - broader vision |
| #1255 | RAG CLI support - would need ingestion commands |
| #806 | Context Engineering - related context management |
Acceptance Criteria
- Documents can be uploaded via OpenAI-compatible API
- Documents are chunked with configurable strategies
- Chunks are embedded using vSR's models (mmBERT, etc.)
- Embeddings are stored in Milvus or in-memory HNSW
- Stored documents can be retrieved by the RAG plugin (#1152)
- Documents can be listed and deleted
- Multi-collection support for isolation
- Conversation auto-capture stores Q+A pairs for future RAG
- Async processing for file ingestion (like OpenAI)
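For the async-processing item, a minimal worker-pool sketch built on the hypothetical interfaces from the earlier sketches; worker count, queueing, and per-file status tracking (in_progress → completed/failed, owned by #1267) are simplified away:

```go
package vectorstore

import (
	"context"
	"log"
	"sync"
)

// IngestJob is one attached file waiting to be chunked, embedded, and stored.
type IngestJob struct {
	FileID     string
	Collection string
	Raw        []byte
}

// StartIngestWorkers launches n workers that drain the job queue and run the
// chunk → embed → store pipeline. The returned WaitGroup lets callers wait
// for shutdown after cancelling ctx or closing the jobs channel.
func StartIngestWorkers(ctx context.Context, n int, jobs <-chan IngestJob,
	chunker Chunker, embedder Embedder, store VectorStoreBackend) *sync.WaitGroup {

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case <-ctx.Done():
					return
				case job, ok := <-jobs:
					if !ok {
						return
					}
					if err := IngestFile(ctx, job.Raw, job.FileID, job.Collection,
						chunker, embedder, store); err != nil {
						log.Printf("ingestion failed for file %s: %v", job.FileID, err)
					}
				}
			}
		}()
	}
	return &wg
}
```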
Future Enhancement: Quality Feedback Loop & Offline Improvement
The conversation auto-capture feature accumulates Q+A pairs over time, but without quality signals it gets wider (more topics) without getting smarter (more accurate). Captured bad answers can reinforce mistakes.
The Problem
No quality signal exists to distinguish good captured answers from bad ones. The knowledge base grows but doesn't improve.
Proposed Solution: Offline Quality Pipeline
A periodic offline process that curates the captured knowledge base:
- LLM-as-judge — Score each stored Q+A for accuracy/helpfulness, prune low-scoring entries
- Cluster + consolidate — Group similar Q+As by topic, merge duplicates into one high-quality "golden" answer per cluster
- Benchmark evaluation — Test known questions against the knowledge base, measure retrieval quality, remove entries that hurt more than help
- Embedding fine-tuning — Use curated Q+As to fine-tune the embedding model for better future retrieval
This creates a cycle: capture conversations (online) → curate and improve (offline) → better RAG (online).
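As a sketch of the simplest step in that cycle (LLM-as-judge pruning), assuming hypothetical Judge and QAStore interfaces rather than any existing vSR API:

```go
package vectorstore

import "context"

// Judge scores a stored answer's accuracy/helpfulness on a 0 to 1 scale.
// Hypothetical interface; any LLM client could implement it.
type Judge interface {
	Score(ctx context.Context, question, answer string) (float64, error)
}

// QAStore lists and deletes captured Q+A entries; also hypothetical.
type QAStore interface {
	List(ctx context.Context, collection string) ([]Chunk, error)
	Delete(ctx context.Context, collection string, ids []string) error
}

// PruneLowQuality removes captured answers the judge scores below threshold.
func PruneLowQuality(ctx context.Context, collection string, threshold float64,
	judge Judge, store QAStore) error {

	entries, err := store.List(ctx, collection)
	if err != nil {
		return err
	}
	var toDelete []string
	for _, e := range entries {
		score, err := judge.Score(ctx, e.Metadata["question"], e.Text)
		if err != nil {
			continue // skip entries the judge could not score this round
		}
		if score < threshold {
			toDelete = append(toDelete, e.ID)
		}
	}
	if len(toDelete) == 0 {
		return nil
	}
	return store.Delete(ctx, collection, toDelete)
}
```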
Connection to AvengersPro
The AvengersPro approach is relevant here:
- AvengersPro clusters prompts using embeddings, then evaluates model performance per cluster using benchmarks
- Applied here: cluster captured conversations → evaluate answer quality per cluster → keep only high-quality answers → the knowledge base actually improves over time
- This bridges the gap between "training-free" (just capture and retrieve) and "training-aware" (curate based on quality metrics)
Offline Improvement Approaches
| Approach | Input | Output | Complexity |
|---|---|---|---|
| LLM-as-judge scoring | Stored Q+As | Quality scores, pruned bad entries | Low |
| Cluster deduplication | Similar Q+As | One golden answer per topic | Medium |
| Benchmark eval | Known Q+A pairs | Retrieval quality metrics | Medium |
| Embedding fine-tuning | Curated Q+As | Better embedding model | High |
Opportunities This Feature Unlocks
The in-router RAG + conversation capture infrastructure opens up additional capabilities beyond the initial implementation. These are not in scope for the current plan, but worth noting as future potential:
"Your Router Remembers" — Institutional Knowledge
A support agent answers a tricky question through the router. Weeks later, a different user asks something similar. vSR surfaces the previous answer as context — institutional knowledge that doesn't walk out the door when someone leaves. No one had to write a doc or update a FAQ.
"Zero-Effort Knowledge Base"
No one has to write docs or upload files. Just use the system. Over time, it builds its own knowledge base from real conversations. Day 1: empty. Day 90: hundreds of real Q+A pairs, auto-curated.
"Context Without the Context Window"
LLMs forget anything that falls outside their context window (e.g. 128K tokens). vSR's vector store doesn't. A conversation from 6 months ago can inform today's answer — retrieved on demand via embedding similarity.
"Data Stays Local"
Enterprise customers who can't send data to OpenAI for embedding/storage can use vSR's local embedding models (mmBERT, Qwen3). Documents are chunked, embedded, and stored entirely on-premises. OpenAI-compatible API, but no data leaves the network.
"No Training Required"
Upload docs and go. No ML expertise needed, no model training, no GPU clusters. The embedding model is pre-trained — just populate the knowledge base and retrieval works immediately.
Future: Per-Tenant Memory
Multi-tenant vector stores would give each customer/team their own isolated memory. Tenant A's conversations never leak into Tenant B's context. Each tenant's knowledge base grows independently from their own traffic.