Roadmap note: convert vague bullets without acceptance criteria into checkbox tasks. Format: - [ ] <Task> (Target: <Q/Year>).

RAG Module Roadmap

Current Status

v2.0.0 – Production-ready Retrieval-Augmented Generation system. 27 implementation files covering evaluation, knowledge gap detection, ethical compliance, multi-judge orchestration, streaming retrieval, cross-encoder re-ranking, hybrid BM25+vector retrieval, batch evaluation, calibration, LRU evaluation caching, REPLUG-style LLM-scored fusion, and Constitutional AI / RLAIF training pipeline.

Completed ✅

In Progress 🚧

(none currently in progress)

Planned Features 📋

Short-term (Next 3-6 months)

OntologyAwareRetriever: ontology-driven entity retrieval via OntologyManager (Target: Q4 2026)
- Affected: include/rag/ontology_aware_retriever.h (new), src/rag/ontology_aware_retriever.cpp (new)
- Inputs: query text + domain ontology path; Outputs: RetrievedDocument list with ontology context
- Expected behavior: entity linking uses OntologyManager::isA() for hypernym expansion; retrieval paths follow the permitted relation types from allowedEdgeTypes(); KnowledgeGraphRetriever is extended with reasoner hooks
- Constraints: ≤ 50 ms latency for top-20 retrieval on graphs with ≤ 1 M nodes
- Errors: unknown entities → fallback to BM25; ontology not loaded → standard retrieval
- Tests: OAR-01..OAR-08 in tests/rag/test_ontology_aware_retriever.cpp
- Perf: entity expansion ≤ 5 ms; total top-20 latency ≤ 50 ms
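The expansion and fallback rules of this item can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the planned implementation: `ToyOntology`, `Expansion`, `RetrievalPath`, and `expandEntity` are hypothetical names, and the real `OntologyManager` interface may look quite different. Only the `isA()` idea, the BM25 fallback for unknown entities, and the standard-retrieval fallback when no ontology is loaded are taken from the item.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for OntologyManager: each entity maps to its direct hypernym.
struct ToyOntology {
    std::unordered_map<std::string, std::string> parent;

    // True if `entity` equals `ancestor` or reaches it via parent links.
    bool isA(const std::string& entity, const std::string& ancestor) const {
        std::string cur = entity;
        // Bounded walk so a cyclic (malformed) ontology cannot loop forever.
        for (std::size_t step = 0; step <= parent.size(); ++step) {
            if (cur == ancestor) return true;
            auto it = parent.find(cur);
            if (it == parent.end()) return false;
            cur = it->second;
        }
        return false;
    }

    // An entity is known if it appears as a child or as a hypernym.
    bool knows(const std::string& entity) const {
        if (parent.count(entity) > 0) return true;
        for (const auto& kv : parent) {
            if (kv.second == entity) return true;
        }
        return false;
    }
};

enum class RetrievalPath { Ontology, Bm25Fallback, Standard };

struct Expansion {
    RetrievalPath path;              // which retrieval path the caller should take
    std::vector<std::string> terms;  // the entity followed by its hypernyms
};

// Assumed mapping of the error rules: no ontology -> standard retrieval,
// unknown entity -> BM25 fallback, otherwise hypernym expansion.
Expansion expandEntity(const std::optional<ToyOntology>& onto, const std::string& entity) {
    if (!onto) return {RetrievalPath::Standard, {entity}};
    if (!onto->knows(entity)) return {RetrievalPath::Bm25Fallback, {entity}};
    std::vector<std::string> terms{entity};
    std::string cur = entity;
    for (std::size_t step = 0; step < onto->parent.size(); ++step) {
        auto it = onto->parent.find(cur);
        if (it == onto->parent.end()) break;
        cur = it->second;
        terms.push_back(cur);
    }
    return {RetrievalPath::Ontology, terms};
}
```

With a toy hierarchy poodle → dog → mammal, `expandEntity` returns the three terms in that order for "poodle", and signals the BM25 fallback for an entity the ontology does not contain.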

Long-term (6-12 months)

Agentic RAG with iterative retrieval loops (rag/agentic_rag.cpp) (Issue: #2241)
Multi-modal RAG (image + text retrieval) (rag/multimodal_rag.cpp) (Issue: #2243)
Online learning from evaluation feedback (adaptive retrieval) (Issue: #2244)
Distributed RAG evaluation across multiple judge models (Issue: #2245) — rag/distributed_rag_evaluator.h/.cpp; thread-pool parallel dispatch; MEAN/WEIGHTED_MEAN/MAJORITY_VOTING/BEST_OF_N aggregation; inter-judge agreement metric; factory helpers
Performance benchmarks (recall@10, latency targets) — benchmarks/bench_rag_evaluation.cpp; recall@K harness; FAST/BALANCED/THOROUGH latency; batch throughput; DistributedRAGEvaluator benchmark; PromptInjectionDetector scan throughput; end-to-end pipeline
Security audit (prompt injection in retrieved context) — rag/prompt_injection_detector.h/.cpp; pattern-based detection (instruction-override, system-prompt-leak, delimiter-escape, role-injection, markup-injection, Unicode bidi); density threshold; PromptInjectionSanitizer; full unit test coverage
Semantic network integration: KnowledgeGraphRetriever + KnowledgeGraphReasoner for multi-hop reasoning (Target: Q3 2027)
- Affected: include/rag/knowledge_graph_retriever.h, src/rag/knowledge_graph_retriever.cpp
- Expected behavior: retrieve() automatically triggers KnowledgeGraphReasoner::infer() for up to max_inference_hops hops; inference chains are added as supplementary context; explanation chains are available in RetrievedDocument.metadata["reasoning_chain"]
- Constraints: multi-hop reasoning ≤ 200 ms P99 (≤ 5 hops, ≤ 100 k edges)
- Errors: reasoning timeout → fallback to direct KG query without inference
- Tests: KGR-RAG-01..KGR-RAG-06 in tests/rag/test_knowledge_graph_retriever_reasoning.cpp
LoRA-Enhanced Domain Retrieval for pattern recognition (Target: Q2 2027)
- Affected: include/rag/lora_enhanced_retriever.h (new), src/rag/lora_enhanced_retriever.cpp (new)
- Expected behavior: domain-specific LoRA adapters (e.g. "legal_rag_v1", "medical_rag_v1") re-rank retrieval results; MultiLoRAManager::selectAdapterForQuery() selects an adapter by the similarity of the query embedding to the adapter's domain
- Constraints: LoRA re-ranking ≤ 100 ms for top-50 documents; guard THEMIS_ENABLE_LLM
- Errors: no matching adapter → standard RRF fusion; adapter load failure → fallback
- Tests: LER-01..LER-06 in tests/rag/test_lora_enhanced_retriever.cpp
- Perf: re-ranking improves MRR@10 by ≥ 5% over the pure RRF baseline
- Knowledge representation: the LoRA adapter implicitly encodes domain-specific concept hierarchies
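The adapter selection in the LoRA item could look roughly like the sketch below. It assumes cosine similarity between the query embedding and one embedding per adapter domain, plus a minimum-similarity cutoff; `AdapterProfile`, `cosine`, `selectAdapter`, and the cutoff parameter are hypothetical and do not describe the actual `MultiLoRAManager::selectAdapterForQuery()` signature. An empty result stands for "no matching adapter", in which case the caller would fall back to standard RRF fusion.

```cpp
#include <cmath>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical description of an adapter: a name plus a domain embedding.
struct AdapterProfile {
    std::string name;
    std::vector<float> domain_embedding;
};

// Cosine similarity; returns 0 for mismatched sizes, empty input, or zero vectors.
float cosine(const std::vector<float>& a, const std::vector<float>& b) {
    if (a.size() != b.size() || a.empty()) return 0.0f;
    float dot = 0.0f, na = 0.0f, nb = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    if (na == 0.0f || nb == 0.0f) return 0.0f;
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

// Picks the adapter whose domain embedding is most similar to the query.
// Returns std::nullopt when no adapter reaches min_similarity (assumed cutoff),
// which the caller treats as "use standard RRF fusion".
std::optional<std::string> selectAdapter(const std::vector<AdapterProfile>& adapters,
                                         const std::vector<float>& query_embedding,
                                         float min_similarity) {
    std::optional<std::string> best;
    float best_sim = min_similarity;
    for (const auto& adapter : adapters) {
        const float sim = cosine(adapter.domain_embedding, query_embedding);
        if (sim >= best_sim) {
            best_sim = sim;
            best = adapter.name;
        }
    }
    return best;
}
```

Returning `std::optional` keeps the "no matching adapter" case explicit at the call site instead of encoding it as an empty string.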

Implementation Phases

Phase 1: Evaluation Pipeline & Multi-Judge System (Status: Completed ✅)

RAGJudge – main orchestrator for multi-dimensional evaluation
KnowledgeGapDetector – three-level gap detection system
LLM integration bridge to InferenceEngineEnhanced
FaithfulnessEvaluator, RelevanceEvaluator, CompletenessEvaluator, CoherenceEvaluator
BiasDetector – ethical compliance checking
ClaimExtractor, ResponseParser, PromptTemplates, JudgeConfig
RubricEvaluator, JudgeEnsemble, PairwiseComparator
CoTEvaluator, GEvalEvaluator (Liu et al., 2023), LLMJudgeIntegration, LLMMetaAnalyzer
Fast (~100 ms), Balanced (~500 ms), and Thorough (~2 s) evaluation modes

Phase 2: Streaming Retrieval & Re-Ranking (Status: Completed ✅)

Streaming retrieval with incremental context window filling
Re-ranking layer with cross-encoder model integration
Hallucination rate tracking dashboard

Phase 3: Hybrid Retrieval & Citation Highlighting (Status: Completed ✅)

Hybrid retrieval (BM25 + vector) with configurable RRF weights
Citation highlighting (map answer sentences to source chunks)
Configurable chunk size and overlap for document splitting
Multi-document summarization before context injection
Per-query evaluation report export (JSON / HTML) (Issue: #2240)
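As an illustration of hybrid retrieval with configurable RRF weights, the sketch below fuses a BM25 ranking and a vector ranking with weighted Reciprocal Rank Fusion, where each list contributes w / (k + rank) per document. The function name `fuseRrf`, the `Ranking` alias, and the default k = 60 (a value commonly used for RRF) are assumptions for the example and are not taken from the module's API.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Ranking = std::vector<std::string>;  // document ids, best first

// Weighted RRF: score(d) = sum over lists of weight / (k + rank of d in that list),
// with ranks starting at 1. Documents missing from a list get no contribution from it.
std::vector<std::pair<std::string, double>> fuseRrf(const Ranking& bm25,
                                                    const Ranking& vec,
                                                    double w_bm25,
                                                    double w_vec,
                                                    double k = 60.0) {
    std::unordered_map<std::string, double> score;
    auto accumulate = [&](const Ranking& ranking, double weight) {
        for (std::size_t i = 0; i < ranking.size(); ++i) {
            score[ranking[i]] += weight / (k + static_cast<double>(i + 1));
        }
    };
    accumulate(bm25, w_bm25);
    accumulate(vec, w_vec);

    std::vector<std::pair<std::string, double>> fused(score.begin(), score.end());
    // Sort by descending score; break ties by id so the order is deterministic.
    std::sort(fused.begin(), fused.end(), [](const auto& a, const auto& b) {
        if (a.second != b.second) return a.second > b.second;
        return a.first < b.first;
    });
    return fused;
}
```

Because RRF works on ranks rather than raw scores, BM25 and cosine scores do not need to be normalized against each other; the two weights only shift how much each list counts.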

Phase 4: Agentic & Knowledge-Graph RAG (Status: Completed ✅)

Agentic RAG with iterative retrieval loops (rag/agentic_rag.cpp)
Knowledge graph-augmented retrieval (entity linking)
Multi-modal RAG (image + text retrieval) (rag/multimodal_rag.cpp)
Online learning from evaluation feedback (adaptive retrieval)

Phase 5: Distributed Evaluation, Benchmarks & Security (Status: Completed ✅)

Distributed RAG evaluation across multiple judge models (rag/distributed_rag_evaluator.h/.cpp) (Issue: #2245) — thread-pool parallel dispatch; MEAN/WEIGHTED_MEAN/MAJORITY_VOTING/BEST_OF_N aggregation; factory helpers
Performance benchmark harness (benchmarks/bench_rag_evaluation.cpp) — recall@K (K=1/5/10/20/50); FAST/BALANCED/THOROUGH latency; batch throughput; end-to-end pipeline benchmark
Prompt injection detection and sanitization (rag/prompt_injection_detector.h/.cpp) — security audit for retrieved context; pattern-based heuristic detector; PromptInjectionSanitizer with configurable thresholds
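The four aggregation modes named for the distributed evaluator can be sketched as below. The enum values mirror the names in the roadmap, but their exact semantics are not spelled out there, so the sketch makes assumptions: MAJORITY_VOTING binarizes each judge score at a pass threshold and returns 1.0 or 0.0, and BEST_OF_N returns the highest score. `JudgeScore` and `aggregate` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

enum class Aggregation { MEAN, WEIGHTED_MEAN, MAJORITY_VOTING, BEST_OF_N };

struct JudgeScore {
    double score;   // judge verdict in [0, 1]
    double weight;  // used only by WEIGHTED_MEAN
};

// Combines per-judge scores into one value. Semantics of MAJORITY_VOTING and
// BEST_OF_N are assumptions made for this sketch (see the lead-in).
double aggregate(const std::vector<JudgeScore>& scores,
                 Aggregation mode,
                 double pass_threshold = 0.5) {
    assert(!scores.empty());
    switch (mode) {
        case Aggregation::MEAN: {
            double sum = 0.0;
            for (const auto& s : scores) sum += s.score;
            return sum / static_cast<double>(scores.size());
        }
        case Aggregation::WEIGHTED_MEAN: {
            double sum = 0.0, weight_sum = 0.0;
            for (const auto& s : scores) {
                sum += s.score * s.weight;
                weight_sum += s.weight;
            }
            assert(weight_sum > 0.0);
            return sum / weight_sum;
        }
        case Aggregation::MAJORITY_VOTING: {
            std::size_t passes = 0;
            for (const auto& s : scores) {
                if (s.score >= pass_threshold) ++passes;
            }
            // Strict majority of judges must pass.
            return (2 * passes > scores.size()) ? 1.0 : 0.0;
        }
        case Aggregation::BEST_OF_N: {
            double best = scores.front().score;
            for (const auto& s : scores) best = std::max(best, s.score);
            return best;
        }
    }
    return 0.0;  // not reached for valid enum values
}
```

For three judges scoring 0.8, 0.6, and 0.2 with weights 1, 3, and 0, the weighted mean is (0.8 + 1.8) / 4 = 0.65, while the plain mean is about 0.53.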

Phase 6: REPLUG Co-Training & Constitutional AI / RLAIF (Status: Completed ✅)

ReplugRetriever — REPLUG-style LLM-scored retrieval fusion (rag/replug_retriever.h/.cpp) (Target: Q1 2026) — Inputs: query + RetrievedDocument list; Outputs: ReplugFusionResult with fused scores; λ interpolation, softmax temperature, min_retrieval_score filter, REPLUG-LSR weight update via KL gradient; ILLMScorer plugin; HeuristicLLMScorer (Jaccard); 30 unit tests
RLAIFTrainer — Constitutional AI + RLAIF preference dataset generation (rag/rlaif_trainer.h/.cpp) (Target: Q1 2026) — Inputs: query + draft response; Outputs: PreferencePair (prompt, chosen, rejected); critique-revision loop; IAIJudge plugin; HeuristicAIJudge; AIPrinciple registry; processBatch(); RLAIFConfig; 30 unit tests
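The roadmap names λ interpolation, a softmax temperature, and a min_retrieval_score filter for ReplugRetriever but does not give the formula. One plausible reading, shown here purely as a sketch, is a convex combination of two temperature-scaled softmax distributions, one over retrieval scores and one over LLM scores, computed only over candidates that pass the filter. `Candidate`, `softmax`, and `fuseReplug` are hypothetical names; the REPLUG-LSR weight update is not covered.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable softmax with temperature (subtracts the maximum first).
std::vector<double> softmax(const std::vector<double>& x, double temperature) {
    assert(temperature > 0.0);
    std::vector<double> out(x.size());
    if (x.empty()) return out;
    const double max_x = *std::max_element(x.begin(), x.end());
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = std::exp((x[i] - max_x) / temperature);
        sum += out[i];
    }
    for (auto& v : out) v /= sum;
    return out;
}

struct Candidate {
    double retrieval_score;  // score from the retriever
    double llm_score;        // score assigned by an LLM scorer
};

// Assumed fusion: fused = lambda * softmax(retrieval) + (1 - lambda) * softmax(llm).
// Candidates below min_retrieval_score are excluded and receive a fused score of 0.
std::vector<double> fuseReplug(const std::vector<Candidate>& candidates,
                               double lambda,
                               double temperature,
                               double min_retrieval_score) {
    std::vector<std::size_t> kept;
    std::vector<double> retrieval, llm;
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        if (candidates[i].retrieval_score >= min_retrieval_score) {
            kept.push_back(i);
            retrieval.push_back(candidates[i].retrieval_score);
            llm.push_back(candidates[i].llm_score);
        }
    }
    const std::vector<double> p_retrieval = softmax(retrieval, temperature);
    const std::vector<double> p_llm = softmax(llm, temperature);

    std::vector<double> fused(candidates.size(), 0.0);
    for (std::size_t j = 0; j < kept.size(); ++j) {
        fused[kept[j]] = lambda * p_retrieval[j] + (1.0 - lambda) * p_llm[j];
    }
    return fused;
}
```

With λ = 1 the result follows the retriever alone and with λ = 0 the LLM scorer alone; the fused scores of the kept candidates sum to 1 in either case.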

Phase 7: Context-Window Management & Token Budget (Status: Completed ✅)

ContextWindowBudget — central token-budget model (include/llm/context_window_budget.h) (Target: Q2 2026) — Inputs: model_ctx, system_prompt, query, min_response; Outputs: available_context_tokens, reserved_response_tokens; heuristic estimator ceil(chars/3.5); 20% floor on response reservation; fallback 4096; 30 unit tests
RAGContextAssembler — budget-aware chunk selection (include/rag/rag_context_assembler.h, src/rag/rag_context_assembler.cpp) (Target: Q2 2026) — Greedy Fill with Response Guard; truncation with configurable marker; computeMaxTokens(); 30 unit tests
MultiStepRAGOrchestrator — Map-Reduce and Iterative strategies (include/rag/multi_step_rag.h, src/rag/multi_step_rag.cpp) (Target: Q2 2026) — Map: batch partitioning bounded by context budget; Reduce: partial-answer synthesis; Iterative: gap-detection loop, max_iterations guard, deduplication; factory helpers; 15 unit tests
LlamaCppPlugin::loadModel() reads n_ctx/context_length from config JSON → ModelInfo::context_length; fallback 4096 (Target: Q2 2026)
LlamaCppPlugin::generateRAG() replaced naive doc concat with RAGContextAssembler; max_tokens capped via computeMaxTokens() (Target: Q2 2026)
RAGContext::max_context_tokens set to 0 (dynamic fallback); response_budget_tokens field added (Target: Q2 2026)
RAGPromptConfig::reserved_response_tokens field added (default: 512) (Target: Q2 2026)
MultiHopReasoner — multi-hop reasoning with query decomposition (include/rag/multi_hop_reasoner.h, src/rag/multi_hop_reasoner.cpp) (Target: Q2 2026) — heuristic + LLM-based decomposition; per-hop retrieval + inference with context injection; answer composition; factory helpers (single-hop, balanced, deep-reasoning); 15 unit tests
AdaptiveRetrieval — adaptive retrieval depth based on query complexity (include/rag/adaptive_retrieval.h, src/rag/adaptive_retrieval.cpp) (Target: Q2 2026) — QueryComplexity tiers (SIMPLE/MODERATE/COMPLEX/VERY_COMPLEX); connective/question-word heuristic; IComplexityScorer plugin; top_k + similarity_threshold scaling; factory helpers (lightweight, balanced, high-recall); 15 unit tests
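The budget arithmetic described for ContextWindowBudget and the greedy fill of RAGContextAssembler can be sketched as follows. The ceil(chars/3.5) estimator, the 20% response floor, and the 4096 fallback come from the roadmap; everything else is an assumption for illustration, including the names `estimateTokens`, `Budget`, `computeBudget`, and `greedyFill`, treating a context length of 0 as "unknown", and skipping (rather than truncating) chunks that do not fit.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Heuristic estimator from the roadmap: tokens = ceil(chars / 3.5).
std::size_t estimateTokens(const std::string& text) {
    return static_cast<std::size_t>(std::ceil(static_cast<double>(text.size()) / 3.5));
}

struct Budget {
    std::size_t available_context_tokens;
    std::size_t reserved_response_tokens;
};

// Reserves max(min_response, 20% of the window) for the response and returns
// what is left for retrieved context after the system prompt and query.
Budget computeBudget(std::size_t model_ctx,
                     const std::string& system_prompt,
                     const std::string& query,
                     std::size_t min_response) {
    if (model_ctx == 0) model_ctx = 4096;             // assumed: 0 means unknown
    const std::size_t floor20 = (model_ctx + 4) / 5;  // ceil(model_ctx / 5)
    const std::size_t reserved = std::max(min_response, floor20);
    const std::size_t used =
        estimateTokens(system_prompt) + estimateTokens(query) + reserved;
    const std::size_t available = model_ctx > used ? model_ctx - used : 0;
    return {available, reserved};
}

// Greedy fill: walk chunks in rank order and keep every chunk that still fits.
// Returns the indices of the selected chunks. Truncation is omitted here.
std::vector<std::size_t> greedyFill(const std::vector<std::string>& ranked_chunks,
                                    std::size_t budget_tokens) {
    std::vector<std::size_t> selected;
    std::size_t remaining = budget_tokens;
    for (std::size_t i = 0; i < ranked_chunks.size(); ++i) {
        const std::size_t cost = estimateTokens(ranked_chunks[i]);
        if (cost <= remaining) {
            selected.push_back(i);
            remaining -= cost;
        }
    }
    return selected;
}
```

For an unknown context length and empty prompts, the sketch reserves 820 tokens (20% of 4096, rounded up) and leaves 3276 for context. Reserving the response budget before filling is what keeps retrieved chunks from crowding out the answer.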

Phase 8: Loop 1–4 Explicit Orchestration & Federated RLAIF — IMPL-A2 + IMPL-A3 (Status: Completed ✅)

Paper 1, §4.4 The Four Self-Optimising Loops / §5.4 ContinuousLearningOrchestrator. Issues: IMPL-A2 · IMPL-A3

Phase 9: AI Reliability & Safety Evaluation Program (Status: Completed ✅)

Benchmark design completed (Target: Q2 2026): cross-domain goldenset harness (legal/medical/financial) via RAGTestCase batches, red-team injection scenarios via AdversarialTester, and standardized severity-ready outputs in BatchEvaluationResult.
Measurement pipeline completed (Target: Q2 2026): deterministic offline replay via BatchEvaluator::evaluateBatch, online hallucination drift monitoring/alerting via HallucinationDashboard, and decision traceability coverage metrics (traceable_decisions/untraceable_decisions) in BatchEvaluationResult.
Guardrail optimization completed (Target: Q2 2026): prompt-injection scenario accounting + success-rate tracking, bias/fairness drift detection (bias_fairness_drift_rate), groundedness computation (groundedness_rate), and cost-to-quality efficiency metric (cost_to_quality_efficiency) in BatchEvaluator.
Release gates completed (Target: Q2 2026): configurable gate thresholds in BatchEvaluatorConfig (hallucination, groundedness, injection success, bias drift, p95 latency, cost efficiency, traceability) with blocking decision (release_gates_passed) and explicit regression reasons (failed_release_gates).
Focused validation completed: tests/test_rag_batch_evaluator.cpp covers injection success-rate computation, traceability coverage, bounded reliability-score ranges, and release-gate blocking behavior.
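The release-gate decision can be pictured with the sketch below, which checks a subset of the listed metrics against thresholds and records why a release is blocked. The field names `release_gates_passed` and `failed_release_gates` follow the roadmap; `GateThresholds`, `BatchMetrics`, `GateDecision`, `checkReleaseGates`, the gate labels, and all default threshold values are hypothetical and do not reflect the actual BatchEvaluatorConfig.

```cpp
#include <string>
#include <vector>

// Illustrative thresholds only; the defaults here are assumptions.
struct GateThresholds {
    double max_hallucination_rate = 0.05;
    double min_groundedness_rate = 0.90;
    double max_injection_success_rate = 0.01;
    double max_p95_latency_ms = 2000.0;
};

struct BatchMetrics {
    double hallucination_rate;
    double groundedness_rate;
    double injection_success_rate;
    double p95_latency_ms;
};

struct GateDecision {
    bool release_gates_passed;
    std::vector<std::string> failed_release_gates;  // one entry per violated gate
};

// A release passes only if every gate holds; each violation is recorded by name
// so that a blocked release carries explicit regression reasons.
GateDecision checkReleaseGates(const BatchMetrics& m, const GateThresholds& t) {
    std::vector<std::string> failed;
    if (m.hallucination_rate > t.max_hallucination_rate) failed.push_back("hallucination");
    if (m.groundedness_rate < t.min_groundedness_rate) failed.push_back("groundedness");
    if (m.injection_success_rate > t.max_injection_success_rate) failed.push_back("injection_success");
    if (m.p95_latency_ms > t.max_p95_latency_ms) failed.push_back("p95_latency");
    const bool passed = failed.empty();
    return {passed, failed};
}
```

Collecting all failed gates, instead of stopping at the first, gives a complete list of regressions in a single evaluation run.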

Phase 10: Ontology Integration & Semantic Network (Status: Completed ✅, Target: Q4 2026 – Q3 2027)

Known Issues & Limitations

Evaluation accuracy depends on quality of the injected LLM judge model.
Thorough mode (~2 s latency) is not suitable for real-time interactive use.
Document chunking was previously not built in; it is now provided by DocumentSplitter (configurable chunk size, overlap, and strategy).

DELEGATE-52 Benchmark Integration

Scientific basis: Laban et al., "LLMs Corrupt Your Documents When You Delegate" (arXiv:2604.15597)

Status: Phase 1–6 Completed ✅ (Target: Q3 2026)

Current Status

Implemented in include/rag/delegate_evaluator.h + src/rag/delegate_evaluator.cpp. 18 unit tests in tests/test_delegate_evaluator.cpp (CTest target: DelegateEvaluatorFocusedTests). Performance benchmark in benchmarks/bench_delegate_evaluator.cpp.

Completed

Planned

Extend to 10+ domains via IDomainEvaluator plugin (Target: Q4 2026)
Connect RoundTripSimulator to AgenticRAG as a pre-production safety net (Target: Q4 2026) → AgenticRAGConfig::RelayGuardConfig; AgenticRAGResult::delegate_relay; best-effort post-loop relay; tests/test_agentic_rag_relay.cpp (ARR-01..04)
Persist RS@k history via IDocumentStore for trend analysis (Target: Q1 2027) → IRoundTripEditor + StoreBackedRoundTripEditor (include/document/round_trip_editor.h, src/document/round_trip_editor.cpp); RelayResult::persistence_write_failures counter; DE-16/DE-16b tests

Domain Comparison

| Aspect | DELEGATE-52 (Paper) | ThemisDB implementation |
|---|---|---|
| Domains | 52 | 4 core domains (JSON, AQL, Text, XML), extensible |
| EditFn | Real LLM via OpenAI/Azure | Injectable `EditFn` lambda (LLM-agnostic) |
| Dataset | 234 HuggingFace environments | Synthetic in-process fixtures |
| RS@k | Domain-specific scorers from the repo | Custom scorers, methodologically equivalent |
| Goal | Benchmark 19 LLMs | Quality assurance for agentic workflows |

Breaking Changes

Evaluator scoring API (0–1 float range) is stable from v1.x.
JudgeConfig fields may gain new optional parameters; backward-compatible.

Latent Symbols (Unused Functions Audit)

As of 2026-04-20. Source: src/UNUSED_FUNCTIONS_REPORT.md

🧪 NUR_TESTS (implemented, no production caller)

ABTestingFramework: A/B testing for RAG pipelines (retrieval and ranking strategies)

Action: add a ROADMAP ticket for production integration, or mark as CANDIDATE_FOR_REMOVAL.
