Roadmap note: convert vague bullets without acceptance criteria into checkbox tasks. Format:
- [ ] <Task> (Target: <Q/Year>).
v2.0.0 – Production-ready Retrieval-Augmented Generation system. 27 implementation files covering evaluation, knowledge gap detection, ethical compliance, multi-judge orchestration, streaming retrieval, cross-encoder re-ranking, hybrid BM25+vector retrieval, batch evaluation, calibration, LRU evaluation caching, REPLUG-style LLM-scored fusion, and Constitutional AI / RLAIF training pipeline.
- RAGJudge – main orchestrator for multi-dimensional evaluation
- KnowledgeGapDetector – three-level gap detection system
- LLM integration bridge to InferenceEngineEnhanced
- FaithfulnessEvaluator – fact-checking against retrieved sources
- RelevanceEvaluator – query-answer alignment scoring (TF-cosine semantic similarity)
- CompletenessEvaluator – query aspect coverage measurement
- CoherenceEvaluator – structure and readability scoring
- BiasDetector – ethical compliance checking
- ClaimExtractor – atomic claim decomposition
- ResponseParser – LLM evaluation response parsing
- PromptTemplates – template and few-shot example management
- JudgeConfig – configuration validation
- RubricEvaluator – custom rubric evaluation
- JudgeEnsemble – multi-judge voting strategies
- PairwiseComparator – head-to-head response comparison
- CoTEvaluator – chain-of-thought evaluation
- GEvalEvaluator – G-Eval framework (Liu et al., 2023)
- LLMJudgeIntegration – judge orchestration
- LLMMetaAnalyzer – performance meta-analysis
- Fast (~100 ms), Balanced (~500 ms), and Thorough (~2 s) evaluation modes
- StreamingRetriever – incremental context window filling (Issue: #2437)
- CrossEncoderReranker – re-ranking with heuristic scorer and ONNX stub (Issue: #2247)
- HallucinationDashboard – rolling-window hallucination rate tracking (Issue: #2438)
- DocumentSummarizer – multi-document summarization before context injection (Issue: #2239)
- KnowledgeGraphRetriever – knowledge graph-augmented retrieval with entity linking (Issue: #2242)
- DocumentSplitter – configurable chunk size, overlap, and strategy for document splitting (Issue: #2238)
- HybridRetriever – BM25 + vector fusion with configurable RRF weights (Issue: #1968)
- RAGJudge::extractClaims() – LLM-first + heuristic fallback dispatch (Issue: #1296, Target: Q1 2026) — Inputs: answer text; Outputs: vector of claim strings; Errors: JSON parse failure falls back to heuristic; Tests: unit + LLM mock; Perf: <500ms for 1k-char input
- RAGJudge::verifyClaimAgainstDocuments() – NLI → LLM → semantic fallback dispatch (Issue: #1296, Target: Q1 2026) — Inputs: claim + documents; Outputs: bool support decision; Errors: NLI/LLM failure cascades to term-overlap; Tests: unit + NLI mock; Perf: <200ms per claim
- NLIFaithfulnessVerifier integrated into RAGJudge for entailment-based claim verification (Issue: #1296, Target: Q1 2026) — Member of RAGJudge::Impl; threshold: 0.7; graceful degradation when model not loaded
- FaithfulnessEvaluator::extractClaims() – LLM-first + sentence-boundary fallback (Issue: #1296, Target: Q1 2026) — Inputs: answer text; Outputs: vector of Claim structs; Errors: JSON parse failure falls back to regex; LLM confidence: 0.9, heuristic confidence: 0.6
- LearningMetrics – sliding-window metrics with mean/std-dev/trend export (Issue: #1296, Target: Q1 2026) — Tracks accuracy, faithfulness, relevance, completeness, coherence; CSV export; thread-safe with std::mutex
- Citation highlighting (map answer sentences to source chunks) (Issue: #2436, #2000)
- Online learning from evaluation feedback – adaptive retrieval via Bayesian optimization over top_k and similarity_threshold, driven by both user feedback and RAGJudge evaluation confidence scores; getOptimizedRetrievalParams() API (Issue: #2244)
- EvaluationCache – thread-safe LRU cache with TTL expiry, invalidation triggers, and statistics tracking (evaluation_cache.cpp)
- CalibrationManager – temperature scaling, Platt scaling, and isotonic regression to align judge scores with human annotations; ECE/Brier/correlation metrics (calibration_manager.cpp)
- BatchEvaluator – parallel batch processing with configurable worker threads, async evaluation via futures/promises, and aggregated statistics (batch_evaluator.cpp)
- batchConvertToRetrievedDocuments – implemented with EmbeddingFunction callback; sequential per-query K-NN search; no longer a placeholder; DO NOT USE warning removed (rag_integration_helpers.h)
- RAGIngestionBridge – connects IngestionToolbox to the RAG pipeline (include/rag/rag_ingestion_bridge.h, src/rag/rag_ingestion_bridge.cpp; themis::rag namespace): indexDocument(), enrichRetrievedDocuments(), extractEntitiesForContext(), buildEntityContext(); IndexResult return type; thread-safe (v0.1.0)
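The HybridRetriever entry above mentions BM25 + vector fusion with configurable RRF weights. As a rough illustration of what weighted Reciprocal Rank Fusion computes, here is a minimal sketch. The function name `fuseRrf`, its signature, and the default `k = 60` are assumptions made for this example and are not taken from the HybridRetriever API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper for illustration; not the HybridRetriever API.
// Weighted Reciprocal Rank Fusion over two ranked lists of document ids:
//   score(d) = w_bm25 / (k + rank_bm25(d)) + w_vec / (k + rank_vec(d))
// Ranks are 1-based; a document missing from a list contributes 0 for it.
inline std::vector<std::pair<std::string, double>> fuseRrf(
    const std::vector<std::string>& bm25_ranked,
    const std::vector<std::string>& vector_ranked,
    double w_bm25 = 1.0, double w_vec = 1.0, double k = 60.0) {
    assert(k > 0.0 && w_bm25 >= 0.0 && w_vec >= 0.0);

    std::unordered_map<std::string, double> scores;
    for (std::size_t i = 0; i < bm25_ranked.size(); ++i) {
        scores[bm25_ranked[i]] += w_bm25 / (k + static_cast<double>(i + 1));
    }
    for (std::size_t i = 0; i < vector_ranked.size(); ++i) {
        scores[vector_ranked[i]] += w_vec / (k + static_cast<double>(i + 1));
    }

    std::vector<std::pair<std::string, double>> fused(scores.begin(), scores.end());
    std::sort(fused.begin(), fused.end(), [](const auto& a, const auto& b) {
        if (a.second != b.second) return a.second > b.second;  // higher score first
        return a.first < b.first;                               // deterministic tie-break by id
    });
    return fused;
}
```

A document that appears in both lists accumulates two terms and therefore tends to outrank documents found by only one retriever. Setting one weight to 0 reduces the fusion to the other retriever's ranking.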
(none currently in progress)
- OntologyAwareRetriever – ontology-driven entity retrieval via OntologyManager (Target: Q4 2026)
  - Affected: include/rag/ontology_aware_retriever.h (new), src/rag/ontology_aware_retriever.cpp (new)
  - Inputs: query text + domain ontology path; Outputs: RetrievedDocument list with ontology context
  - Expected behavior: entity linking uses OntologyManager::isA() for hypernym expansion; retrieval paths follow the relation types permitted by allowedEdgeTypes(); KnowledgeGraphRetriever is extended with reasoner hooks
  - Constraints: ≤ 50 ms latency for top-20 retrieval on graphs with ≤ 1 M nodes
  - Errors: unknown entities → fallback to BM25; ontology not loaded → standard retrieval
  - Tests: OAR-01..OAR-08 in tests/rag/test_ontology_aware_retriever.cpp
  - Perf: entity expansion ≤ 5 ms; total top-20 latency ≤ 50 ms
- Agentic RAG with iterative retrieval loops (rag/agentic_rag.cpp) (Issue: #2241)
- Multi-modal RAG (image + text retrieval) (rag/multimodal_rag.cpp) (Issue: #2243)
- Online learning from evaluation feedback (adaptive retrieval) (Issue: #2244)
- Distributed RAG evaluation across multiple judge models (Issue: #2245) – rag/distributed_rag_evaluator.h/.cpp; thread-pool parallel dispatch; MEAN/WEIGHTED_MEAN/MAJORITY_VOTING/BEST_OF_N aggregation; inter-judge agreement metric; factory helpers
- Performance benchmarks (recall@10, latency targets) – benchmarks/bench_rag_evaluation.cpp; recall@K harness; FAST/BALANCED/THOROUGH latency; batch throughput; DistributedRAGEvaluator benchmark; PromptInjectionDetector scan throughput; end-to-end pipeline
- Security audit (prompt injection in retrieved context) – rag/prompt_injection_detector.h/.cpp; pattern-based detection (instruction-override, system-prompt-leak, delimiter-escape, role-injection, markup-injection, Unicode bidi); density threshold; PromptInjectionSanitizer; full unit test coverage
- Semantic-network integration: KnowledgeGraphRetriever + KnowledgeGraphReasoner for multi-hop reasoning (Target: Q3 2027)
  - Affected: include/rag/knowledge_graph_retriever.h, src/rag/knowledge_graph_retriever.cpp
  - Expected behavior: retrieve() automatically triggers KnowledgeGraphReasoner::infer() for up to max_inference_hops hops; inference chains are inserted as additional context; explanation chains are available in RetrievedDocument.metadata["reasoning_chain"]
  - Constraints: multi-hop reasoning ≤ 200 ms P99 (≤ 5 hops, ≤ 100 k edges)
  - Errors: reasoning timeout → fallback to direct KG query without inference
  - Tests: KGR-RAG-01..KGR-RAG-06 in tests/rag/test_knowledge_graph_retriever_reasoning.cpp
- LoRA-enhanced domain retrieval for pattern recognition (Target: Q2 2027)
  - Affected: include/rag/lora_enhanced_retriever.h (new), src/rag/lora_enhanced_retriever.cpp (new)
  - Expected behavior: domain-specific LoRA adapters (e.g. "legal_rag_v1", "medical_rag_v1") re-rank retrieval results; MultiLoRAManager::selectAdapterForQuery() selects an adapter by query-embedding similarity to the adapter's domain
  - Constraints: LoRA re-ranking ≤ 100 ms for top-50 documents; guard THEMIS_ENABLE_LLM
  - Errors: no matching adapter → standard RRF fusion; adapter load failure → fallback
  - Tests: LER-01..LER-06 in tests/rag/test_lora_enhanced_retriever.cpp
  - Perf: re-ranking improvement of MRR@10 ≥ +5% over a pure RRF baseline
  - Knowledge representation: the LoRA adapter implicitly encodes domain-specific concept hierarchies
- RAGJudge – main orchestrator for multi-dimensional evaluation
- KnowledgeGapDetector – three-level gap detection system
- LLM integration bridge to InferenceEngineEnhanced
- FaithfulnessEvaluator, RelevanceEvaluator, CompletenessEvaluator, CoherenceEvaluator
- BiasDetector – ethical compliance checking
- ClaimExtractor, ResponseParser, PromptTemplates, JudgeConfig
- RubricEvaluator, JudgeEnsemble, PairwiseComparator
- CoTEvaluator, GEvalEvaluator (Liu et al., 2023), LLMJudgeIntegration, LLMMetaAnalyzer
- Fast (~100 ms), Balanced (~500 ms), and Thorough (~2 s) evaluation modes
- Streaming retrieval with incremental context window filling
- Re-ranking layer with cross-encoder model integration
- Hallucination rate tracking dashboard
- Hybrid retrieval (BM25 + vector) with configurable RRF weights
- Citation highlighting (map answer sentences to source chunks)
- Configurable chunk size and overlap for document splitting
- Multi-document summarization before context injection
- Per-query evaluation report export (JSON / HTML) (Issue: #2240)
- Agentic RAG with iterative retrieval loops (rag/agentic_rag.cpp)
- Knowledge graph-augmented retrieval (entity linking)
- Multi-modal RAG (image + text retrieval) (rag/multimodal_rag.cpp)
- Online learning from evaluation feedback (adaptive retrieval)
- Distributed RAG evaluation across multiple judge models (rag/distributed_rag_evaluator.h/.cpp) (Issue: #2245) – thread-pool parallel dispatch; MEAN/WEIGHTED_MEAN/MAJORITY_VOTING/BEST_OF_N aggregation; factory helpers
- Performance benchmark harness (benchmarks/bench_rag_evaluation.cpp) – recall@K (K=1/5/10/20/50); FAST/BALANCED/THOROUGH latency; batch throughput; end-to-end pipeline benchmark
- Prompt injection detection and sanitization (rag/prompt_injection_detector.h/.cpp) – security audit for retrieved context; pattern-based heuristic detector; PromptInjectionSanitizer with configurable thresholds
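The distributed evaluation entry above names four aggregation strategies (MEAN, WEIGHTED_MEAN, MAJORITY_VOTING, BEST_OF_N) without spelling out their semantics. The sketch below shows one plausible reading. All type and function names here are hypothetical, and two interpretations are assumptions: MAJORITY_VOTING is treated as a pass/fail vote against a threshold, and BEST_OF_N as the maximum judge score. The actual DistributedRAGEvaluator may define them differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names; the strategy labels mirror the roadmap entry only.
enum class AggregationSketch { MEAN, WEIGHTED_MEAN, MAJORITY_VOTING, BEST_OF_N };

struct JudgeScoreSketch {
    double score;   // judge score, assumed to lie in [0, 1]
    double weight;  // only used by WEIGHTED_MEAN
};

inline double aggregateScores(const std::vector<JudgeScoreSketch>& judges,
                              AggregationSketch mode,
                              double pass_threshold = 0.5) {
    assert(!judges.empty());

    double sum = 0.0, weighted_sum = 0.0, weight_total = 0.0;
    double best = judges.front().score;
    std::size_t votes = 0;
    for (const auto& j : judges) {
        sum += j.score;
        weighted_sum += j.weight * j.score;
        weight_total += j.weight;
        best = std::max(best, j.score);
        if (j.score >= pass_threshold) ++votes;  // assumed: a "pass" vote
    }
    const double mean = sum / static_cast<double>(judges.size());

    switch (mode) {
        case AggregationSketch::MEAN:
            return mean;
        case AggregationSketch::WEIGHTED_MEAN:
            // Fall back to the plain mean if all weights are zero.
            return weight_total > 0.0 ? weighted_sum / weight_total : mean;
        case AggregationSketch::MAJORITY_VOTING:
            // Assumed: 1.0 if a strict majority of judges pass, else 0.0.
            return 2 * votes > judges.size() ? 1.0 : 0.0;
        case AggregationSketch::BEST_OF_N:
            return best;  // assumed: highest single judge score
    }
    return mean;  // not reached for valid enum values
}
```

Computing all accumulators in a single pass keeps the function short; a production evaluator would more likely dispatch per strategy and also report an inter-judge agreement figure.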
- ReplugRetriever – REPLUG-style LLM-scored retrieval fusion (rag/replug_retriever.h/.cpp) (Target: Q1 2026) – Inputs: query + RetrievedDocument list; Outputs: ReplugFusionResult with fused scores; λ interpolation, softmax temperature, min_retrieval_score filter, REPLUG-LSR weight update via KL gradient; ILLMScorer plugin; HeuristicLLMScorer (Jaccard); 30 unit tests
- RLAIFTrainer – Constitutional AI + RLAIF preference dataset generation (rag/rlaif_trainer.h/.cpp) (Target: Q1 2026) – Inputs: query + draft response; Outputs: PreferencePair (prompt, chosen, rejected); critique-revision loop; IAIJudge plugin; HeuristicAIJudge; AIPrinciple registry; processBatch(); RLAIFConfig; 30 unit tests
- ContextWindowBudget – central token-budget model (include/llm/context_window_budget.h) (Target: Q2 2026) – Inputs: model_ctx, system_prompt, query, min_response; Outputs: available_context_tokens, reserved_response_tokens; heuristic estimator ceil(chars/3.5); 20% floor on response reservation; fallback 4096; 30 unit tests
- RAGContextAssembler – budget-aware chunk selection (include/rag/rag_context_assembler.h, src/rag/rag_context_assembler.cpp) (Target: Q2 2026) – Greedy Fill with Response Guard; truncation with configurable marker; computeMaxTokens(); 30 unit tests
- MultiStepRAGOrchestrator – Map-Reduce and Iterative strategies (include/rag/multi_step_rag.h, src/rag/multi_step_rag.cpp) (Target: Q2 2026) – Map: batch partitioning bounded by context budget; Reduce: partial-answer synthesis; Iterative: gap-detection loop, max_iterations guard, deduplication; factory helpers; 15 unit tests
- LlamaCppPlugin::loadModel() reads n_ctx / context_length from config JSON → ModelInfo::context_length; fallback 4096 (Target: Q2 2026)
- LlamaCppPlugin::generateRAG() replaced naive doc concat with RAGContextAssembler; max_tokens capped via computeMaxTokens() (Target: Q2 2026)
- RAGContext::max_context_tokens set to 0 (dynamic fallback); response_budget_tokens field added (Target: Q2 2026)
- RAGPromptConfig::reserved_response_tokens field added (default: 512) (Target: Q2 2026)
- MultiHopReasoner – multi-hop reasoning with query decomposition (include/rag/multi_hop_reasoner.h, src/rag/multi_hop_reasoner.cpp) (Target: Q2 2026) – heuristic + LLM-based decomposition; per-hop retrieval + inference with context injection; answer composition; factory helpers (single-hop, balanced, deep-reasoning); 15 unit tests
- AdaptiveRetrieval – adaptive retrieval depth based on query complexity (include/rag/adaptive_retrieval.h, src/rag/adaptive_retrieval.cpp) (Target: Q2 2026) – QueryComplexity tiers (SIMPLE/MODERATE/COMPLEX/VERY_COMPLEX); connective/question-word heuristic; IComplexityScorer plugin; top_k + similarity_threshold scaling; factory helpers (lightweight, balanced, high-recall); 15 unit tests
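The ContextWindowBudget entry above lists a heuristic estimator of ceil(chars/3.5), a 20% floor on the response reservation, and a 4096-token fallback. The sketch below shows how those three rules could fit together. The struct and function names are hypothetical, and the exact combination (the floor is taken as 20% of the model context, and the fallback applies when model_ctx is 0) is an assumption rather than the behaviour of the real header.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>

// Hypothetical result type; field names follow the roadmap entry.
struct BudgetSketch {
    std::size_t available_context_tokens;
    std::size_t reserved_response_tokens;
};

// Heuristic from the roadmap entry: roughly 3.5 characters per token.
inline std::size_t estimateTokens(const std::string& text) {
    return static_cast<std::size_t>(
        std::ceil(static_cast<double>(text.size()) / 3.5));
}

inline BudgetSketch computeBudget(std::size_t model_ctx,
                                  const std::string& system_prompt,
                                  const std::string& query,
                                  std::size_t min_response) {
    // Assumed: a context size of 0 means "unknown" and triggers the fallback.
    if (model_ctx == 0) model_ctx = 4096;

    // Assumed: the 20% floor is relative to the model context, rounded up.
    const std::size_t floor_tokens = (model_ctx + 4) / 5;
    std::size_t reserved = std::max(min_response, floor_tokens);
    reserved = std::min(reserved, model_ctx);  // never reserve more than exists

    const std::size_t fixed =
        estimateTokens(system_prompt) + estimateTokens(query) + reserved;
    assert(reserved <= model_ctx);

    // Clamp at zero instead of letting the unsigned subtraction wrap around.
    const std::size_t available = fixed >= model_ctx ? 0 : model_ctx - fixed;
    return {available, reserved};
}
```

With an unknown context size and a requested minimum of 512 response tokens, this sketch reserves 820 tokens (20% of 4096) and leaves 3276 for retrieved chunks when the prompt and query are empty.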
Phase 8: Loop 1–4 Explicit Orchestration & Federated RLAIF — IMPL-A2 + IMPL-A3 (Status: Completed ✅)
Paper 1 – §4.4 The Four Self-Optimising Loops / §5.4 ContinuousLearningOrchestrator. Issues: IMPL-A2 · IMPL-A3
- Expose explicit named loop-trigger methods on ContinuousLearningOrchestrator (Implemented: 2026-04-19):
  - triggerLoop1QueryExecution(const QueryExecutionOutcome&) (Loop 1 – ≤ 10 ms BaoOptimizer feedback)
  - triggerLoop2WorkloadAdaptation() (Loop 2 – 60 s interval, WorkloadAdaptiveOptimizer + HNSW)
  - triggerLoop3IndexLifecycle() (Loop 3 – hours/days, IndexSuggestionEngine)
  - triggerLoop4AdapterImprovement() (Loop 4 – weekly, IncrementalLoRATrainer)
- Add FEDERATED_ROUND_START event type to ContinuousLearningOrchestrator (IMPL-A3)
  - Fired after Loop 4 completes; 24 h minimum interval guard
  - Invokes ILoRAFederationCoordinator::startRound() when coordinator is injected
- Add setFederationCoordinator(ILoRAFederationCoordinator*) DI setter
- Loop-interference cooldown guard: setOptimizationCooldown(seconds) + per-loop timestamp map (RQ10)
- JSON context serialiser serializeLoopContext() → JSON ≤ 8 000 chars / ≈ 2 000 tokens
- RAGIngestionBridge::indexOptimizerLog() extension: index optimizer-log documents for RAG retrieval
- 14 unit tests in tests/test_clo_loops.cpp (test_clo_loops_focused CMake target):
  - CLO-L1-01…CLO-L1-03: Loop 1 trigger, outcome in context JSON, completion handler
  - CLO-L2-01…CLO-L2-03: Loop 2 trigger, context JSON, completion handler
  - CLO-L3-01…CLO-L3-02: Loop 3 advisory guardrail pass, context JSON
  - CLO-L4-01…CLO-L4-02: Loop 4 trigger, context JSON
  - CLO-FED-01: FEDERATED_ROUND_START fires after Loop 4 (no throw when coordinator absent)
  - CLO-COOL-01: 60 s cooldown blocks second trigger; different loop unaffected
  - SerializeContext_EmptyBeforeTrigger, SerializeContext_MultipleLoopsPresent
- LoopPhase enum on ContinuousLearningOrchestrator: LOOP_1_HNSW_QUERY, LOOP_2_WORKLOAD, LOOP_3_SCHEMA_INDEX, LOOP_4_RLAIF (include/rag/continuous_learning_orchestrator.h:249)
- triggerLoop(LoopPhase) – explicitly trigger a named learning loop; returns LoopResult (include/rag/continuous_learning_orchestrator.h:283)
- registerLoopCompletionHandler(LoopPhase, handler) – per-phase completion callback (include/rag/continuous_learning_orchestrator.h:293)
- TriggerEvent::FEDERATED_ROUND_START – fired automatically after a successful Loop-4 run with guardrail_passed == true (include/rag/continuous_learning_orchestrator.h:309)
- setFederationCoordinator(ILoRAFederationCoordinator*) DI setter (include/rag/continuous_learning_orchestrator.h:326)
- setTrainerForFederation(IncrementalLoRATrainer*) DI setter (include/rag/continuous_learning_orchestrator.h:342)
- Loop-interference cooldown guard: shared OptimizationLock with per-resource cooldown (RQ10)
- JSON context serialiser for Loop 1–3 outcome signals → ≤ 2 000 tokens context block
- RAGIngestionBridge extension: index optimizer-log documents for RAG retrieval
- 12 new unit tests in tests/test_continuous_learning_orchestrator_loops.cpp:
  - CLO-L1-01…CLO-L1-03: Loop 1 trigger updates BaoOptimizer hint
  - CLO-L2-01…CLO-L2-03: Loop 2 trigger updates WorkloadAdaptiveOptimizer
  - CLO-L3-01…CLO-L3-02: Loop 3 trigger calls IndexSuggestionEngine
  - CLO-L4-01…CLO-L4-02: Loop 4 trigger calls IncrementalLoRATrainer
  - CLO-FED-01: FEDERATED_ROUND_START fires after Loop 4 + 24 h guard respected
  - CLO-COOL-01: cooldown guard prevents concurrent loop interference
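The Phase 8 items describe a loop-interference cooldown guard backed by a per-loop timestamp map, with CLO-COOL-01 checking that a 60 s cooldown blocks a second trigger while a different loop stays unaffected. A minimal sketch of that idea follows. The LoopPhase values are taken from the list above; the class name `CooldownGuardSketch`, the method `tryAcquire`, and the injected time point are assumptions for illustration. The sketch is single-threaded, whereas a real orchestrator would presumably need locking.

```cpp
#include <cassert>
#include <chrono>
#include <map>

// Enum values as named in the roadmap; the rest of this block is hypothetical.
enum class LoopPhase { LOOP_1_HNSW_QUERY, LOOP_2_WORKLOAD, LOOP_3_SCHEMA_INDEX, LOOP_4_RLAIF };

class CooldownGuardSketch {
public:
    using Clock = std::chrono::steady_clock;

    explicit CooldownGuardSketch(std::chrono::seconds cooldown) : cooldown_(cooldown) {
        assert(cooldown.count() >= 0);
    }

    // Returns true and records `now` if the loop may run.
    // Returns false if the same loop ran less than `cooldown_` ago.
    // The caller passes the time point so that tests need not sleep.
    bool tryAcquire(LoopPhase phase, Clock::time_point now) {
        const auto it = last_run_.find(phase);
        if (it != last_run_.end() && now - it->second < cooldown_) {
            return false;  // still cooling down for this loop
        }
        last_run_[phase] = now;  // per-loop timestamp
        return true;
    }

private:
    std::chrono::seconds cooldown_;
    std::map<LoopPhase, Clock::time_point> last_run_;
};
```

Because timestamps are keyed by loop, a blocked Loop 2 does not delay Loop 4. Passing the clock reading in from outside is a testing convenience chosen for this sketch, not something the roadmap specifies.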
- Benchmark design completed (Target: Q2 2026): cross-domain golden-set harness (legal/medical/financial) via RAGTestCase batches, red-team injection scenarios via AdversarialTester, and standardized severity-ready outputs in BatchEvaluationResult.
- Measurement pipeline completed (Target: Q2 2026): deterministic offline replay via BatchEvaluator::evaluateBatch, online hallucination drift monitoring/alerting via HallucinationDashboard, and decision traceability coverage metrics (traceable_decisions / untraceable_decisions) in BatchEvaluationResult.
- Guardrail optimization completed (Target: Q2 2026): prompt-injection scenario accounting + success-rate tracking, bias/fairness drift detection (bias_fairness_drift_rate), groundedness computation (groundedness_rate), and cost-to-quality efficiency metric (cost_to_quality_efficiency) in BatchEvaluator.
- Release gates completed (Target: Q2 2026): configurable gate thresholds in BatchEvaluatorConfig (hallucination, groundedness, injection success, bias drift, p95 latency, cost efficiency, traceability) with blocking decision (release_gates_passed) and explicit regression reasons (failed_release_gates).
- Focused validation completed: tests/test_rag_batch_evaluator.cpp covers injection success-rate computation, traceability coverage, bounded reliability-score ranges, and release-gate blocking behavior.
Phase 10: Ontology Integration & Semantic Network (Status: Completed ✅, Target: Q4 2026 – Q3 2027)
- OntologyAwareRetriever – entity expansion via OntologyManager::isA(); permitted paths via allowedEdgeTypes() (Target: Q4 2026) → include/rag/ontology_aware_retriever.h, src/rag/ontology_aware_retriever.cpp
- KnowledgeGraphRetriever + KnowledgeGraphReasoner integration: multi-hop reasoning up to 5 hops; explanation chains in document metadata (Target: Q2 2027) → include/rag/knowledge_graph_retriever.h, src/rag/knowledge_graph_retriever.cpp
- LoRAEnhancedRetriever – LoRA adapter re-ranking for domain-specific retrieval; MRR@10 ≥ +5% (Target: Q2 2027) → include/rag/lora_enhanced_retriever.h, src/rag/lora_enhanced_retriever.cpp
- Tests: OAR-01..OAR-08, KGR-RAG-01..KGR-RAG-06, LER-01..LER-06 → tests/rag/test_ontology_aware_retriever.cpp, tests/rag/test_knowledge_graph_retriever_reasoning.cpp, tests/rag/test_lora_enhanced_retriever.cpp
- Unit tests coverage > 80% (streaming_retriever: 28 test cases; reranker: 30+ test cases; document_splitter: 37 test cases)
- Unit tests coverage > 80% (streaming_retriever: 28 tests; reranker: 30+ tests; hybrid_retriever: 31 tests)
- Unit tests for LearningMetrics (test_learning_metrics.cpp: recordEvaluation, computeMetrics, exportMetrics, printReport, window enforcement)
- Unit tests for ClaimExtractor (test_claim_extractor.cpp: extract, verify, calculateFaithfulness, SelfConsistencyEvaluator)
- Unit tests for CitationHighlighter (test_rag_citation_highlighter.cpp: comprehensive coverage; available in all build variants)
- Unit tests for EvaluationReportExporter (test_rag_evaluation_report_exporter.cpp: JSON/HTML export; file I/O; edge cases; factory; available in all build variants)
- Unit tests for DistributedRAGEvaluator (test_rag_distributed_evaluator.cpp: construction validation, aggregation strategies, meta fields, factory helpers, batch evaluate)
- Unit tests for PromptInjectionDetector and Sanitizer (test_rag_prompt_injection.cpp: benign pass-through, instruction override, system-prompt leak, delimiter escape, role injection, markup injection, Unicode bidi, sanitizer truncation/replacement)
- Unit tests for ReplugRetriever (test_rag_replug_retriever.cpp: ILLMScorer, HeuristicLLMScorer, fuse(), top_k truncation, min_retrieval_score filtering, weight updates, factory helpers; 30 tests)
- Unit tests for RLAIFTrainer (test_rag_rlaif_trainer.cpp: IAIJudge, HeuristicAIJudge, runTrainingStep(), createPreferencePair(), processBatch(), principle management, dataset access, stats; 30 tests)
- Unit tests for ContextWindowBudget (test_context_window_budget.cpp: estimateTokens, tokensToChars, compute, reserved_response_tokens enforcement, available_context_tokens arithmetic, helpers; 30 tests)
- Unit tests for RagContextAssembler (test_rag_context_assembler.cpp: empty edge cases, single chunk fit/truncation, greedy fill, response-guard, truncation marker, computeMaxTokens; 30 tests)
- Unit tests for MultiStepRAGOrchestrator (test_multi_step_rag.cpp: map-reduce single-pass, multi-batch, iterative cap, factory helpers; 15 tests)
- Unit tests for MultiHopReasoner (test_rag_multi_hop_reasoner.cpp: 15 tests; A config/factory, B decomposition heuristic+LLM, C pipeline single/multi/error cases)
- Unit tests for AdaptiveRetrieval (test_rag_adaptive_retrieval.cpp: 15 tests; A config/factory, B complexity analysis, C params + custom scorer injection)
- Performance benchmarks (benchmarks/bench_rag_evaluation.cpp: recall@K harness, FAST/BALANCED/THOROUGH latency, distributed evaluator, injection scan throughput, end-to-end pipeline)
- Integration tests (full pipeline: retrieve → generate → evaluate) – test_rag_pipeline_integration.cpp (heuristic/FAST mode, no live LLM required)
- Performance benchmarks (recall@10, latency per mode)
- Security audit (prompt injection in retrieved context)
- Documentation complete (streaming_retriever.h, reranker.h, hybrid_retriever.h: full Doxygen API docs)
- API stability guaranteed (streaming_retriever API: stable; CrossEncoderConfig: stable; HybridRetrieverConfig: stable)
- Evaluation accuracy depends on quality of the injected LLM judge model.
- Thorough mode (~2 s latency) is not suitable for real-time interactive use.
- Resolved: the former lack of a built-in document chunking strategy is now covered by DocumentSplitter (configurable chunk size, overlap, and strategy).
Scientific basis: Laban et al., "LLMs Corrupt Your Documents When You Delegate" (arXiv:2604.15597)
Status: Phase 1–6 Completed ✅ (Target: Q3 2026)
Implemented in include/rag/delegate_evaluator.h + src/rag/delegate_evaluator.cpp.
18 unit tests in tests/test_delegate_evaluator.cpp (CTest target: DelegateEvaluatorFocusedTests).
Performance benchmark in benchmarks/bench_delegate_evaluator.cpp.
- IDomainEvaluator interface – RS@k in [0.0, 1.0], clamped (Target: Q2 2026)
- JsonDocumentEvaluator – field-level overlap; non-JSON fallback to PlainTextEvaluator (Target: Q2 2026)
- AqlQueryEvaluator – token-level Jaccard similarity (Target: Q2 2026)
- PlainTextEvaluator – normalised Levenshtein edit distance (Target: Q2 2026)
- XmlProcessEvaluator – element-count + attribute overlap (ARIS/BPMN domain) (Target: Q2 2026)
- MarkdownEvaluator – delegates to PlainTextEvaluator (Target: Q2 2026)
- RoundTripSimulator::run() – k forward→backward rounds; RS@k history; catastrophic counter (Target: Q2 2026)
- DelegateEvaluatorFactory – createForDomain(), createSimulator() (Target: Q2 2026)
- Edge-case handling: empty documents, EditFn exceptions (EDIT_FAILED), 0 rounds → RS@0 = 1.0 (Target: Q2 2026)
- 18 unit tests DE-01..DE-18 + ancillary checks (Target: Q2 2026)
- BM_DelegateEvaluator_JsonRoundTrip_10k benchmark (Target: Q3 2026)
- Doxygen API docs for all public APIs (Target: Q2 2026)
- Extend to 10+ domains via IDomainEvaluator plugin (Target: Q4 2026)
- Connect RoundTripSimulator to AgenticRAG as a pre-production safety net (Target: Q4 2026) → AgenticRAGConfig::RelayGuardConfig; AgenticRAGResult::delegate_relay; best-effort post-loop relay; tests/test_agentic_rag_relay.cpp (ARR-01..04)
- Persist RS@k history via IDocumentStore for trend analysis (Target: Q1 2027) → IRoundTripEditor + StoreBackedRoundTripEditor (include/document/round_trip_editor.h, src/document/round_trip_editor.cpp); RelayResult::persistence_write_failures counter; DE-16/DE-16b tests
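The list above states that PlainTextEvaluator scores with a normalised Levenshtein edit distance, that RS@k is clamped to [0.0, 1.0], and that zero rounds yield RS@0 = 1.0. The sketch below shows how a round-trip score history could be computed under those rules. The function names are hypothetical, the EditFn signature (string in, string out) is an assumption, and normalising by the longer of the two strings is one common choice, not necessarily the one used in delegate_evaluator.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Classic two-row dynamic-programming Levenshtein distance.
inline std::size_t levenshtein(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, subst});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Similarity in [0, 1]: 1.0 means identical. Normalisation by the longer
// string is an assumption made for this sketch.
inline double plainTextSimilarity(const std::string& original, const std::string& candidate) {
    const std::size_t longest = std::max(original.size(), candidate.size());
    if (longest == 0) return 1.0;  // two empty documents are identical
    const double s = 1.0 - static_cast<double>(levenshtein(original, candidate)) /
                               static_cast<double>(longest);
    return std::clamp(s, 0.0, 1.0);
}

// Assumed signature; the roadmap only names EditFn.
using EditFn = std::function<std::string(const std::string&)>;

// Returns {RS@0, RS@1, ..., RS@rounds}; RS@0 is 1.0 by definition.
inline std::vector<double> roundTripHistory(const std::string& original,
                                            const EditFn& forward,
                                            const EditFn& backward,
                                            std::size_t rounds) {
    assert(forward && backward);
    std::vector<double> history{1.0};
    std::string doc = original;
    for (std::size_t r = 0; r < rounds; ++r) {
        doc = backward(forward(doc));  // one forward→backward round
        history.push_back(plainTextSimilarity(original, doc));
    }
    return history;
}
```

A lossless editor pair keeps every entry at 1.0, while an editor that drops one character per round from "abcd" produces 1.0, 0.75, 0.5. This sketch omits the exception handling (EDIT_FAILED) and the catastrophic counter mentioned in the list.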
| Aspect | DELEGATE-52 (Paper) | ThemisDB implementation |
|---|---|---|
| Domains | 52 | 4 core domains (JSON, AQL, Text, XML), extensible |
| EditFn | Real LLM via OpenAI/Azure | Injectable EditFn lambda (LLM-agnostic) |
| Dataset | 234 HuggingFace environments | Synthetic in-process fixtures |
| RS@k | Domain-specific scorers from the repo | Custom scorers, methodologically equivalent |
| Goal | Benchmark 19 LLMs | Quality assurance for agentic workflows |
- Evaluator scoring API (0–1 float range) is stable from v1.x.
- JudgeConfig fields may gain new optional parameters; backward-compatible.
As of: 2026-04-20 – Source: src/UNUSED_FUNCTIONS_REPORT.md
ABTestingFramework – A/B testing for RAG pipelines (retrieval/ranking strategies). Action: add a ROADMAP ticket for production integration or mark as CANDIDATE_FOR_REMOVAL.