Deliver one release stream that upgrades all five outcomes with a safe staged rollout:
- Better top-result relevance.
- Faster searches on large codebases.
- True hybrid retrieval (semantic + keyword + symbol).
- Richer context packs (best files + why + dependency map).
- Reliability guardrails with auto-fallback.
Locked defaults:
- KPI profile: `balanced`
- Rollout strategy: `staged`
- Fallback policy: blend first, then hard revert if degradation persists
Target acceptance thresholds (vs current baseline on fixed fixture pack + holdout):
- `nDCG@10 >= +8%`, `MRR@10 >= +6%`, `Recall@50 >= +10%`
- `p50 latency <= -20%`, `p95 latency <= -25%` on large dataset
- Error rate must not regress; fallback activation must be correct and observable
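The thresholds above can be checked mechanically once baseline and candidate numbers exist. The following is a minimal sketch, assuming the percentages are relative changes against baseline; `KpiSnapshot`, `relativeDelta`, and `evaluateThresholds` are hypothetical names, not part of the existing report scripts.

```typescript
// Hypothetical snapshot shape; the real report schema is not shown in this plan.
interface KpiSnapshot {
  ndcgAt10: number;
  mrrAt10: number;
  recallAt50: number;
  p50LatencyMs: number;
  p95LatencyMs: number;
  errorRate: number;
}

// Relative change of candidate vs baseline, e.g. 0.08 means +8%.
function relativeDelta(baseline: number, candidate: number): number {
  return (candidate - baseline) / baseline;
}

// Returns the labels of KPIs that miss their threshold (empty array means pass).
function evaluateThresholds(baseline: KpiSnapshot, candidate: KpiSnapshot): string[] {
  const failures: string[] = [];
  const check = (label: string, ok: boolean): void => {
    if (!ok) failures.push(label);
  };
  check("nDCG@10", relativeDelta(baseline.ndcgAt10, candidate.ndcgAt10) >= 0.08);
  check("MRR@10", relativeDelta(baseline.mrrAt10, candidate.mrrAt10) >= 0.06);
  check("Recall@50", relativeDelta(baseline.recallAt50, candidate.recallAt50) >= 0.10);
  // Latency targets are reductions, so the relative delta must be negative enough.
  check("p50 latency", relativeDelta(baseline.p50LatencyMs, candidate.p50LatencyMs) <= -0.20);
  check("p95 latency", relativeDelta(baseline.p95LatencyMs, candidate.p95LatencyMs) <= -0.25);
  check("error rate", candidate.errorRate <= baseline.errorRate);
  return failures;
}
```

Returning the failing labels rather than a single boolean keeps the gate output observable, which matches the requirement that regressions be attributable to a specific KPI.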
- No breaking MCP contract changes.
- Additive-only response metadata (behind flags): `query_mode`, `hybrid_components`, `quality_guard_state`, `fallback_state`.
- New feature flags (default OFF): `CE_RETRIEVAL_HYBRID_V1`, `CE_RETRIEVAL_RANKING_V3`, `CE_CONTEXT_PACKS_V2`, `CE_RETRIEVAL_QUALITY_GUARD_V1`.
- Keep existing flags/contracts operational so flags-off behavior remains baseline-equivalent.
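One way to keep flags-off output baseline-equivalent is to attach the new metadata only when a flag is explicitly enabled. The sketch below uses the flag and field names from this plan; the parsing rule (`"1"` or `"true"`), the mapping of fields to flags, the string value types, and the `buildSearchPayload` helper are assumptions for illustration, not the contents of `src/config/features.ts`.

```typescript
// Flag and field names come from the plan; everything else is a hypothetical sketch.
type RetrievalFlag =
  | "CE_RETRIEVAL_HYBRID_V1"
  | "CE_RETRIEVAL_RANKING_V3"
  | "CE_CONTEXT_PACKS_V2"
  | "CE_RETRIEVAL_QUALITY_GUARD_V1";

type FlagEnv = Record<string, string | undefined>;

// Default OFF: a flag is on only when explicitly set to "1" or "true".
function isFlagOn(env: FlagEnv, flag: RetrievalFlag): boolean {
  const raw = (env[flag] ?? "").trim().toLowerCase();
  return raw === "1" || raw === "true";
}

interface RetrievalSignals {
  query_mode: string;
  hybrid_components: string[];
  quality_guard_state: string;
  fallback_state: string;
}

interface SearchPayload {
  results: string[];
  metadata?: Partial<RetrievalSignals>;
}

// Assumed mapping: hybrid fields follow the hybrid flag, guard fields follow the guard flag.
function buildSearchPayload(env: FlagEnv, results: string[], signals: RetrievalSignals): SearchPayload {
  const metadata: Partial<RetrievalSignals> = {};
  if (isFlagOn(env, "CE_RETRIEVAL_HYBRID_V1")) {
    metadata.query_mode = signals.query_mode;
    metadata.hybrid_components = signals.hybrid_components;
  }
  if (isFlagOn(env, "CE_RETRIEVAL_QUALITY_GUARD_V1")) {
    metadata.quality_guard_state = signals.quality_guard_state;
    metadata.fallback_state = signals.fallback_state;
  }
  // With every flag off, the payload has exactly the baseline shape (no metadata key).
  return Object.keys(metadata).length > 0 ? { results, metadata } : { results };
}
```

Omitting the `metadata` key entirely, rather than sending an empty object, is what keeps existing tool snapshots byte-for-byte stable with flags off.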
T1:
- depends_on: []
- location: `docs/BENCHMARKING.md`, `config/ci/retrieval-quality-fixture-pack.json`, `scripts/ci/generate-retrieval-quality-report.ts`
- description: Freeze baseline dataset/splits, query sets, KPI formulas, and pass/fail thresholds.
- validation: Baseline artifacts generated and reproducible (quality report + telemetry + latency snapshots).
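Freezing the KPI formulas means pinning one definition per metric. The sketch below shows the standard definitions of nDCG@k, reciprocal rank at k (averaged over queries to give MRR@k), and Recall@k. It assumes linear gain for nDCG and a relevance map keyed by result id; the function names are hypothetical and the formulas actually frozen in `docs/BENCHMARKING.md` may differ.

```typescript
// relevance maps a result id to a graded relevance; 0 or missing means not relevant.
type RelevanceMap = Record<string, number>;

// Discounted cumulative gain over the first k gains, with linear gain and log2 discount.
function dcgAtK(gains: number[], k: number): number {
  return gains.slice(0, k).reduce((sum, gain, i) => sum + gain / Math.log2(i + 2), 0);
}

// nDCG@k: DCG of the actual ranking divided by DCG of the ideal ranking.
function ndcgAtK(ranked: string[], relevance: RelevanceMap, k: number): number {
  const gains = ranked.map((id) => relevance[id] ?? 0);
  const ideal = Object.keys(relevance)
    .map((id) => relevance[id] ?? 0)
    .sort((a, b) => b - a);
  const idealDcg = dcgAtK(ideal, k);
  return idealDcg === 0 ? 0 : dcgAtK(gains, k) / idealDcg;
}

// Reciprocal rank of the first relevant result within the top k (0 if none).
function reciprocalRankAtK(ranked: string[], relevance: RelevanceMap, k: number): number {
  const idx = ranked.slice(0, k).findIndex((id) => (relevance[id] ?? 0) > 0);
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// Recall@k: fraction of all relevant ids that appear within the top k.
function recallAtK(ranked: string[], relevance: RelevanceMap, k: number): number {
  const relevantIds = Object.keys(relevance).filter((id) => (relevance[id] ?? 0) > 0);
  if (relevantIds.length === 0) return 0;
  const topK = new Set(ranked.slice(0, k));
  return relevantIds.filter((id) => topK.has(id)).length / relevantIds.length;
}
```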
T2:
- depends_on: []
- location: `src/internal/retrieval/*`, `src/mcp/tools/*`, `src/config/features.ts`, `docs/ROLLOUT_RUNBOOK.md`
- description: Finalize affected components, risk map, and flag/rollback matrix before code changes.
- validation: Written component map with high-risk hot paths and explicit rollback controls.
T3:
- depends_on: [T1, T2]
- location: `src/internal/retrieval/retrieve.ts`, `src/internal/retrieval/expandQuery.ts`, `src/internal/retrieval/*`
- description: Build bounded hybrid candidate fan-out, query routing, and merge scaffolding.
- validation: Unit tests for fan-out caps, deterministic merge order, and query-mode composition.
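A bounded fan-out with a deterministic merge could look like the sketch below. It caps each source's candidate list, merges by path keeping the best score, and breaks ties by path so the order does not depend on which source returned first. The `Candidate` shape, the max-score merge rule, and the function names are assumptions, not the code in `retrieve.ts`.

```typescript
type RetrievalSource = "semantic" | "keyword" | "symbol";

interface Candidate { path: string; score: number; source: RetrievalSource }
interface MergedCandidate { path: string; score: number; sources: RetrievalSource[] }

// Plain code-unit comparison; avoids locale-dependent ordering.
function comparePaths(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Keep at most `cap` candidates from one source, best score first, ties by path.
function capFanOut(candidates: Candidate[], cap: number): Candidate[] {
  return [...candidates]
    .sort((a, b) => b.score - a.score || comparePaths(a.path, b.path))
    .slice(0, cap);
}

// Merge capped per-source lists; a path seen by several sources keeps its max score.
function mergeCandidates(perSource: Candidate[][], capPerSource: number): MergedCandidate[] {
  const byPath = new Map<string, MergedCandidate>();
  for (const list of perSource) {
    for (const c of capFanOut(list, capPerSource)) {
      const existing = byPath.get(c.path);
      if (!existing) {
        byPath.set(c.path, { path: c.path, score: c.score, sources: [c.source] });
      } else {
        existing.score = Math.max(existing.score, c.score);
        if (existing.sources.indexOf(c.source) === -1) existing.sources.push(c.source);
      }
    }
  }
  return Array.from(byPath.values())
    .map((m) => ({ ...m, sources: [...m.sources].sort() }))
    .sort((a, b) => b.score - a.score || comparePaths(a.path, b.path));
}
```

Because both the per-source cap and the final sort use the same tie-break, swapping the order of the input lists yields the same output, which is the property the unit tests for deterministic merge order would need.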
T4:
- depends_on: [T3]
- location: `src/internal/retrieval/rerank.ts`, `src/internal/retrieval/fusion.ts`, `src/internal/retrieval/dedupe.ts`
- description: Add stronger ranking signals (including symbol-aware boosts) with calibrated weights and normalization.
- validation: Query-family evaluation shows no top-k relevance regression, with flag-off behavior at baseline parity.
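To illustrate what a symbol-aware boost over normalized scores might involve, the sketch below min-max normalizes raw scores and adds a fixed bonus when a hit declares a symbol that matches a query term. Min-max normalization, the additive boost, the default weight of 0.25, and all names here are assumptions chosen for illustration; the calibrated weights in `rerank.ts` are not shown in this plan.

```typescript
interface ScoredHit { path: string; symbols: string[]; rawScore: number }
interface RankedHit { path: string; score: number }

// Min-max normalize into [0, 1]; a list with identical scores maps to all 1s.
function normalizeScores(scores: number[]): number[] {
  if (scores.length === 0) return [];
  const min = Math.min(...scores);
  const max = Math.max(...scores);
  return scores.map((s) => (max === min ? 1 : (s - min) / (max - min)));
}

// Add `symbolBoost` to hits whose declared symbols match any query term (case-insensitive).
function rerankWithSymbolBoost(hits: ScoredHit[], queryTerms: string[], symbolBoost = 0.25): RankedHit[] {
  const normalized = normalizeScores(hits.map((h) => h.rawScore));
  const wanted = new Set(queryTerms.map((t) => t.toLowerCase()));
  return hits
    .map((hit, i) => {
      const matchesSymbol = hit.symbols.some((s) => wanted.has(s.toLowerCase()));
      return { path: hit.path, score: (normalized[i] ?? 0) + (matchesSymbol ? symbolBoost : 0) };
    })
    // Ties fall back to path order so results stay deterministic.
    .sort((a, b) => b.score - a.score || (a.path < b.path ? -1 : a.path > b.path ? 1 : 0));
}
```

Normalizing before boosting keeps the boost comparable across retrieval sources whose raw scores live on different scales.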
T5:
- depends_on: [T1, T2]
- location: `src/internal/retrieval/retrieve.ts`, `src/internal/handlers/performance.ts`, cache/reuse hooks
- description: Reduce repeated work (memo/cache reuse), add bounded rerank budgets, and early exits.
- validation: Large-set benchmark confirms p50/p95 targets without correctness regressions.
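The three latency levers named in T5 can be sketched in a few lines: a small memo cache for repeated per-query work, a hard cap on how many candidates get rescored, and an early exit once a candidate is good enough. The helper names, the oldest-entry eviction rule, and the `goodEnough` cutoff are hypothetical; the plan does not specify how the real cache hooks or budgets work.

```typescript
// Memoize an expensive per-key step so repeated queries reuse earlier work.
function memoize<T>(compute: (key: string) => T, maxEntries: number): (key: string) => T {
  const store = new Map<string, T>();
  return (key: string): T => {
    if (store.has(key)) return store.get(key) as T;
    const value = compute(key);
    if (store.size >= maxEntries) {
      // Map preserves insertion order, so the first key is the oldest entry.
      const oldest = store.keys().next().value;
      if (oldest !== undefined) store.delete(oldest);
    }
    store.set(key, value);
    return value;
  };
}

// Score at most `budget` candidates, and stop early once one reaches `goodEnough`.
function rerankWithinBudget<T>(
  candidates: T[],
  score: (candidate: T) => number,
  budget: number,
  goodEnough: number,
): { item: T; score: number }[] {
  const scored: { item: T; score: number }[] = [];
  for (const candidate of candidates.slice(0, budget)) {
    const s = score(candidate);
    scored.push({ item: candidate, score: s });
    if (s >= goodEnough) break; // early exit: a confident hit was found
  }
  return scored.sort((a, b) => b.score - a.score);
}
```

Both helpers trade a little recall for bounded work, so any real version would need the correctness checks that the T5 validation calls for.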
T6:
- depends_on: [T3, T4]
- location: `src/mcp/serviceClient.ts`, `src/internal/handlers/context.ts`, `src/mcp/tools/context.ts`
- description: Return "best files + concise why selected + dependency map" with strict size budget.
- validation: Integration tests validate stable structure and capped response size.
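A context pack with a strict size budget might be assembled as in the sketch below: take ranked entries until the budget is exhausted, then emit a dependency map restricted to the files that made it in. The `PackEntry` and `ContextPack` shapes, the character-count budget, and the choice to drop edges to excluded files are all assumptions; the v2 data model itself is not reproduced in this plan.

```typescript
interface PackEntry { path: string; why: string; dependsOn: string[] }

interface ContextPack {
  files: PackEntry[];
  dependencyMap: Record<string, string[]>;
  truncated: boolean;
}

// Entries are assumed to arrive best-first; the budget counts serialized entry size.
function buildContextPack(ranked: PackEntry[], maxChars: number): ContextPack {
  const files: PackEntry[] = [];
  let used = 0;
  let truncated = false;
  for (const entry of ranked) {
    const cost = JSON.stringify(entry).length;
    if (used + cost > maxChars) {
      truncated = true; // signal that lower-ranked files were dropped
      break;
    }
    files.push(entry);
    used += cost;
  }
  // Only keep dependency edges that point at files present in the pack.
  const included = new Set(files.map((f) => f.path));
  const dependencyMap: Record<string, string[]> = {};
  for (const f of files) {
    dependencyMap[f.path] = f.dependsOn.filter((dep) => included.has(dep));
  }
  return { files, dependencyMap, truncated };
}
```

Exposing `truncated` gives integration tests a stable field to assert on when checking that the cap was applied.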
T7:
- depends_on: [T4, T5]
- location: retrieval/handler guard modules + `src/config/features.ts`
- description: Compute rolling quality score; apply blend fallback on first degradation; hard revert on sustained failure.
- validation: Fault-injection tests prove trigger thresholds, fallback transitions, and recovery behavior.
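The blend-first, then hard-revert policy can be modeled as a small state machine over a rolling quality score. The sketch below is one possible reading: a rolling mean over the last few scores, `blend` on the first degraded reading, `reverted` after a configurable number of consecutive degraded readings, and a sticky revert that only an explicit reset clears. The window size, thresholds, sticky-revert rule, and all names are assumptions, not the behavior of the actual guard modules.

```typescript
type GuardState = "normal" | "blend" | "reverted";

interface QualityGuard {
  record(score: number): GuardState;
  reset(): void;
}

// windowSize: scores kept for the rolling mean; minQuality: degradation threshold;
// revertAfter: consecutive degraded readings that trigger a hard revert.
function createQualityGuard(windowSize: number, minQuality: number, revertAfter: number): QualityGuard {
  let scores: number[] = [];
  let degradedStreak = 0;
  let state: GuardState = "normal";

  return {
    record(score: number): GuardState {
      scores.push(score);
      if (scores.length > windowSize) scores.shift();
      // Assumption: a hard revert is sticky until an operator resets the guard.
      if (state === "reverted") return state;
      const rolling = scores.reduce((a, b) => a + b, 0) / scores.length;
      if (rolling < minQuality) {
        degradedStreak += 1;
        state = degradedStreak >= revertAfter ? "reverted" : "blend";
      } else {
        // Quality recovered: leave blend mode and clear the streak.
        degradedStreak = 0;
        state = "normal";
      }
      return state;
    },
    reset(): void {
      scores = [];
      degradedStreak = 0;
      state = "normal";
    },
  };
}
```

A fault-injection test can then feed a scripted score sequence and assert the exact transition at each step, which covers trigger thresholds, fallback transitions, and recovery in one table-driven case.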
T8:
- depends_on: [T6, T7]
- location: `src/mcp/tools/search.ts`, `src/mcp/tools/codebaseRetrieval.ts`, `src/mcp/tools/enhance.ts`, observability modules
- description: Wire hybrid/ranking/context/guard signals into tool outputs and metrics.
- validation: Tool snapshots remain backward compatible; new fields are additive and flag-gated.
T9:
- depends_on: [T3, T4, T5, T6, T7, T8]
- location: `tests/internal/retrieval/*`, `tests/tools/*`, `tests/ci/*`, `scripts/ci/*`
- description: Add coverage for hybrid merge, ranking boosts, latency guardrails, fallback policy, flags on/off.
- validation: CI gates all green, including quality gate and stale-cache/config-precedence checks.
T10:
- depends_on: [T9]
- location: `artifacts/bench/*`, `scripts/ci/generate-retrieval-quality-telemetry.ts`, reporting docs
- description: Produce before/after evidence for small/medium/large and holdout sets.
- validation: Signed-off evidence bundle with raw numbers and deltas for relevance/latency/reliability.
T11:
- depends_on: [T10]
- location: `docs/ROLLOUT_RUNBOOK.md`, `docs/BENCHMARKING_GATES.md`, release readiness docs
- description: Ship staged rollout plan, guard thresholds, operator playbook, and rollback checklist.
- validation: Readiness checks pass and rollback drill documented with command-level evidence.
| Wave | Tasks | Can Start When |
|---|---|---|
| 1 | T1, T2 | Immediately |
| 2 | T3, T5 | T1 + T2 complete |
| 3 | T4 | T3 complete |
| 4 | T6, T7 | T6: T3 + T4; T7: T4 + T5 |
| 5 | T8 | T6 + T7 complete |
| 6 | T9 | T3 + T4 + T5 + T6 + T7 + T8 complete |
| 7 | T10 | T9 complete |
| 8 | T11 | T10 complete |
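The wave assignment follows directly from the `depends_on` lists: a task joins the earliest wave in which all of its dependencies are already complete. As a sketch of that derivation, the hypothetical `computeWaves` helper below layers the dependency graph and reproduces the eight waves in the table.

```typescript
// Group tasks into waves: each wave holds every task whose dependencies are all done.
function computeWaves(deps: Record<string, string[]>): string[][] {
  const remaining = new Set(Object.keys(deps));
  const done = new Set<string>();
  const waves: string[][] = [];
  while (remaining.size > 0) {
    const ready = Array.from(remaining).filter((task) =>
      (deps[task] ?? []).every((dep) => done.has(dep)),
    );
    if (ready.length === 0) {
      throw new Error("Dependency cycle or unknown dependency");
    }
    for (const task of ready) {
      remaining.delete(task);
      done.add(task);
    }
    waves.push(ready);
  }
  return waves;
}

// The depends_on lists from the task definitions in this plan.
const planDeps: Record<string, string[]> = {
  T1: [],
  T2: [],
  T3: ["T1", "T2"],
  T4: ["T3"],
  T5: ["T1", "T2"],
  T6: ["T3", "T4"],
  T7: ["T4", "T5"],
  T8: ["T6", "T7"],
  T9: ["T3", "T4", "T5", "T6", "T7", "T8"],
  T10: ["T9"],
  T11: ["T10"],
};

const planWaves = computeWaves(planDeps);
// planWaves[0] is ["T1", "T2"], planWaves[1] is ["T3", "T5"], and so on through ["T11"].
```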
- Core checks:
  - `npm run test`
  - `npm run build`
  - `npm run ci:check:retrieval-quality-gate`
  - `npm run ci:check:retrieval-holdout-fixture`
  - `npm run ci:check:stale-cache-guards`
  - `npm run ci:check:retrieval-config-precedence`
- Benchmark checks:
  - `npm run bench:ci:pr`
  - `npm run ci:generate:retrieval-quality-report`
  - `npm run ci:generate:retrieval-quality-telemetry`
- Rollout-safety checks:
  - `npm run ci:check:retrieval-shadow-canary-gate`
- Release only if all five outcomes pass thresholds and flags-off baseline remains intact.
- Existing semantic retrieval/index primitives are extended, not replaced.
- Existing fixture pack + holdout remain source-of-truth for comparison.
- Additive metadata/flags are acceptable; no breaking MCP tool schema changes.
- If thresholds are missed, rollout stays staged and defaults remain baseline-safe.
| Task | Status | Work Log | Files |
|---|---|---|---|
| T1 | Completed | KPI thresholds and baseline contract aligned in benchmarking + fixture pack. | `docs/BENCHMARKING.md`, `docs/BENCHMARKING_GATES.md`, `config/ci/retrieval-quality-fixture-pack.json` |
| T2 | Completed | Added impact map and rollout safety envelope/rollback matrix for new flags. | `docs/RETRIEVAL_IMPACT_MAP.md`, `docs/ROLLOUT_RUNBOOK.md`, `docs/FLAG_REGISTRY.md` |
| T3 | Completed | Hybrid retrieval foundations confirmed (semantic + lexical/dense fusion already active) with metadata surfacing in tool outputs. | `src/mcp/tools/search.ts`, `src/mcp/tools/codebaseRetrieval.ts` |
| T4 | Completed | Added ranking v3 path with stronger path/symbol-aware prioritization behind flag. | `src/internal/retrieval/types.ts`, `src/internal/retrieval/retrieve.ts`, `src/internal/retrieval/rerank.ts` |
| T5 | Completed | Added per-variant parallel fan-out across semantic/keyword/dense retrieval calls to reduce serialized latency. | `src/internal/retrieval/retrieve.ts` |
| T6 | Completed | Context pack v2 data model + rendering sections (why selected, dependency map) behind flag. | `src/mcp/serviceClient.ts`, `src/mcp/tools/context.ts` |
| T7 | Completed | Added quality-guard blend fallback trigger on weak/empty result quality and surfaced active/inactive fallback state end-to-end. | `src/internal/handlers/retrieval.ts`, `src/internal/handlers/types.ts`, `src/mcp/tools/search.ts`, `src/mcp/tools/codebaseRetrieval.ts` |
| T8 | Completed | Additive metadata fields wired across search/codebase retrieval outputs. | `src/mcp/tools/search.ts`, `src/mcp/tools/codebaseRetrieval.ts` |
| T9 | Completed | Expanded targeted tests for metadata and ranking v3 behavior. | `tests/tools/search.test.ts`, `tests/tools/codebaseRetrieval.test.ts`, `tests/internal/retrieval/retrieve.test.ts` |
| T10 | Completed | Deterministic quality gates and benchmark evidence regenerated; bench suite hardened with timeout + fallback behavior so `bench:ci:pr` no longer hangs on stalled retrieve probes. | `artifacts/bench/retrieval-quality-*.json`, `artifacts/bench/retrieval-shadow-canary-gate.json`, `artifacts/bench/retrieval-holdout-check.json`, `artifacts/bench/pr-baseline.json`, `artifacts/bench/pr-candidate.json`, `scripts/ci/run-bench-suite.ts` |
| T11 | Completed | Runbook and flag registry finalized for staged rollout and rollback handling. | `docs/ROLLOUT_RUNBOOK.md`, `docs/FLAG_REGISTRY.md` |