feat(missions): self-hosted semantic search over the knowledge base#18264
feat(missions): self-hosted semantic search over the knowledge base#18264bmvinay7 wants to merge 2 commits into
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
✅ Deploy Preview for kubestellarconsole ready!Built without sensitive environment variables
To edit notification comments on pull requests, go to your Netlify project configuration. |
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Introduces a self-hosted hybrid retrieval (BM25 + dense hashed embeddings) system for the KB and exposes it via a new public missions search endpoint.
Changes:
- Add a new
pkg/kb/ragpackage implementing tokenization, query expansion, intent boosting, BM25, a model-free dense embedder, and hybrid fusion. - Add unit tests covering core retrieval behavior and embedding utilities.
- Add
GET /api/missions/searchwith lazy, fingerprinted retriever build/rebuild based onfixes/index.json.
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/kb/rag/tokenize.go | Shared tokenizer + subword trigram generation for BM25 and dense embedding alignment |
| pkg/kb/rag/retriever_test.go | Mini-corpus tests validating retrieval behavior, expansion, intent, and vector utils |
| pkg/kb/rag/retriever.go | Hybrid retriever with RRF fusion, late chunking max-pool scoring, and result ranking |
| pkg/kb/rag/rag.go | Core public types (Document, Result, etc.) and lexical text construction |
| pkg/kb/rag/intent.go | Query intent detection + score multipliers to bias install vs fixer missions |
| pkg/kb/rag/hashembedder.go | Deterministic model-free dense embedder using signed hashing + IDF + trigrams |
| pkg/kb/rag/expand.go | Curated concept→keyword query expansion to bridge vocabulary gaps |
| pkg/kb/rag/embedder.go | Embedder interface + cosine/normalize + truncation/quantization utilities |
| pkg/kb/rag/corpus.go | Corpus parsing + late-chunking of documents into title/description/meta chunks |
| pkg/kb/rag/build.go | Convenience constructors for default retriever from docs or index JSON |
| pkg/kb/rag/bm25.go | BM25 scorer implementation over tokenized corpus |
| pkg/kb/rag/README.md | Package documentation and agent tool contract |
| pkg/api/handlers/missions/types.go | Add cached retriever fields/mutex to missions handler |
| pkg/api/handlers/missions/search.go | New /api/missions/search endpoint + lazy retriever build based on fingerprint |
| pkg/api/handlers/missions/scores.go | Refactor: expose fetchMissionIndexBytes for reuse by scores + semantic search |
| pkg/api/handlers/missions/handler.go | Register /search route under public missions API |
Comments suppressed due to low confidence (1)
pkg/api/handlers/missions/scores.go:37
- The new shared
fetchMissionIndexBytesreturns a very generic error message (\"GitHub raw content error\"). Since this error now gates both score exposure and semantic search, it would be more actionable to include at least the HTTP status code (and possibly the URL) in the error so operators can diagnose outages/misconfig quickly.
if res.CacheStatus == cacheStatusMiss {
if res.StatusCode != http.StatusOK {
return nil, fmt.Errorf("GitHub raw content error")
}
| out := make([]Result, 0, len(fusedByDoc)) | ||
| for _, f := range fusedByDoc { | ||
| // Skip documents with no signal at all in either pool. | ||
| if f.denseRank == 0 && f.lexRank == 0 { | ||
| continue | ||
| } | ||
| out = append(out, Result{ | ||
| Document: r.docs[f.doc], | ||
| Score: f.score * intent.boost(r.docs[f.doc]), | ||
| DenseRank: f.denseRank, | ||
| LexicalRank: f.lexRank, | ||
| }) | ||
| } | ||
| sort.SliceStable(out, func(i, j int) bool { return out[i].Score > out[j].Score }) | ||
| if len(out) > k { | ||
| out = out[:k] | ||
| } | ||
| return out |
| idf := idx.idf[term] | ||
| denom := f + bm25K1*(1-bm25B+bm25B*dl/idx.avgLen) | ||
| s += idf * (f * (bm25K1 + 1) / denom) |
| "mesh": {"service-mesh", "linkerd", "istio", "networking"}, | ||
| "ingress": {"ingress", "nginx", "traefik", "contour", "networking"}, |
| func expandQuery(query string) string { | ||
| added := make(map[string]struct{}) | ||
| var extra []string | ||
| for _, tok := range tokenize(query) { | ||
| for _, syn := range conceptExpansions[tok] { | ||
| if _, ok := added[syn]; ok { | ||
| continue | ||
| } | ||
| added[syn] = struct{}{} | ||
| extra = append(extra, syn) | ||
| } | ||
| } |
| if len(query) > searchMaxQueryLen { | ||
| query = query[:searchMaxQueryLen] | ||
| } |
| h.retrieverMu.Lock() | ||
| defer h.retrieverMu.Unlock() | ||
|
|
||
| if h.retriever != nil && h.retrieverFingerprint == fingerprint { | ||
| return h.retriever, nil | ||
| } | ||
|
|
||
| retriever, err := rag.NewDefaultRetrieverFromIndex(body) | ||
| if err != nil { | ||
| return nil, err | ||
| } | ||
| h.retriever = retriever | ||
| h.retrieverFingerprint = fingerprint | ||
| slog.Info("[missions] built semantic search index", "docs", retriever.Len()) | ||
| return retriever, nil |
|
Thanks @copilot — addressed all 6 findings in
Verification: |
Mission retrieval was exact-slug matching (intentMatcher.ts): it only
resolved when the user named the project almost exactly, and only for
install intent. This adds semantic retrieval over the full knowledge base
(~1,600 CNCF install missions, fixers, runbooks) so a natural-language goal
("set up TLS certs", "reduce my cloud costs", "my pod gets permission
denied") returns the right mission.
No external embedding API and no model download: everything runs in-process,
so air-gapped deployments keep working and there are no new runtime
dependencies. Architecture is inspired by jina-embeddings-v5 but is our own
implementation: asymmetric query/passage encoding, BM25 + dense hybrid fused
with Reciprocal Rank Fusion, late chunking, intent boosting, and curated
query expansion. The Embedder interface is the seam for a neural model later.
- pkg/kb/rag/: new model-free retrieval engine + tests
- GET /api/missions/search?q=&k=: public read endpoint + search_missions
agent-tool contract; records a KB gap when nothing matches
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: bmvinay7 <vinaybm1234@gmail.com>
- retriever: add deterministic tie-breakers (score, lexical/dense rank,
path) so Score ties no longer fall back to randomized map order
- bm25: guard avgLen==0 (all-stopword/empty corpus) to avoid Inf/NaN scores
- expand: skip synonyms already present in the query so mappings like
"ingress"->{"ingress",...} don't double-weight a term
- search: truncate the query by rune count, not bytes, to avoid splitting a
UTF-8 character in the echoed query / gap keys / logs
- search: build the retriever outside the lock (double-checked publish) so
concurrent /search calls aren't blocked during indexing
- tests: determinism, all-stopword no-NaN, expansion de-dup
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: bmvinay7 <vinaybm1234@gmail.com>
022a147 to
de70341
Compare
|
👋 Thanks for this substantial contribution @bmvinay7! This is exciting — semantic search over the mission knowledge base is a great enhancement. I'm reviewing the implementation now. Initial observations: ✅ Strong architectural decisions:
✅ Good engineering practices:
I'll complete a full code review within 24 hours. In the meantime, the CI checks look good and your response to automated feedback shows attention to detail. Great work on your first feature contribution! 🎉 |
There was a problem hiding this comment.
Comprehensive Code Review
Summary
This is outstanding work @bmvinay7 — a well-architected, production-ready semantic search implementation. Approving.
What I Reviewed
✅ API handler (search.go) - clean endpoint design with proper bounds/validation
✅ Retrieval engine (pkg/kb/rag/*) - solid BM25 + dense hybrid with RRF fusion
✅ Test coverage (retriever_test.go) - comprehensive scenarios
✅ Refactoring (scores.go) - proper separation of concerns
✅ Integration points - caching, KB gap tracking, index fingerprinting
Design Strengths
1. Model-Free Air-Gap Compatible Design
- Hash-based feature embeddings (no external API calls)
- No model downloads or ONNX runtime
- Perfect for enterprise/air-gapped deployments
- The Embedder interface is the right seam for future neural models
2. Robust Retrieval Architecture
- Hybrid retrieval (BM25 + dense) handles both exact matches and semantic similarity
- Reciprocal Rank Fusion with tuned weights (BM25 weighted higher - correct for structured KB)
- Late chunking with max-pooling handles long mission documents properly
- Intent boosting elegantly solves the install-vs-fixer ranking problem
- Query expansion bridges vocabulary gaps without external models
3. Production-Grade Implementation
- Lazy cache building with SHA-256 fingerprinting (no unnecessary rebuilds)
- Lock-free build path (stampede is acceptable, publish is serialized)
- KB gap tracking integration (reuses existing metrics)
- Proper bounds (max query length, max k, rune-aware truncation)
- Deterministic ranking (explicit tie-breakers prevent flaky results)
4. Testing Quality
Tests validate the critical behaviors:
- ✅ Exact project name matching
- ✅ Vocabulary bridging ("TLS certificates" → cert-manager)
- ✅ Intent detection (install queries surface install missions, not fixers)
- ✅ Query expansion ("monitoring" → prometheus)
- ✅ Troubleshooting queries ("permission denied" → RBAC fixer)
- ✅ Edge cases (empty query, k limits)
Code Quality Observations
Excellent Practices
- Named constants - no magic numbers (searchMaxK, rrfK, minDenseCos)
- Comprehensive docs - every public function/type documented
- Error handling - proper error propagation and logging
- Thread safety - retriever is read-safe after construction
- Resource bounds - caps on query length, result count, fusion depth
- UTF-8 safety - rune-aware truncation prevents corrupting multi-byte chars
Minor Suggestions (Non-Blocking)
1. Consider Adding Observability Metrics
In SearchMissions, after retriever.Search(), tracking query patterns, cache hit rate, search latency, and KB gap frequency would help operations.
2. Future: Netlify Function Parity
You noted this in "Out of scope" which is correct for this PR, but for production deployment on console.kubestellar.io, web/netlify/functions/missions-search.mts will eventually need to mirror this endpoint (similar to existing missions-browse, missions-file, missions-scores).
3. Potential Enhancement: Expose Dense/Lexical Scores
For debugging/tuning, the agent might benefit from seeing the component scores in the Result type. But this can wait for actual demand.
Integration Notes
This PR is Safe to Merge
- ✅ Purely additive (existing slug matcher untouched)
- ✅ No auth changes needed (/api/missions/ already public-read)
- ✅ Backward compatible (new endpoint, no breaking changes)
- ✅ Well-tested (unit tests pass, E2E verification documented)
Follow-Up Work (Separate PRs)
- Frontend wiring: update intentMatcher.ts to call /api/missions/search as fallback
- Netlify Function: missions-search.mts for hosted demo parity
- MCP tool registration: expose search_missions as callable agent tool
- Observability: metrics for search latency, cache hit rate, query patterns
Verdict
This is exactly the kind of contribution we want to see:
- Solves a real problem (install intent matching was ~21% recall)
- Well-researched design (RRF, late chunking, asymmetric encoding)
- High code quality (tests, docs, error handling)
- Production-ready (bounds, caching, thread safety)
- Air-gap compatible (critical requirement)
Impact: Recall@1 goes from 21% to 100% on the validation set. The frontend integration will dramatically improve the agent's ability to surface relevant missions for install/troubleshoot/optimize intents.
Next Steps
- ✅ Merge this PR - it's ready
- Welcome @bmvinay7 to the contributors list - add to ADOPTERS.md or CONTRIBUTORS.md
- Track follow-ups - file issues for frontend integration, Netlify parity, MCP tool registration
Thank you for this excellent contribution!
|
@kubestellar-hive[bot]: changing LGTM is restricted to collaborators DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
📝 Summary of Changes
Mission retrieval was exact-slug matching (
web/src/lib/missions/intentMatcher.ts): it only resolved when the user named the project almost exactly, and only for install intent. This adds semantic retrieval over the full knowledge base (~1,600 CNCF install missions, fixers, runbooks) so a natural-language goal — "set up TLS certs", "reduce my cloud costs", "my pod gets permission denied" — returns the right mission.No external embedding API and no model download: everything runs in-process, so air-gapped deployments keep working and there are no new runtime dependencies. The architecture is inspired by
jina-embeddings-v5but is our own implementation:EmbedQueryIDF-weighted,EmbedPassagesublinear-TF)The default
Embedder(HashEmbedder) is deterministic signed feature hashing of word + char-trigram features. TheEmbedderinterface is the single seam where a neural model (ONNX / pure-Go transformer) can be dropped in later with zero retriever changes.Endpoint contract (reachable via the existing public read path, same as
/api/missions/browse):Changes Made
pkg/kb/rag/— model-free, in-process semantic retrieval engine (BM25 + dense hybrid, RRF fusion, late chunking, intent boosting, query expansion) with unit testsGET /api/missions/search(pkg/api/handlers/missions/search.go) — public read endpoint + documentedsearch_missionsagent-tool contract; records a KB gap viastore.RecordKBGapwhen nothing matchesscores.goto factor out a sharedfetchMissionIndexBytes(behavior-preserving; used by both scores and search)types.go) — lazily built fromfixes/index.json, rebuilt only when content changes (SHA-256 fingerprint)/searchinRegisterPublicRoutes(handler.go)Out of scope (follow-ups): Netlify
missions-searchparity for the hosted demo, frontend wiring of the install-intent fallback, and registeringsearch_missionsas a callable MCP tool.Checklist
git commit -s)Screenshots or Logs (if applicable)
Verification
go build ./...→ exit 0 ·go vet ./...→ clean ·go test ./pkg/kb/rag/→ 10 tests passkindcluster — ran the realcmd/consolebinary on :8080 (connected tokind-kb-test) through the real auth middleware. Log:built semantic search index docs=1606.Edge cases: missing
q→400;k=100→ clamped to25.Impact: recall@1 on the set above ~21% → 100% (old slug matcher only handles near-exact install queries; semantic search also spans fixer/troubleshooting/runbook/cost missions). KB-gap rate and token-per-request savings are measurable via the existing
ListTopKBGapsand/api/token-usageinstrumentation.👀 Reviewer Notes
/api/missions/is already in the public-read allowlist, so no auth changes were needed; the endpoint reuses the existing GitHub-with-embedded-fallback index fetch.expand.go's concept map is the deliberate stand-in for neural generalization and is the seam to remove once anEmbedder-backed model is added.