feat: Zero-VDB GitHub-native + BM25 search backend #121
📝 Walkthrough
This PR implements a zero-infrastructure native search backend for issue/PR similarity detection, replacing mandatory Qdrant and embedding API dependencies with GitHub's hybrid search API plus local BM25 re-ranking. All operations use only the standard GitHub Token, eliminating external vector database and embedding service requirements.
Changes: GitHub Native Search Backend with BM25 Fallback
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~65 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Add SearchConfig struct with Backend ("qdrant"|"github_native"|"bm25")
and BM25Fallback fields. Default backend is "qdrant" for backward compat.
Validate() now skips Qdrant/embedding key checks for non-qdrant backends.
mergeConfigs() propagates Search fields from child config.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Kavirubc <hapuarachchikaviru@gmail.com>
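The backend-dependent validation described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the reduced `Config` layout and its `QdrantURL` / `EmbeddingAPIKey` fields are hypothetical stand-ins, and the rejection of unknown backend values reflects the later review fix rather than this commit.

```go
package main

import (
	"fmt"
	"strings"
)

// SearchConfig mirrors the two fields named in the commit message.
// The Go field names are assumptions.
type SearchConfig struct {
	Backend      string // "qdrant" | "github_native" | "bm25"
	BM25Fallback bool
}

// Config is a reduced, hypothetical stand-in for the real config struct.
type Config struct {
	Search          SearchConfig
	QdrantURL       string
	EmbeddingAPIKey string
}

// Validate requires Qdrant and embedding settings only when the qdrant
// backend is selected, and rejects backend values outside the allowed set.
func (c *Config) Validate() error {
	backend := c.Search.Backend
	if backend == "" {
		backend = "qdrant" // empty means the backward-compatible default
	}
	switch backend {
	case "qdrant":
		if strings.TrimSpace(c.QdrantURL) == "" {
			return fmt.Errorf("config validation failed: %s is empty", "qdrant.url")
		}
		if strings.TrimSpace(c.EmbeddingAPIKey) == "" {
			return fmt.Errorf("config validation failed: %s is empty", "embedding.api_key")
		}
	case "github_native", "bm25":
		// No Qdrant or embedding settings are needed for these backends.
	default:
		return fmt.Errorf("invalid search.backend: %q; must be one of: qdrant, github_native, bm25", c.Search.Backend)
	}
	return nil
}
```

With this shape, a config that sets only `Search.Backend` to `github_native` validates without any Qdrant or embedding values, while a misspelled backend fails early instead of silently falling through.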
Searcher wraps /search/issues?search_type=hybrid using the go-github client for oauth2 auth. Returns SearchHit slice and a rateLimited bool so callers can fall back gracefully without treating rate-limits as errors. searchIssuesRaw is an unexported helper used by unit tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Kavirubc <hapuarachchikaviru@gmail.com>
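One way to separate "rate-limited" from "failed" is to classify the HTTP response before building an error. The helper below is a hypothetical sketch (the PR's actual detection logic is not shown in this conversation); it assumes GitHub signals rate limiting with status 429, or with status 403 plus either an exhausted `X-RateLimit-Remaining` header or a `Retry-After` header.

```go
package main

import "net/http"

// isRateLimited reports whether a GitHub API response looks like a
// rate-limit rejection rather than a genuine failure.
// Hypothetical helper: the name and the exact rules are assumptions.
func isRateLimited(status int, h http.Header) bool {
	if status == http.StatusTooManyRequests {
		return true
	}
	if status != http.StatusForbidden {
		return false
	}
	// A 403 only counts as rate limiting when the quota is exhausted or
	// the server asks the client to retry later.
	return h.Get("X-RateLimit-Remaining") == "0" || h.Get("Retry-After") != ""
}
```

A caller can then return `(nil, true, nil)` for a rate-limited response and reserve the error value for everything else, which is the contract the commit message describes.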
Add GitHubSearcher *github.Searcher to Dependencies for injection into github_similarity step. Add issue-triage-github and similarity-only-github presets that replace the Qdrant steps with github_similarity. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Kavirubc <hapuarachchikaviru@gmail.com>
Pure-function BM25 Okapi implementation (~100 lines, no new dependency). tokenize() lowercases, strips punctuation, and filters tokens < 2 chars. bm25Score() returns [0,1] normalized scores with k1=1.5 b=0.75. Used by github_similarity for re-ranking GitHub search candidates. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Kavirubc <hapuarachchikaviru@gmail.com>
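For readers unfamiliar with BM25 Okapi, here is a compact sketch of the scoring described above, using the same `k1=1.5` and `b=0.75`. It is an illustration under stated assumptions, not the PR's `bm25.go`: the function signatures are guesses, the IDF uses the common `log(1 + ...)` variant, and scores are normalized by dividing by the top score, since the commit does not say how the `[0,1]` range is obtained.

```go
package main

import (
	"math"
	"strings"
	"unicode"
)

const (
	k1 = 1.5  // term-frequency saturation
	b  = 0.75 // document-length normalization strength
)

// tokenize lowercases s, splits on anything that is not a letter or digit,
// and drops tokens shorter than two characters.
func tokenize(s string) []string {
	fields := strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	out := fields[:0]
	for _, f := range fields {
		if len([]rune(f)) >= 2 {
			out = append(out, f)
		}
	}
	return out
}

// bm25Score scores every document against the query and divides by the
// highest score, so the best match gets 1 and non-matches get 0.
// Max-normalization is an assumption made for this sketch.
func bm25Score(query string, docs []string) []float64 {
	scores := make([]float64, len(docs))
	if len(docs) == 0 {
		return scores
	}
	tf := make([]map[string]int, len(docs)) // term counts per document
	lens := make([]int, len(docs))          // token count per document
	df := map[string]int{}                  // documents containing each term
	total := 0
	for i, d := range docs {
		toks := tokenize(d)
		tf[i] = map[string]int{}
		for _, t := range toks {
			tf[i][t]++
		}
		for t := range tf[i] {
			df[t]++
		}
		lens[i] = len(toks)
		total += len(toks)
	}
	n := float64(len(docs))
	avgLen := float64(total) / n
	qToks := tokenize(query)
	maxScore := 0.0
	for i := range docs {
		for _, q := range qToks {
			f := float64(tf[i][q])
			if f == 0 {
				continue
			}
			idf := math.Log(1 + (n-float64(df[q])+0.5)/(float64(df[q])+0.5))
			norm := 1 - b + b*float64(lens[i])/avgLen
			scores[i] += idf * f * (k1 + 1) / (f + k1*norm)
		}
		maxScore = math.Max(maxScore, scores[i])
	}
	if maxScore > 0 {
		for i := range scores {
			scores[i] /= maxScore
		}
	}
	return scores
}
```

Because the scores are relative to the best candidate in the corpus, a threshold tuned for cosine similarity does not carry over, which is consistent with the much lower `similarity_threshold` used elsewhere in this PR.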
GitHubSimilarity uses GitHub hybrid search as tier 1 and BM25 over ListIssues as tier 2 fallback (rate-limit or empty results). BM25 corpus is capped at 500 issues. Results are normalized [0,1] and filtered by similarity_threshold before populating ctx.SimilarIssues. Step is registered as "github_similarity" and used by the new issue-triage-github and similarity-only-github presets. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Kavirubc <hapuarachchikaviru@gmail.com>
process.go and batch.go: skip embedder initialization when search.backend != "qdrant" (prevents hard failure for users without an embedding API key). Initialize GitHubSearcher when backend is "github_native" or "bm25". Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Kavirubc <hapuarachchikaviru@gmail.com>
The --workflow flag defaulted to "issue-triage" hardcoded, always overriding whatever workflow: was set in the config file. Change the default to "" so process.go falls through to cfg.Workflow, letting users set workflow: issue-triage-github in their simili.yaml without needing a CLI flag. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Kavirubc <hapuarachchikaviru@gmail.com>
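The precedence this fix establishes can be written as a small pure function. This is a hypothetical sketch: `resolveWorkflow` is not a name from the PR, and falling back to `"issue-triage"` when both the flag and the config are empty is an assumption.

```go
package main

// defaultWorkflow is the assumed last-resort value when neither the CLI
// flag nor the config file names a workflow.
const defaultWorkflow = "issue-triage"

// resolveWorkflow applies the precedence: explicit flag, then config file,
// then the default. An empty flag value means "flag not set".
func resolveWorkflow(flagValue, configValue string) string {
	if flagValue != "" {
		return flagValue
	}
	if configValue != "" {
		return configValue
	}
	return defaultWorkflow
}
```

The key point is that the flag's default must be the empty string; with a non-empty default the first branch always wins and the config value is never consulted.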
Update all config files and the triage workflow to use the new github_native search backend (workflow: issue-triage-github).
Changes:
- .github/simili.yaml: remove qdrant/embedding blocks, add search config
- .simili.yaml: same for local dev
- .github/workflows/triage.yml: remove QDRANT_URL and QDRANT_API_KEY secrets
- DOCS/examples/*/simili.yaml: update both examples; keep qdrant as commented legacy block
similarity_threshold lowered to 0.15 (appropriate for BM25 vs 0.70 for cosine).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Kavirubc <hapuarachchikaviru@gmail.com>
SearchIssues now accepts an itemType param ("issue"|"pr"|"") so the
GitHub hybrid search query uses is:issue or is:pr appropriately.
fetchAllIssues filters the ListIssues result to match itemType.
github_similarity sets itemType from ctx.Issue.EventType so PR events
find similar PRs and issue events find similar issues.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Kavirubc <hapuarachchikaviru@gmail.com>
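How `itemType` might translate into search qualifiers is sketched below. `buildSearchQuery` and its parameters are hypothetical; only the `repo:`, `is:issue`, and `is:pr` qualifiers are standard GitHub search syntax, and the PR may add further qualifiers that are not shown in this conversation.

```go
package main

import "strings"

// buildSearchQuery assembles a GitHub issue-search query string.
// itemType is "issue", "pr", or "" (no type filter).
// Hypothetical helper, not the PR's implementation.
func buildSearchQuery(org, repo, text, itemType string) string {
	parts := []string{"repo:" + org + "/" + repo}
	switch itemType {
	case "issue":
		parts = append(parts, "is:issue")
	case "pr":
		parts = append(parts, "is:pr")
	}
	if t := strings.TrimSpace(text); t != "" {
		parts = append(parts, t)
	}
	return strings.Join(parts, " ")
}
```

Passing an empty `itemType` leaves the type unconstrained, so the same function serves callers that want mixed issue and PR results.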
Signed-off-by: Kaviru Hapuarachchi <kavirurh@gmail.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Satisfies errcheck linter requirement. Signed-off-by: Kaviru Hapuarachchi <kavirurh@gmail.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Kavirubc <hapuarachchikaviru@gmail.com>
Force-pushed from 9f722af to 7b425b9
Actionable comments posted: 3
🧹 Nitpick comments (1)
internal/integrations/github/searcher_test.go (1)
70-70: 💤 Low value: Check the error return from w.Write.
While this is in test code and unlikely to fail, checking the error return value is good practice and satisfies the linter.
✨ Proposed fix
- w.Write([]byte(`{"message":"API rate limit exceeded"}`))
+ _, _ = w.Write([]byte(`{"message":"API rate limit exceeded"}`))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/integrations/github/searcher_test.go` at line 70, The test currently calls w.Write([]byte(`{"message":"API rate limit exceeded"}`)) without checking the returned error; change this to capture and assert the error from w.Write (e.g. _, err := w.Write(...); require.NoError(t, err) or if not using testify then if err != nil { t.Fatalf("w.Write failed: %v", err) }) in internal/integrations/github/searcher_test.go to satisfy the linter and ensure write failures are surfaced.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/simili/commands/batch.go`:
- Around line 337-343: The config Validate() function should enforce that
Search.Backend is one of the allowed values; add a whitelist check inside
Validate() for Search.Backend (valid values: "qdrant", "github_native", "bm25")
and return a clear error when it is not valid; specifically update
internal/core/config/config.go in the Validate() method to read
cfg.Search.Backend and if it is not one of those three strings return an error
like "invalid Search.Backend: %q; must be one of: qdrant, github_native, bm25"
so downstream code (e.g., the GitHub searcher initialization) cannot proceed
with a misspelled backend.
In `@internal/integrations/github/searcher.go`:
- Around line 98-101: The type inference in the searcher uses item.PullRequest
!= nil to set t to "pr" but the test helper searchIssuesRaw additionally checks
strings.TrimSpace(item.PullRequest.URL) != "" causing inconsistency; make them
consistent by removing the extra URL non-empty check from searchIssuesRaw so
both rely on item.PullRequest != nil (refer to the item.PullRequest check and
the searchIssuesRaw helper) and update tests accordingly.
In `@internal/steps/github_similarity.go`:
- Around line 192-227: The loop appends hits to the slice all and only checks
len(all) >= bm25CorpusCap after appending (and already breaks when resp.NextPage
== 0), which allows all to exceed bm25CorpusCap and leaves the final if-block
dead; change the logic in the issues iteration (the for _, iss := range issues
loop) to check whether len(all) >= bm25CorpusCap before appending each
githubpkg.SearchHit (or compute remaining capacity and only append up to that
capacity), remove the unreachable trailing if block that logs and breaks (the
duplicate len(all) >= bm25CorpusCap block), and ensure pagination
(resp.NextPage) still breaks normally; reference variables/functions: all,
bm25CorpusCap, issues loop, iss, resp.NextPage, and githubpkg.SearchHit.
---
Nitpick comments:
In `@internal/integrations/github/searcher_test.go`:
- Line 70: The test currently calls w.Write([]byte(`{"message":"API rate limit
exceeded"}`)) without checking the returned error; change this to capture and
assert the error from w.Write (e.g. _, err := w.Write(...); require.NoError(t,
err) or if not using testify then if err != nil { t.Fatalf("w.Write failed: %v",
err) }) in internal/integrations/github/searcher_test.go to satisfy the linter
and ensure write failures are surfaced.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a1edd0ef-b0bb-45f4-95f2-5988c42e37db
📒 Files selected for processing (17)
- .github/simili.yaml
- .github/workflows/triage.yml
- .simili.yaml
- DOCS/examples/multi-repo/simili.yaml
- DOCS/examples/single-repo/simili.yaml
- cmd/simili/commands/batch.go
- cmd/simili/commands/process.go
- internal/core/config/config.go
- internal/core/config/config_test.go
- internal/core/pipeline/registry.go
- internal/integrations/github/searcher.go
- internal/integrations/github/searcher_test.go
- internal/steps/bm25.go
- internal/steps/bm25_test.go
- internal/steps/github_similarity.go
- internal/steps/github_similarity_test.go
- internal/steps/register.go
I, Kavirubc <hapuarachchikaviru@gmail.com>, hereby add my Signed-off-by to this commit: f11ee1b Signed-off-by: Kavirubc <hapuarachchikaviru@gmail.com>
Pull request overview
This PR adds a zero-infrastructure similarity-search path that uses GitHub’s hybrid search API for candidate retrieval and an in-process BM25 scorer for re-ranking, making Qdrant/embedding optional while preserving backward compatibility via search.backend: qdrant.
Changes:
- Introduces a GitHub-native similarity step (github_similarity) backed by /search/issues?search_type=hybrid plus BM25 re-ranking (with ListIssues-based fallback).
- Adds search.backend / search.bm25_fallback configuration, updates presets, and adjusts CLI wiring to initialize dependencies conditionally.
- Updates repository and documentation configs/workflows to default to the GitHub-native backend and removes Qdrant secrets from Actions.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| internal/steps/register.go | Registers the new github_similarity step. |
| internal/steps/github_similarity.go | Implements GitHub hybrid search → BM25 re-rank, with ListIssues BM25 fallback. |
| internal/steps/github_similarity_test.go | Adds basic tests for skip/dry-run behavior. |
| internal/steps/bm25.go | Adds a small in-process BM25 tokenizer/scorer. |
| internal/steps/bm25_test.go | Adds unit tests for tokenization and BM25 scoring behavior. |
| internal/integrations/github/searcher.go | Adds a Searcher wrapper for GitHub hybrid search and rate-limit detection. |
| internal/integrations/github/searcher_test.go | Adds tests for JSON decoding and a rate-limit-related check. |
| internal/core/pipeline/registry.go | Extends dependencies and adds new GitHub-native workflow presets. |
| internal/core/config/config.go | Adds SearchConfig, defaulting to qdrant; validation skips Qdrant requirements for non-qdrant backends. |
| internal/core/config/config_test.go | Adds tests for search defaults, YAML parsing, and validation behavior. |
| cmd/simili/commands/process.go | Respects workflow from config when flag unset; initializes GitHub searcher; gates embedder init by backend. |
| cmd/simili/commands/batch.go | Initializes GitHub searcher; gates embedder init by backend. |
| DOCS/examples/single-repo/simili.yaml | Updates example config to use github_native backend + new preset. |
| DOCS/examples/multi-repo/simili.yaml | Updates example config to use github_native backend + new preset. |
| .simili.yaml | Updates local dev config to use github_native. |
| .github/workflows/triage.yml | Removes Qdrant secrets for Actions workflow; documents optional LLM key. |
| .github/simili.yaml | Updates Actions config to use github_native backend + new preset. |
    // Build a Searcher pointing at the test server via a plain http.Client.
    // We test rate-limit detection using the raw helper since NewSearcher
    // requires oauth2; the HTTP-level test uses a real server to verify header parsing.
    req, _ := http.NewRequest(http.MethodGet, srv.URL+"/search/issues", nil)
    resp, err := http.DefaultClient.Do(req)

    func TestGitHubSimilarity_SkipOnTransferDetected(t *testing.T) {
    	called := false
    	step := &GitHubSimilarity{
    		// Use nil searcher; the transfer skip should fire first.
    		searcher: nil,
    	}

    func (c *Config) Validate() error {
    	requiredFields := []struct {
    		name   string
    		envVar string
    		value  string
    	}{
    		{name: "qdrant.url", envVar: "QDRANT_URL", value: c.Qdrant.URL},
    		{name: "qdrant.api_key", envVar: "QDRANT_API_KEY", value: c.Qdrant.APIKey},
    		{name: "qdrant.collection", envVar: "QDRANT_COLLECTION", value: c.Qdrant.Collection},
    		{name: "embedding.api_key", envVar: "EMBEDDING_API_KEY", value: c.Embedding.APIKey},
    	}

    	for _, field := range requiredFields {
    		if strings.TrimSpace(field.value) == "" {
    			return fmt.Errorf(
    				"config validation failed: %s is empty (check %s environment variable)",
    				field.name,
    				field.envVar,
    			)
    	// Qdrant and embedding API key are only required when using the qdrant backend.
    	if c.Search.Backend == "" || c.Search.Backend == "qdrant" {
    		required := []struct {
- github_similarity: respect search.bm25_fallback config; backend=bm25 skips GitHub hybrid search and goes straight to ListIssues+BM25
- github_similarity: fix BM25 corpus cap: check before append, remove dead code block that could never be reached
- searcher: unify PR detection to item.PullRequest != nil in both SearchIssues and searchIssuesRaw (remove inconsistent URL check)
- searcher: remove unused strings import
- bm25: fix doc comment (returns zero-filled slice, not nil)
- config: validate Search.Backend against allowed values
- config: add TestValidateRejectsUnknownBackend
- process/batch: initialize embedder when embedding API key is present regardless of search backend
Signed-off-by: Kavirubc <hapuarachchikaviru@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
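The corpus-cap fix comes down to testing the limit before each append rather than after. A minimal sketch of that pattern, with pages of strings standing in for paginated ListIssues results (`collectCapped` is a hypothetical name; the PR uses `bm25CorpusCap` as the limit):

```go
package main

// collectCapped flattens paginated results into one slice and never
// returns more than limit items. The cap is checked before each append,
// so the result cannot overshoot even when a page straddles the limit.
func collectCapped(pages [][]string, limit int) []string {
	var all []string
	for _, page := range pages {
		for _, item := range page {
			if len(all) >= limit {
				return all // cap reached: stop before appending
			}
			all = append(all, item)
		}
	}
	return all
}
```

Checking after the append, as the reviewed code did, lets a full page push the slice past the cap before the loop notices.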
I have tested it using my fork repo and it works fine. So this PR LGTM.
Summary
Closes #120. Implements the zero-infrastructure search backend discussed in Discussion #112.
- GitHub hybrid search (/search/issues?search_type=hybrid) + in-process BM25 Okapi re-ranking
- Only GITHUB_TOKEN (automatic in Actions) is required
- Backward compatible via search.backend: qdrant

What changed
- feat(config): SearchConfig struct with backend / bm25_fallback; Validate() skips Qdrant checks for non-qdrant backends
- feat(github): Searcher wrapping /search/issues?search_type=hybrid with rate-limit detection
- feat(pipeline): GitHubSearcher in Dependencies; new issue-triage-github and similarity-only-github presets
- feat(steps/bm25)
- feat(steps/github_similarity)
- feat(cmd): GitHubSearcher wiring in process.go and batch.go
- fix(cmd): --workflow flag no longer hardcodes "issue-triage", respects workflow: from config file
- fix(search): PR events use is:pr filter; issue events use is:issue
- chore(config): github_native; QDRANT_* secrets removed from triage.yml

How it works
If GitHub search returns 0 results or hits the 10 req/min rate-limit, tier 2 runs BM25 directly over all open issues/PRs fetched from the list API (capped at 500).
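The two-tier flow can be sketched as a function that takes each tier as a callback. This is an illustration under assumptions, not the code in `github_similarity.go`: `Hit` is a reduced stand-in for the PR's `SearchHit`, the BM25 re-ranking of tier 1 results is omitted, `fallback` stands for the `search.bm25_fallback` setting, and treating the threshold as inclusive (`>=`) is a guess.

```go
package main

// Hit is a reduced, hypothetical stand-in for the PR's SearchHit type.
type Hit struct {
	Number int
	Score  float64 // assumed to be normalized to [0,1] already
}

// findSimilar runs tier 1 (GitHub hybrid search) and, when it is rate
// limited or returns nothing and fallback is enabled, runs tier 2 (BM25
// over listed issues). Hits below the threshold are dropped.
func findSimilar(
	tier1 func() (hits []Hit, rateLimited bool, err error),
	tier2 func() ([]Hit, error),
	fallback bool,
	threshold float64,
) ([]Hit, error) {
	hits, rateLimited, err := tier1()
	if err != nil {
		return nil, err
	}
	if (rateLimited || len(hits) == 0) && fallback {
		if hits, err = tier2(); err != nil {
			return nil, err
		}
	}
	var kept []Hit
	for _, h := range hits {
		if h.Score >= threshold {
			kept = append(kept, h)
		}
	}
	return kept, nil
}
```

Passing the tiers as functions keeps the decision logic testable without a GitHub token: a test can supply a tier 1 that always reports a rate limit and check that tier 2 results come back filtered.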
Config to use the new backend
Test plan
- go build ./... passes
- go test ./... passes (all existing + new tests)
- go vet ./... passes
- similigh/simili-bot with github_native backend confirmed
- github_similarity step fires
- is:pr filter

🤖 Generated with Claude Code