Commit 2397bd1

chore: improve search
1 parent c3b9f2c commit 2397bd1

11 files changed

Lines changed: 243 additions & 19 deletions


.gitignore

Lines changed: 1 addition & 0 deletions
@@ -21,3 +21,4 @@ op*.txt
 b.sh
 .DS_Store
 .claude
+*.patch

README.md

Lines changed: 4 additions & 4 deletions
@@ -32,7 +32,7 @@ Several projects address MCP tool sprawl in different ways: [RAG-MCP](https://gi
 
 - **No infrastructure.** One Go binary, local SQLite. No Docker, no vector DB service, no cloud account.
 - **IDE auto-import.** Reads your Claude Desktop, Cursor, or VS Code MCP config. No manual YAML unless you want it.
-- **Three modes in one tool.** Search mode (5 meta-tools) for weak models, direct mode (transparent proxy) for strong models, hybrid for both. Switch with a flag.
+- **Three modes in one tool.** Direct mode (transparent proxy) for simple setups and smaller models, search mode (5 meta-tools) for large catalogs with strong models, hybrid for both. Switch with a flag.
 - **Provider-agnostic.** Not tied to Anthropic, OpenAI, or any specific client. Anything that speaks MCP over stdio or HTTP.
 - **Reliability built in.** Circuit breaking, caching, session reuse, and tracing handled at the proxy layer.

@@ -89,11 +89,11 @@ No IDE config? Write a YAML file manually — see [Configuration](#configuration
 ## When to use which mode
 
-- **Search mode** (default) — the agent sees 5 meta-tools and discovers capabilities through search. Reduces prompt size and improves tool selection for smaller/cheaper models (Haiku, GPT-4.1-mini, local Ollama).
+- **Direct mode** — every cataloged tool is exposed by name. The agent sees real schemas, lazy-tool routes transparently. Best for smaller/cheaper models (Haiku, GPT-4.1-mini, local Ollama) that struggle with multi-step reasoning. They get a simple tool list and call tools directly — one step, no search overhead. Also good for strong models that benefit from single-endpoint aggregation, circuit breaking, and caching.
 
-- **Direct mode** — every cataloged tool is exposed by name. The agent sees real schemas, lazy-tool routes transparently. For strong models that handle large tool lists fine but benefit from single-endpoint aggregation, circuit breaking, and caching.
+- **Search mode** (default) — the agent sees 5 meta-tools and discovers capabilities through search. Best for strong models (Claude, GPT-4, Llama 70B+) working with large tool catalogs (50+ tools) where dumping every schema into context wastes tokens and degrades selection accuracy. Requires the model to handle a two-step search→invoke pattern.
 
-- **Hybrid mode** — both search and direct tools available. Useful for gradual migration.
+- **Hybrid mode** — both search and direct tools available. Useful for gradual migration or mixed workloads.
 
 ```bash
 lazy-tool serve # search (default)

benchmark/README.md

Lines changed: 62 additions & 5 deletions
@@ -52,13 +52,54 @@ Publishing honest benchmark claims is part of the project's reputation.
 
 ## Environment
 
-Recommended local environment:
+### Prerequisites
 
-- MCPJungle running locally (for baseline mode)
-- sample local MCPs registered
-- `lazy-tool` built from the repo root
-- valid `benchmark/configs/mcpjungle-lazy-tool.yaml`
+- Go 1.25+ (to build lazy-tool)
+- Node.js / npx (for the `everything` and `filesystem` MCP servers)
+- Python 3.11+ (for the benchmark harnesses)
+- [uv](https://docs.astral.sh/uv/) (recommended, for `mcp-server-time` via `uvx`)
 - At least one of: `GROQ_API_KEY`, `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`
+- For weak-model benchmarks: [Ollama](https://ollama.com) running locally with at least one model pulled
+
+### Setting up MCPJungle
+
+The benchmarks use [MCPJungle](https://github.com/mcpjungle/MCPJungle) as the upstream MCP gateway that hosts the test tools. Baseline mode connects directly to MCPJungle; search and direct modes connect through lazy-tool which indexes MCPJungle's catalog.
+
+**1. Install MCPJungle:**
+
+```bash
+# See https://github.com/mcpjungle/MCPJungle for full install instructions
+go install github.com/mcpjungle/mcpjungle@latest
+```
+
+**2. Start MCPJungle:**
+
+```bash
+mcpjungle serve
+# Default: http://127.0.0.1:8080/mcp (strong model suite)
+# Or configure a different port and pass --jungle-url to the benchmark scripts
+```
+
+**3. Register the sample MCP servers:**
+
+```bash
+./benchmark/mcpjungle-dev/register-samples.sh
+```
+
+This registers three MCP servers into MCPJungle:
+
+| Server | Transport | What it provides | Requires |
+|--------|-----------|------------------|----------|
+| `everything` | stdio | echo tool, prompts, resources (MCP reference server) | npx |
+| `filesystem` | stdio | read/write/list tools scoped to `/tmp/lazy-tool-mcpjungle-fs` | npx |
+| `time` | stdio | time conversion tools | uvx |
+
+**4. Verify tools are registered:**
+
+```bash
+mcpjungle list tools
+# Should show tools from everything, filesystem, and time servers
+```
 
 ### Python dependencies
 

@@ -72,6 +113,14 @@ uv pip install --python benchmark/.venv/bin/python -r benchmark/requirements.txt
 pip install -r benchmark/requirements.txt
 ```
 
+### Weak-model setup (Ollama)
+
+```bash
+# Install Ollama: https://ollama.com
+ollama serve            # start the server
+ollama pull qwen2.5:3b  # pull at least one model
+```
+
 ## Quick reproducible flow
 
 ### 1. Build and reindex

@@ -326,10 +375,18 @@ Keep raw artifacts around when updating public benchmark claims.
 ### `search_tools_smoke` returns zero hits
 
 Usually:
+- MCPJungle is not running or sample MCPs are not registered (see [Setting up MCPJungle](#setting-up-mcpjungle))
 - you forgot `reindex`
 - your source config is wrong
 - the indexed catalog is stale or empty
 
+Verify with:
+```bash
+export LAZY_TOOL_CONFIG=$PWD/benchmark/configs/mcpjungle-lazy-tool.yaml
+./bin/lazy-tool reindex
+./bin/lazy-tool search "echo" --limit 5
+```
+
 ### routed task chooses the wrong wrapper
 
 This is usually:

benchmark/run_weak_model_suite.sh

Lines changed: 36 additions & 0 deletions
@@ -28,6 +28,7 @@ LAZY_CONFIG=""
 JUNGLE_URL="http://127.0.0.1:8080/mcp"
 OLLAMA_URL="http://localhost:11434"
 SKIP_BUILD="false"
+SKIP_PREFLIGHT="false"
 MODELS=""
 TIER=""
 

@@ -40,6 +41,7 @@ while [[ $# -gt 0 ]]; do
     --jungle-url) JUNGLE_URL="${2:?missing value}"; shift 2 ;;
     --ollama-url) OLLAMA_URL="${2:?missing value}"; shift 2 ;;
     --skip-build) SKIP_BUILD="true"; shift ;;
+    --skip-preflight) SKIP_PREFLIGHT="true"; shift ;;
     --models) MODELS="${2:?missing value}"; shift 2 ;;
     --tier) TIER="${2:?missing value}"; shift 2 ;;
     *)

@@ -125,6 +127,40 @@ LAZY_TOOL_CONFIG="$LAZY_CONFIG" "$LAZY_BINARY" reindex 2>&1 || {
   exit 1
 }
 
+# ── Preflight catalog check ──────────────────────────────────────────────
+# Verify the catalog has the expected tools before running benchmarks.
+# Without this, a broken MCPJungle setup silently produces meaningless results.
+
+if [[ "$SKIP_PREFLIGHT" == "true" ]]; then
+  echo "Preflight: skipped (--skip-preflight)"
+else
+
+  echo "Preflight: verifying catalog..."
+  PREFLIGHT_FAIL=""
+  for query in "echo" "time"; do
+    HITS=$(LAZY_TOOL_CONFIG="$LAZY_CONFIG" "$LAZY_BINARY" search "$query" --limit 3 2>/dev/null \
+      | "$PYTHON" -c "import sys,json; d=json.load(sys.stdin); print(len(d.get('results',[])))" 2>/dev/null || echo "0")
+    if [[ "$HITS" == "0" ]]; then
+      PREFLIGHT_FAIL="${PREFLIGHT_FAIL} - search '$query' returned 0 results\n"
+    else
+      echo "  search '$query': $HITS hit(s) — ok"
+    fi
+  done
+
+  if [[ -n "$PREFLIGHT_FAIL" ]]; then
+    echo "" >&2
+    echo "ERROR: Preflight catalog check failed:" >&2
+    echo -e "$PREFLIGHT_FAIL" >&2
+    echo "The catalog does not contain expected tools." >&2
+    echo "Check that MCPJungle is running and sample MCPs are registered:" >&2
+    echo "  benchmark/mcpjungle-dev/register-samples.sh" >&2
+    echo "Then re-run: LAZY_TOOL_CONFIG=$LAZY_CONFIG $LAZY_BINARY reindex" >&2
+    exit 1
+  fi
+  echo "Preflight passed."
+
+fi # end skip-preflight guard
+
 # ── Prepare filesystem fixture ───────────────────────────────────────────
 
 FS_ROOT="/tmp/lazy-tool-mcpjungle-fs"

internal/search/candidate_path_test.go

Lines changed: 42 additions & 2 deletions
@@ -21,9 +21,9 @@ func TestSearch_candidatePath_substringMatrix(t *testing.T) {
 	}
 	rows := []row{
 		{
-			name:  "fts_hit_skips_full_substring_scan",
+			name:  "fts_sparse_augments_with_substring",
 			query: "create github issue",
-			want:  models.SearchCandidatePathSubstringSkippedFTSHit,
+			want:  models.SearchCandidatePathSubstringAugmentedFTSSparse,
 			fixture: models.CapabilityRecord{
 				ID: "1", Kind: models.CapabilityKindTool, SourceID: "github-gateway", SourceType: "gateway",
 				CanonicalName: "github_gateway__create_issue", OriginalName: "create_issue",

@@ -57,6 +57,46 @@
 		},
 	}
 
+	// When limit=1 and FTS returns 1 hit, substring scan is skipped (FTS has enough).
+	t.Run("fts_sufficient_skips_substring", func(t *testing.T) {
+		var mode string
+		prev := metrics.SearchCandidateGeneration
+		metrics.SearchCandidateGeneration = func(m string) { mode = m }
+		defer func() { metrics.SearchCandidateGeneration = prev }()
+
+		p := filepath.Join(t.TempDir(), "substr.db")
+		st, err := storage.OpenSQLite(p)
+		if err != nil {
+			t.Fatal(err)
+		}
+		defer func() { _ = st.Close() }()
+		ctx := context.Background()
+		rec := models.CapabilityRecord{
+			ID: "s1", Kind: models.CapabilityKindTool, SourceID: "github-gateway", SourceType: "gateway",
+			CanonicalName: "github_gateway__create_issue", OriginalName: "create_issue",
+			OriginalDescription: "Create an issue in a repo",
+			GeneratedSummary:    "Creates GitHub issues with title and body.",
+			SearchText:          "github-gateway create_issue repo title body issue",
+			VersionHash: "h1", LastSeenAt: time.Now(),
+			InputSchemaJSON: "{}", MetadataJSON: "{}",
+		}
+		if err := st.UpsertCapability(ctx, rec); err != nil {
+			t.Fatal(err)
+		}
+		svc := NewService(st, nil, embeddings.Noop{}, ScoreWeights{}, false)
+		// limit=1: FTS returns 1 hit which is >= limit, so substring is skipped
+		ranked, err := svc.Search(ctx, models.SearchQuery{Text: "create github issue", Limit: 1})
+		if err != nil {
+			t.Fatal(err)
+		}
+		if mode != models.SearchCandidatePathSubstringSkippedFTSHit {
+			t.Fatalf("metrics path: got %q want %q", mode, models.SearchCandidatePathSubstringSkippedFTSHit)
+		}
+		if ranked.CandidatePath != models.SearchCandidatePathSubstringSkippedFTSHit {
+			t.Fatalf("CandidatePath: got %q want %q", ranked.CandidatePath, models.SearchCandidatePathSubstringSkippedFTSHit)
+		}
+	})
+
 	for _, tc := range rows {
 		t.Run(tc.name, func(t *testing.T) {
 			var mode string

internal/search/e2e_test.go

Lines changed: 59 additions & 0 deletions
@@ -3,6 +3,7 @@ package search
 import (
 	"context"
 	"path/filepath"
+	"strings"
 	"testing"
 	"time"
 

@@ -208,6 +209,64 @@ func TestService_Search_userSummaryBoost(t *testing.T) {
 	}
 }
 
+func TestService_Search_userSummaryContentMatchesLexical(t *testing.T) {
+	p := filepath.Join(t.TempDir(), "s.db")
+	st, err := storage.OpenSQLite(p)
+	if err != nil {
+		t.Fatal(err)
+	}
+	defer func() { _ = st.Close() }()
+	ctx := context.Background()
+
+	// Two tools with identical generated summaries. Only "b" has a user summary
+	// containing the search term "email". The lexical scorer should rank "b"
+	// higher because its effective summary matches the query.
+	a := models.CapabilityRecord{
+		ID: "a", Kind: models.CapabilityKindTool, SourceID: "s", SourceType: "gateway",
+		CanonicalName: "s__a", OriginalName: "a_tool",
+		GeneratedSummary: "generic helper utility",
+		SearchText: "s a_tool generic helper utility email", VersionHash: "1", LastSeenAt: time.Now(),
+		InputSchemaJSON: "{}", MetadataJSON: "{}",
+	}
+	b := models.CapabilityRecord{
+		ID: "b", Kind: models.CapabilityKindTool, SourceID: "s", SourceType: "gateway",
+		CanonicalName: "s__b", OriginalName: "b_tool",
+		GeneratedSummary: "generic helper utility",
+		UserSummary: "sends email notifications to users",
+		SearchText: "s b_tool generic helper utility sends email notifications", VersionHash: "2", LastSeenAt: time.Now(),
+		InputSchemaJSON: "{}", MetadataJSON: "{}",
+	}
+	if err := st.UpsertCapability(ctx, a); err != nil {
+		t.Fatal(err)
+	}
+	if err := st.UpsertCapability(ctx, b); err != nil {
+		t.Fatal(err)
+	}
+	svc := NewService(st, nil, embeddings.Noop{}, DefaultScoreWeights(), false)
+	out, err := svc.Search(ctx, models.SearchQuery{Text: "email", Limit: 5})
+	if err != nil {
+		t.Fatal(err)
+	}
+	if len(out.Results) < 1 {
+		t.Fatal("expected at least 1 result")
+	}
+	// "b" should rank first because its effective summary (user summary) contains "email"
+	if out.Results[0].CapabilityID != b.ID {
+		t.Fatalf("user summary content should boost relevance; want b first, got %+v", out.Results)
+	}
+	// Verify the summary match signal is present
+	found := false
+	for _, w := range out.Results[0].WhyMatched {
+		if strings.Contains(w, "summary:") {
+			found = true
+			break
+		}
+	}
+	if !found {
+		t.Fatalf("expected summary: signal in why_matched: %v", out.Results[0].WhyMatched)
+	}
+}
+
 func TestService_Search_noopEmbeddingsNoPanic(t *testing.T) {
 	p := filepath.Join(t.TempDir(), "s.db")
 	st, err := storage.OpenSQLite(p)

internal/search/scoring.go

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ func scoreLexical(needle string, tokens []string, rec *models.CapabilityRecord)
 	}
 
 	on := strings.ToLower(rec.OriginalName)
-	sum := strings.ToLower(rec.GeneratedSummary)
+	sum := strings.ToLower(rec.EffectiveSummary())
 	src := strings.ToLower(rec.SourceID)
 
 	if needle != "" {
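
The one-line change above makes the lexical scorer read `rec.EffectiveSummary()` instead of the raw generated summary. The method's semantics are implied by the new e2e test (a user summary containing the query term lifts that record to the top rank): prefer the user-provided summary, fall back to the generated one. A minimal sketch under that assumption — the struct here is a two-field stand-in, not the real `models.CapabilityRecord`:

```go
package main

import "fmt"

// CapabilityRecord is a two-field stand-in for the real struct in
// internal/models, which carries many more fields.
type CapabilityRecord struct {
	GeneratedSummary string
	UserSummary      string
}

// EffectiveSummary prefers the user-provided summary and falls back to the
// generated one. This behavior is inferred from the commit's e2e test, not
// copied from the real implementation.
func (r CapabilityRecord) EffectiveSummary() string {
	if r.UserSummary != "" {
		return r.UserSummary
	}
	return r.GeneratedSummary
}

func main() {
	a := CapabilityRecord{GeneratedSummary: "generic helper utility"}
	b := CapabilityRecord{GeneratedSummary: "generic helper utility", UserSummary: "sends email notifications to users"}
	fmt.Println(a.EffectiveSummary())
	fmt.Println(b.EffectiveSummary())
}
```

With this fallback in scoring, annotating a tool with a user summary directly changes its lexical relevance, which is what the new `userSummaryContentMatchesLexical` test pins down.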

internal/search/service.go

Lines changed: 11 additions & 5 deletions
@@ -376,9 +376,10 @@ func (s *Service) buildCandidates(ctx context.Context, q models.SearchQuery, nee
 		}
 	}
 
-	// Substring scan over the full catalog: only when FTS did not already return hits.
-	// If BM25 returned candidates, repeating a per-row substring pass is redundant for normal queries.
-	if match != "" && len(ftsIDs) > 0 {
+	// Substring scan over the full catalog: skip only when FTS returned enough candidates
+	// to fill the request. When FTS returns sparse results (fewer than limit), augment with
+	// substring scan so that near-matches are not lost to BM25 ranking gaps.
+	if match != "" && len(ftsIDs) >= q.Limit {
 		metrics.SearchCandidateGeneration(models.SearchCandidatePathSubstringSkippedFTSHit)
 		return out, models.SearchCandidatePathSubstringSkippedFTSHit, nil
 	}

@@ -388,9 +389,14 @@ func (s *Service) buildCandidates(ctx context.Context, q models.SearchQuery, nee
 		return out, models.SearchCandidatePathFullCatalogSubstringDisabled, nil
 	}
 
-	subPath := models.SearchCandidatePathSubstringFullCatalogFTSZeroRows
-	if match == "" {
+	var subPath string
+	switch {
+	case match == "":
 		subPath = models.SearchCandidatePathSubstringFullCatalogNoFTSMatch
+	case len(ftsIDs) == 0:
+		subPath = models.SearchCandidatePathSubstringFullCatalogFTSZeroRows
+	default:
+		subPath = models.SearchCandidatePathSubstringAugmentedFTSSparse
+	}
 	metrics.SearchCandidateGeneration(subPath)
 	subIDs, err := s.Store.ListIDsBySearchTextSubstring(ctx, needle, q.SourceIDs)
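
The rewritten control flow in `buildCandidates` condenses to a small decision function: run the substring scan unless FTS already filled the requested limit, and record why. This is a sketch of the logic only — the path string values are illustrative stand-ins for the `models.SearchCandidatePath*` constants, and `choosePath` is not a function in the real codebase:

```go
package main

import "fmt"

// Illustrative stand-ins for the models.SearchCandidatePath* constants
// named in the diff; the real values may differ.
const (
	pathSkippedFTSHit      = "substring_skipped_fts_hit"
	pathNoFTSMatch         = "substring_full_catalog_no_fts_match"
	pathFTSZeroRows        = "substring_full_catalog_fts_zero_rows"
	pathAugmentedFTSSparse = "substring_augmented_fts_sparse"
)

// choosePath mirrors the decision after this commit: the substring scan is
// skipped only when FTS produced enough candidates to fill the limit;
// otherwise the scan runs, and the recorded path explains why it ran.
func choosePath(match string, ftsHits, limit int) (runSubstring bool, path string) {
	if match != "" && ftsHits >= limit {
		// FTS alone can satisfy the request; substring scan is redundant.
		return false, pathSkippedFTSHit
	}
	switch {
	case match == "":
		// No usable FTS query (e.g. single-char input): substring only.
		return true, pathNoFTSMatch
	case ftsHits == 0:
		// FTS ran but matched nothing: full-catalog substring fallback.
		return true, pathFTSZeroRows
	default:
		// FTS matched, but fewer rows than requested: augment with substring.
		return true, pathAugmentedFTSSparse
	}
}

func main() {
	_, p := choosePath(`"create" AND "github"`, 1, 1)
	fmt.Println(p) // FTS hit count meets the limit, scan skipped
	_, p = choosePath(`"create" AND "github"`, 1, 5)
	fmt.Println(p) // sparse FTS, substring scan augments
}
```

The old code skipped the scan whenever FTS returned any hit at all; tying the skip to `q.Limit` is what turns the previous all-or-nothing behavior into the augmented sparse path the new tests exercise.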

internal/storage/fts.go

Lines changed: 6 additions & 1 deletion
@@ -40,7 +40,12 @@ func tagsJoined(rec models.CapabilityRecord) string {
 	return strings.Join(rec.Tags, " ")
 }
 
-// ftsTokenize splits a query into FTS-safe tokens (letters/digits runs, min length 2). Aligns with search tokenization.
+// ftsTokenize splits a query into FTS-safe tokens (letters/digits runs, min length 2).
+// Single-char tokens are dropped because they produce excessive FTS matches across the
+// entire catalog without adding discriminative value. The FTS5 porter unicode61 tokenizer
+// does index single-char tokens, but querying on them returns too many false positives.
+// When ftsTokenize returns no tokens (e.g. single-letter query), BuildFTSMatchQuery returns
+// "" and the search pipeline falls back to substring scan, which handles short queries fine.
 func ftsTokenize(s string) []string {
 	s = strings.ToLower(s)
 	var cur strings.Builder
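
The expanded comment describes behavior that can be sketched end to end. The following is a hypothetical reconstruction of the tokenize-then-build-match pipeline, written only to agree with the expectations in the new `fts_test.go` (`"a bc d ef"` yields `"bc" AND "ef"`, a single-char query yields `""`); the real `ftsTokenize` and `BuildFTSMatchQuery` may differ in detail:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenize lowercases the query and splits it into runs of letters/digits,
// dropping tokens shorter than 2 characters — a sketch of ftsTokenize's
// documented behavior. Len() counts bytes, which is fine for the ASCII
// examples used here.
func tokenize(s string) []string {
	s = strings.ToLower(s)
	var tokens []string
	var cur strings.Builder
	flush := func() {
		if cur.Len() >= 2 {
			tokens = append(tokens, cur.String())
		}
		cur.Reset()
	}
	for _, r := range s {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			cur.WriteRune(r)
		} else {
			flush()
		}
	}
	flush()
	return tokens
}

// buildMatch quotes each surviving token and joins with AND, matching the
// fts_test.go expectation. An empty result signals the caller to fall back
// to the substring scan.
func buildMatch(q string) string {
	toks := tokenize(q)
	quoted := make([]string, len(toks))
	for i, t := range toks {
		quoted[i] = `"` + t + `"`
	}
	return strings.Join(quoted, " AND ")
}

func main() {
	fmt.Println(buildMatch("a bc d ef")) // only 2+ char tokens survive
	fmt.Println(buildMatch("a") == "")   // single-char query: fall back to substring
}
```

Quoting each token keeps FTS5 from interpreting characters like `-` as query syntax, and the empty-string sentinel is exactly the condition the `buildCandidates` path logic branches on.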

internal/storage/fts_test.go

Lines changed: 19 additions & 0 deletions
@@ -170,3 +170,22 @@ func TestGetCapabilitiesByIDs(t *testing.T) {
 		t.Fatalf("%+v", m)
 	}
 }
+
+func TestFTS_singleCharQueryReturnsEmptyMatch(t *testing.T) {
+	// Single-char queries produce empty FTS MATCH strings by design.
+	// The search pipeline falls back to substring scan for these.
+	match := BuildFTSMatchQuery("a")
+	if match != "" {
+		t.Fatalf("single-char query should produce empty match, got %q", match)
+	}
+	// Two-char tokens should work normally.
+	match = BuildFTSMatchQuery("ab")
+	if match == "" {
+		t.Fatal("two-char query should produce non-empty match")
+	}
+	// Mixed: only 2+ char tokens survive.
+	match = BuildFTSMatchQuery("a bc d ef")
+	if match != `"bc" AND "ef"` {
+		t.Fatalf("want only 2+ char tokens, got %q", match)
+	}
+}
