Improve nav-to fuzzy pre-filtering with q-gram bigram bitset and corrected length thresholds #82457
…e container matching always non-fuzzy

Move _allowFuzzyMatching from the base PatternMatcher class to SimplePatternMatcher, since container matching is always non-fuzzy. Refactor MatchPatternChunk to try non-fuzzy first, then fuzzy as a fallback, removing the two-pass pattern from AddMatches. Remove CreateContainerPatternMatcher (replaced by CreateDotSeparatedContainerMatcher, which now accepts a nullable pattern). This is a prerequisite for the NavigateTo prefilter work, where the prefilter tells the caller whether fuzzy matching is worth attempting.

Co-authored-by: Cursor <cursoragent@cursor.com>
Mechanical update of all call sites following the PatternMatcher refactor:
- DeclarationFinder: explicit allowFuzzyMatching: false, includeMatchedSpans: false
- DocumentOutlineViewModel: add missing using on PatternMatcher
- PatternMatcherTests: explicit allowFuzzyMatching: false

Co-authored-by: Cursor <cursoragent@cursor.com>
Change StringBreaker's public API from string to ReadOnlySpan<char> parameters (AddWordParts, AddCharacterParts, AddParts, GenerateSpan) and all internal helpers. This enables span-based processing in the new NavigateTo prefilter code without allocating substrings. Co-authored-by: Cursor <cursoragent@cursor.com>
Add ComputeHash(ReadOnlySpan<char>) and GetCharacter(ReadOnlySpan<char>) overloads. Refactor ComputeHash(string) to delegate to the span overload. Simplify Add(char) and ProbablyContains(char) to delegate to the span overloads. Add doc remark on ProbablyContains(ReadOnlySpan<char>) explaining why it cannot share the caching optimization from the string overload. This enables allocation-free bloom filter operations in the NavigateTo prefilter. Co-authored-by: Cursor <cursoragent@cursor.com>
Move the NavigateTo pre-filter data out of TopLevelSyntaxTreeIndex into its own NavigateToSearchIndex (a new AbstractSyntaxIndex<NavigateToSearchIndex> subclass). This allows the lightweight filter data to be loaded independently of the heavyweight TopLevelSyntaxTreeIndex containing all declared symbols. Documents rejected by the filter never need to load the full index. Bump serialization format checksum from "51" to "52" to invalidate stale indices.
- New: NavigateToSearchIndex.cs, _Create.cs, _Persistence.cs
- TopLevelSyntaxTreeIndex: remove _navigateToSearchInfo field and related code

Co-authored-by: Cursor <cursoragent@cursor.com>
…grams, and length bitset

The core pre-filtering logic for NavigateTo, stored per-document in the NavigateToSearchIndex. Contains five bloom filters and a length bitset for fast document-level rejection:
- _humpCharFilter: individual uppercased hump-initial characters (e.g. 'G','B' for "GooBar")
- _humpBigramFilter: all C(k,2) ordered pairs of hump initials (e.g. "GB","GQ","BQ" for "GooBarQuux"), enabling non-contiguous CamelCase matching
- _humpPrefixFilter: lowercased prefixes of each hump (e.g. "g","go","goo","b","ba","bar" for "GooBar"), used by a DP algorithm for all-lowercase patterns
- _trigramFilter: 3-char sliding windows for LowercaseSubstring matching (e.g. "line" in "Readline")
- _containerFilter: hump chars from fully-qualified container names
- _symbolNameLengthBitset: 64-bit bitset for fuzzy match length pre-filtering

For all-lowercase patterns, a DP algorithm splits the pattern into segments that each match a stored hump prefix, avoiding the exponential capitalization enumeration. Includes extensive documentation with examples throughout.

Co-authored-by: Cursor <cursoragent@cursor.com>
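The hump-based filters and the all-lowercase DP described in this commit message can be illustrated with a small Python sketch. This is not the PR's C# code: plain sets stand in for the bloom filters, `split_humps` is a deliberately simplified stand-in for StringBreaker, and all function names are hypothetical.

```python
import re
from itertools import combinations


def split_humps(name: str) -> list[str]:
    # Simplified CamelCase splitter (assumption): an uppercase letter followed
    # by lowercase letters/digits, or a run of lowercase letters/digits.
    return re.findall(r"[A-Z][a-z0-9]*|[a-z0-9]+", name)


def build_hump_index(names):
    # Sets stand in for the bloom filters named in the commit message.
    initials, bigrams, prefixes = set(), set(), set()
    for name in names:
        humps = split_humps(name)
        firsts = [h[0].upper() for h in humps]
        initials.update(firsts)
        # All C(k,2) pairs of hump initials, kept in left-to-right order.
        bigrams.update(a + b for a, b in combinations(firsts, 2))
        for hump in humps:
            low = hump.lower()
            prefixes.update(low[:i] for i in range(1, len(low) + 1))
    return initials, bigrams, prefixes


def lowercase_pattern_could_match(pattern: str, prefixes: set[str]) -> bool:
    # reachable[i] is True when pattern[:i] can be cut into segments that are
    # all stored hump prefixes. This avoids enumerating capitalizations.
    reachable = [True] + [False] * len(pattern)
    for end in range(1, len(pattern) + 1):
        reachable[end] = any(
            reachable[start] and pattern[start:end] in prefixes
            for start in range(end)
        )
    return reachable[len(pattern)]


initials, bigrams, prefixes = build_hump_index(["GooBarQuux"])
print(sorted(bigrams))
print(lowercase_pattern_could_match("gobar", prefixes))
```

Because the sketch ignores hump order and which symbol a prefix came from, it can only say "maybe"; that is acceptable for a pre-filter, where false positives cost time but false negatives lose results.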
Integrate the lightweight NavigateToSearchIndex pre-filter into both the in-process and cached document search paths:
- InProcess: load the filter index first via NavigateToSearchIndex.GetRequiredIndexAsync, call CouldContainNavigateToMatch to decide whether to load the full TopLevelSyntaxTreeIndex. Pass the allowFuzzyMatching signal from the filter to the PatternMatcher.
- CachedDocumentSearch: add s_cachedFilterIndexMap for the filter index, load it before the full index in the parallel ForEachAsync loop.
- Add LowercaseSubstring -> Fuzzy mapping to s_kindPairs.

Co-authored-by: Cursor <cursoragent@cursor.com>
~1030 lines of declarative theory-based tests covering all match kinds and pre-filter behaviors. Organized into regions:
- Positive tests: verify CouldContainNavigateToMatch returns true for all supported PatternMatchKinds (Exact, Prefix, CamelCaseExact/Prefix/Substring, NonContiguous variants, StartOfWordSubstring, LowercaseSubstring, Fuzzy) with both mixed-case and all-lowercase patterns
- Negative tests: verify rejection when hump chars, bigrams, trigrams, and lengths don't match
- CrossHumpSubstring: documents NonLowercaseSubstring as intentionally not guaranteed
- Multiple symbols: document-level filter matches any symbol
- Fuzzy/non-fuzzy split: verify allowFuzzyMatching output signal
- Individual filter checks via TestAccessor (hump, DP, trigram, length)
- Container matching tests

Co-authored-by: Cursor <cursoragent@cursor.com>
End-to-end tests verifying that the NavigateToSearchIndex pre-filter correctly allows matches through the full NavigateTo pipeline:
- CamelCase hump bigram (GB -> GooBar)
- All-lowercase DP hump prefix (goo -> GooBar)
- Trigram substring (line -> Readline)
- Fuzzy match enabled by length check (ToEror -> ToError)
- No match when all checks fail (XyzXyzXyzXyz)

Co-authored-by: Cursor <cursoragent@cursor.com>
BenchmarkDotNet benchmarks for measuring NavigateToSearchIndex pre-filter performance across various pattern types (CamelCase, all-lowercase, trigram, fuzzy, container-qualified). Co-authored-by: Cursor <cursoragent@cursor.com>
…th thresholds

The existing fuzzy pre-filter used only a symbol-name-length bitset with a fixed ±2 delta, which was too coarse: in large files, many symbols share similar lengths, causing false positives that force expensive full-scan fuzzy matching. This commit improves the pre-filter in three ways:
1. Fix LengthCheckPasses to use WordSimilarityChecker.GetThreshold (±1 for pattern lengths 3–4, ±2 for 5+) and reject patterns < MinFuzzyLength (3), matching the actual fuzzy matching behavior.
2. Add a 37×37 exact bigram bitset (176 bytes per document) storing all lowercased 2-character sliding windows of symbol names. At query time, use Ukkonen's q-gram count lemma to compute a minimum shared bigram count: min_shared = |pattern| - 1 - 2k. If fewer pattern bigrams match, fuzzy matching is skipped for that document.
3. Extract WordSimilarityChecker.GetThreshold(int) overload and MinFuzzyLength constant so both the pre-filter and the actual fuzzy matcher share the same threshold logic.

Reference: Ukkonen, E. (1992). "Approximate string-matching with q-grams and maximal matches." Theoretical Computer Science, 92(1), 191–211. https://doi.org/10.1016/0304-3975(92)90143-4

Co-authored-by: Cursor <cursoragent@cursor.com>
Future consideration: fuzzy pre-filter effectiveness for short patterns

The bigram pre-filter (q-gram count lemma) has blind spots for short patterns due to the relationship between pattern length, the edit distance threshold k, and the resulting min_shared bigram count.

Current behavior (k=1 for length 3–4, k=2 for length ≥ 5):
| Pattern length | k | Length window | min_shared bigrams | Filtering power |
|---|---|---|---|---|
| 3 | 1 | ±1 (2–4) | 0 | None — always passes |
| 4 | 1 | ±1 (3–5) | 1 | Weak — need ≥1 of 3 |
| 5 | 2 | ±2 (3–7) | 0 | None — always passes |
| 6 | 2 | ±2 (4–8) | 1 | Weak — need ≥1 of 5 |
| 7 | 2 | ±2 (5–9) | 2 | Moderate |
| 8+ | 2 | ±2 | 3+ | Good |
Proposed: k=1 for length ≤ 6, k=2 for length ≥ 7
| Pattern length | k | Length window | min_shared bigrams | Filtering power |
|---|---|---|---|---|
| 3 | 1 | ±1 (2–4) | 0 | None — always passes |
| 4 | 1 | ±1 (3–5) | 1 | Weak — need ≥1 of 3 |
| 5 | 1 | ±1 (4–6) | 2 | Moderate — need ≥2 of 4 |
| 6 | 1 | ±1 (5–7) | 3 | Strong — need ≥3 of 5 |
| 7 | 2 | ±2 (5–9) | 2 | Moderate |
| 8+ | 2 | ±2 | 3+ | Good |
The key wins are at lengths 5 and 6, which go from zero/weak filtering to moderate/strong. The trade-off is less fuzzy tolerance (1 edit instead of 2) for 5–6 character patterns, but 2 edits on a 5-character string is 40% different — arguably too aggressive for useful fuzzy matching anyway.
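The two tables above follow directly from min_shared = |pattern| - 1 - 2k. A short Python sketch (function names are illustrative, not from the PR) reproduces the min_shared column for both threshold schedules:

```python
def min_shared_bigrams(pattern_length: int, k: int) -> int:
    # q-gram count lemma for bigrams; clamped at zero because a negative
    # requirement filters nothing.
    return max(0, pattern_length - 1 - 2 * k)


def current_threshold(pattern_length: int) -> int:
    # Current schedule: k=1 for lengths 3-4, k=2 for 5+.
    return 1 if pattern_length <= 4 else 2


def proposed_threshold(pattern_length: int) -> int:
    # Proposed schedule: k=1 for lengths up to 6, k=2 for 7+.
    return 1 if pattern_length <= 6 else 2


print("len  current  proposed")
for n in range(3, 9):
    cur = min_shared_bigrams(n, current_threshold(n))
    new = min_shared_bigrams(n, proposed_threshold(n))
    print(f"{n:>3}  {cur:>7}  {new:>8}")
```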
Additionally, we should consider raising MinFuzzyLength from 3 to 4 (or even 5). A 3-letter fuzzy match with k=1 means "Foo" matches "Goo", "Boo", "For", "Fo", "Food", etc. — extremely permissive with zero bigram selectivity. Raising it would reduce noise with minimal loss of useful matches.
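To make the permissiveness concrete, here is a small Python check. It assumes plain Levenshtein distance; the metric actually used by WordSimilarityChecker may treat some edits (for example transpositions) differently.

```python
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance (assumed metric).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def bigrams(s: str) -> set[str]:
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


for candidate in ["Goo", "Boo", "For", "Fo", "Food", "Fxo"]:
    shared = len(bigrams("Foo") & bigrams(candidate))
    print(candidate, edit_distance("Foo", candidate), shared)
```

Every candidate is within one edit of "Foo", and "Fxo" shares no bigram with it at all, which is why min_shared has to be 0 for a 3-character pattern with k=1.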
These are independent improvements and don't need to block this PR — just noting them for future work.
--
Note: I've implemented this change here: #82459
|
Be warned, this contributor is a known troll. |
/// Maximum allowed edit distance for a fuzzy match given the source length. Shorter strings
/// get a tighter threshold (1) to avoid excessive spurious hits; longer strings get a looser
/// threshold (2) to tolerate more typos.
/// </summary>
Followup to #82431
Summary
The NavigateTo pre-filter gates whether fuzzy (edit-distance) matching is attempted for a document. Previously it used only a symbol-name-length bitset with a fixed ±2 delta — too coarse in practice. In a large file with many 6-character symbols, searching for "FooBar" (also 6 characters) would pass the length check for every document, forcing an expensive full-scan fuzzy match against every symbol in those documents even when the character content is completely different.
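A length-only check of this kind can be sketched in Python as follows. This is an illustration, not the PR's code: the names are hypothetical, clamping long names into bit 63 is an assumption, and `fixed_delta=2` models the old blanket ±2 behavior while the default uses the ±1 (lengths 3–4) and ±2 (5+) thresholds with a minimum fuzzy length of 3.

```python
MIN_FUZZY_LENGTH = 3


def threshold(pattern_length: int) -> int:
    # 1 edit for lengths 3-4, 2 edits for 5+.
    return 1 if pattern_length <= 4 else 2


def build_length_bitset(names) -> int:
    # One bit per symbol-name length; lengths >= 63 share bit 63 (assumption).
    bits = 0
    for name in names:
        bits |= 1 << min(len(name), 63)
    return bits


def length_check_passes(pattern_length: int, bits: int, fixed_delta=None) -> bool:
    if fixed_delta is None:
        if pattern_length < MIN_FUZZY_LENGTH:
            return False  # too short for fuzzy matching at all
        delta = threshold(pattern_length)
    else:
        delta = fixed_delta  # old behavior: same delta for every length
    lo = min(63, max(0, pattern_length - delta))
    hi = min(63, pattern_length + delta)
    return any((bits >> n) & 1 for n in range(lo, hi + 1))


bits = build_length_bitset(["XyzWvq"])
# "FooBar" has length 6, so a document containing only "XyzWvq" still passes.
print(length_check_passes(len("FooBar"), bits))
```

The last line shows the false positive that motivates the rest of the PR: length alone cannot tell "FooBar" from "XyzWvq".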
This PR improves the fuzzy pre-filter in three ways:
1. Correct the length check thresholds to match what WordSimilarityChecker actually enforces: ±1 for pattern lengths 3–4, ±2 for 5+, and no fuzzy matching at all for patterns shorter than 3. Previously, the pre-filter used a blanket ±2 for all lengths and didn't reject short patterns, which meant it was more permissive than the actual fuzzy matcher.
2. Add an exact bigram bitset (38×38 = 1444 bits, 184 bytes per document) storing all lowercased 2-character sliding windows of symbol names. At query time, use the q-gram count lemma (Ukkonen, 1992) to require a minimum number of shared bigrams: min_shared = |pattern| - 1 - 2k, where k is the edit distance threshold. If fewer pattern bigrams match, the document is skipped for fuzzy matching entirely.
3. Share threshold logic by extracting WordSimilarityChecker.GetThreshold(int) and MinFuzzyLength so the pre-filter and the actual fuzzy matcher use the same constants instead of duplicating magic numbers.

Motivation
Consider a codebase with thousands of documents, many containing 6-character symbol names. When the user types "FooBar" in NavigateTo:
- Before: The length check passes for every document that has any symbol of length 4–8 (±2). The caller then creates a PatternMatcher with fuzzy matching enabled and runs it against every symbol in those documents. Most of these comparisons compute edit distance only to conclude there's no match.
- After: The length check passes (length 6, threshold ±2, checks 4–8). But the bigram check then asks: does at least 1 of the 5 bigrams ("fo","oo","ob","ba","ar") exist in this document? For documents with symbols like "XyzWvq", none of these bigrams are stored, so the document is skipped without ever creating a PatternMatcher.

Bigram bitset design
Characters are mapped to a 38-element alphabet:
- a–z → 0..25
- 0–9 → 26..35
- _ → 36

This gives exact membership for the 37 most common identifier characters and a single overflow bucket for rare Unicode characters. The bitset is a ulong[23] (184 bytes), compact enough to store per-document with negligible memory overhead.

Filtering effectiveness by pattern length
Lengths 3 and 5 get no bigram filtering benefit (min_shared = 0). For length 3, we should strongly consider whether fuzzy matching itself is too permissive to even support — with an edit distance threshold of 1, a 3-character pattern like "abc" would fuzzy-match "xbc", "axc", "abx", "ab", "abcd", etc. This is a potential follow-up.
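Putting the bitset design and the count check together, a minimal Python sketch might look like this. It is an illustration under assumptions, not the PR's C# implementation: the function names are hypothetical, the overflow bucket is assumed to be index 37, only ASCII letters are lowercased, and a list of 23 integers stands in for the ulong[23].

```python
ALPHABET_SIZE = 38
OTHER = 37  # assumed index of the single overflow bucket


def char_index(c: str) -> int:
    if "A" <= c <= "Z":
        c = chr(ord(c) + 32)  # lowercase ASCII letters only
    if "a" <= c <= "z":
        return ord(c) - ord("a")
    if "0" <= c <= "9":
        return 26 + ord(c) - ord("0")
    if c == "_":
        return 36
    return OTHER


def bigram_bits(s: str) -> list[int]:
    # One bit position per 2-character sliding window.
    idx = [char_index(c) for c in s]
    return [a * ALPHABET_SIZE + b for a, b in zip(idx, idx[1:])]


def build_bigram_bitset(names) -> list[int]:
    words = [0] * 23  # 23 * 64 = 1472 bits, enough for 38 * 38 = 1444
    for name in names:
        for bit in bigram_bits(name):
            words[bit >> 6] |= 1 << (bit & 63)
    return words


def bigram_check_passes(pattern: str, words: list[int], k: int) -> bool:
    needed = len(pattern) - 1 - 2 * k
    if needed <= 0:
        return True  # e.g. length 3 with k=1, or length 5 with k=2
    shared = sum(1 for bit in bigram_bits(pattern)
                 if (words[bit >> 6] >> (bit & 63)) & 1)
    return shared >= needed


doc = build_bigram_bitset(["XyzWvq"])
print(bigram_check_passes("FooBar", doc, k=2))
print(bigram_check_passes("abc", doc, k=1))
```

With these inputs, "FooBar" needs one shared bigram and finds none in a document containing only "XyzWvq", while the 3-character pattern passes unconditionally because its requirement is zero.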
Reference
Ukkonen, E. (1992). "Approximate string-matching with q-grams and maximal matches." Theoretical Computer Science, 92(1), 191–211. DOI: 10.1016/0304-3975(92)90143-4
The q-gram count lemma states: each edit operation can destroy at most q q-grams from a string. A string s has |s| - q + 1 q-gram positions, so if edit_distance(s, t) ≤ k, then at least |s| - q + 1 - q·k of s's q-gram positions must have a matching q-gram in t. For bigrams (q=2): min_shared = |pattern| - 1 - 2k.

Test plan
- LengthCheck theory tests for correct thresholds (pattern < 3 → false, 3–4 → ±1, 5+ → ±2)
- BigramCountCheck theory covering lengths 3–10, single-edit scenarios, underscore, digits, and Unicode
- BigramCountCheck_MultipleSymbols: bigrams accumulate across all symbols in a document
- BigramCountCheck_RejectsSameLengthDifferentContent: the key scenario, same length but disjoint characters
- BigramCountCheck_UnicodeFallsBackToOtherBucket: Unicode chars share the "other" bucket (documented false positive)
- BigramCountCheck_UnderscoreHasOwnIndex: underscore is exact, not in the "other" bucket
- NavigateToFuzzyPreFilterBenchmarks.cs with 8 benchmarks demonstrating length-only false positives vs. bigram true negatives

cd src/Tools/IdeCoreBenchmarks && dotnet run -c Release -- --filter "NavigateToFuzzyPreFilterBenchmarks"

Benchmarks: