Skip to content

Tighten fuzzy edit-distance threshold for length-5 patterns#82459

Draft
CyrusNajmabadi wants to merge 13 commits intodotnet:mainfrom
CyrusNajmabadi:features/tighten-fuzzy-threshold
Draft

Tighten fuzzy edit-distance threshold for length-5 patterns#82459
CyrusNajmabadi wants to merge 13 commits intodotnet:mainfrom
CyrusNajmabadi:features/tighten-fuzzy-threshold

Conversation

@CyrusNajmabadi
Copy link
Contributor

@CyrusNajmabadi CyrusNajmabadi commented Feb 19, 2026

Followup to #82457

Summary

Changes WordSimilarityChecker.GetThreshold from length <= 4 ? 1 : 2 to length <= 5 ? 1 : 2, so that 5-character patterns now require edit distance ≤ 1 instead of ≤ 2 for a fuzzy match.

Motivation

The original threshold tiers (≤4 → 1, ≥5 → 2) were chosen by me 10-15 years ago based on gut feeling and a small amount of manual testing. They work well in practice, but the jump to k=2 at length 5 creates a blind spot in the bigram pre-filter introduced in #82457: the q-gram count lemma gives min_shared = 5 - 1 - 2*2 = 0, meaning the bigram filter cannot eliminate any candidates for 5-character patterns.

The industry standard for this exact problem is Elasticsearch/Lucene's fuzziness: AUTO setting, which uses:

Pattern length Max edit distance
1–2 0 (no fuzzy matching)
3–5 1
6+ 2

This is grounded in Damerau's finding that approximately 80% of human misspellings are edit distance 1, and that allowing 2 edits on short strings produces too many false positives (e.g., two edits can transform a 5-character word into something completely unrelated).

What changes

Only the length-5 case is affected:

Pattern length Before (k) After (k)
1–2 N/A (MinFuzzyLength = 3) N/A (unchanged)
3–4 1 1 (unchanged)
5 2 1
6+ 2 2 (unchanged)

MinFuzzyLength = 3 is deliberately kept — below that length, fuzzy matching produces too many spurious hits regardless of threshold.

Effect on pre-filter

With the new threshold, the bigram pre-filter becomes effective at length 5. The min_shared column shows the minimum number of pattern bigrams that must appear in the candidate for the q-gram count lemma to permit a match (|pattern| - 1 - 2k). Higher values mean stronger filtering:

Pattern length Before: k, min_shared After: k, min_shared Filtering strength
3 k=1, 0 k=1, 0 None (always passes)
4 k=1, 1 (of 3 bigrams) k=1, 1 (of 3 bigrams) Weak
5 k=2, 0 k=1, 2 (of 4 bigrams) None -> Moderate
6 k=2, 1 (of 5 bigrams) k=2, 1 (of 5 bigrams) Weak
7 k=2, 2 (of 6 bigrams) k=2, 2 (of 6 bigrams) Moderate
8 k=2, 3 (of 7 bigrams) k=2, 3 (of 7 bigrams) Strong
10+ k=2, increasingly strong k=2, increasingly strong Very strong

Risk

Exceptionally low. This is a one-line behavioral change that tightens a fuzzy match heuristic for a single pattern length. Fuzzy matching is already a best-effort "did you mean?" feature — slightly fewer fuzzy matches at length 5 is the expected and desired outcome.

Cyrus Najmabadi and others added 12 commits February 19, 2026 10:31
…e container matching always non-fuzzy

Move _allowFuzzyMatching from the base PatternMatcher class to SimplePatternMatcher, since
container matching is always non-fuzzy. Refactor MatchPatternChunk to try non-fuzzy first
then fuzzy as a fallback, removing the two-pass pattern from AddMatches. Remove
CreateContainerPatternMatcher (replaced by CreateDotSeparatedContainerMatcher which now
accepts nullable pattern). This is prerequisite for the NavigateTo prefilter work where the
prefilter tells the caller whether fuzzy matching is worth attempting.

Co-authored-by: Cursor <cursoragent@cursor.com>
Mechanical update of all call sites following the PatternMatcher refactor:
- DeclarationFinder: explicit allowFuzzyMatching: false, includeMatchedSpans: false
- DocumentOutlineViewModel: add missing using on PatternMatcher
- PatternMatcherTests: explicit allowFuzzyMatching: false

Co-authored-by: Cursor <cursoragent@cursor.com>
Change StringBreaker's public API from string to ReadOnlySpan<char> parameters (AddWordParts,
AddCharacterParts, AddParts, GenerateSpan) and all internal helpers. This enables span-based
processing in the new NavigateTo prefilter code without allocating substrings.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add ComputeHash(ReadOnlySpan<char>) and GetCharacter(ReadOnlySpan<char>) overloads.
Refactor ComputeHash(string) to delegate to the span overload. Simplify Add(char) and
ProbablyContains(char) to delegate to the span overloads. Add doc remark on
ProbablyContains(ReadOnlySpan<char>) explaining why it cannot share the caching
optimization from the string overload. This enables allocation-free bloom filter
operations in the NavigateTo prefilter.

Co-authored-by: Cursor <cursoragent@cursor.com>
Move the NavigateTo pre-filter data out of TopLevelSyntaxTreeIndex into its own
NavigateToSearchIndex (a new AbstractSyntaxIndex<NavigateToSearchIndex> subclass).
This allows the lightweight filter data to be loaded independently of the heavyweight
TopLevelSyntaxTreeIndex containing all declared symbols. Documents rejected by the
filter never need to load the full index.

Bump serialization format checksum from "51" to "52" to invalidate stale indices.

- New: NavigateToSearchIndex.cs, _Create.cs, _Persistence.cs
- TopLevelSyntaxTreeIndex: remove _navigateToSearchInfo field and related code

Co-authored-by: Cursor <cursoragent@cursor.com>
…grams, and length bitset

The core pre-filtering logic for NavigateTo, stored per-document in the NavigateToSearchIndex.
Contains five bloom filters and a length bitset for fast document-level rejection:

- _humpCharFilter: individual uppercased hump-initial characters (e.g. 'G','B' for "GooBar")
- _humpBigramFilter: all C(k,2) ordered pairs of hump initials (e.g. "GB","GQ","BQ" for
  "GooBarQuux"), enabling non-contiguous CamelCase matching
- _humpPrefixFilter: lowercased prefixes of each hump (e.g. "g","go","goo","b","ba","bar"
  for "GooBar"), used by a DP algorithm for all-lowercase patterns
- _trigramFilter: 3-char sliding windows for LowercaseSubstring matching (e.g. "line" in
  "Readline")
- _containerFilter: hump chars from fully-qualified container names
- _symbolNameLengthBitset: 64-bit bitset for fuzzy match length pre-filtering

For all-lowercase patterns, a DP algorithm splits the pattern into segments that each match
a stored hump prefix, avoiding the exponential capitalization enumeration.

Includes extensive documentation with examples throughout.

Co-authored-by: Cursor <cursoragent@cursor.com>
Integrate the lightweight NavigateToSearchIndex pre-filter into both the in-process and
cached document search paths:

- InProcess: load the filter index first via NavigateToSearchIndex.GetRequiredIndexAsync,
  call CouldContainNavigateToMatch to decide whether to load the full TopLevelSyntaxTreeIndex.
  Pass the allowFuzzyMatching signal from the filter to the PatternMatcher.
- CachedDocumentSearch: add s_cachedFilterIndexMap for the filter index, load it before the
  full index in the parallel ForEachAsync loop.
- Add LowercaseSubstring -> Fuzzy mapping to s_kindPairs.

Co-authored-by: Cursor <cursoragent@cursor.com>
~1030 lines of declarative theory-based tests covering all match kinds and pre-filter
behaviors. Organized into regions:

- Positive tests: verify CouldContainNavigateToMatch returns true for all supported
  PatternMatchKinds (Exact, Prefix, CamelCaseExact/Prefix/Substring, NonContiguous
  variants, StartOfWordSubstring, LowercaseSubstring, Fuzzy) with both mixed-case
  and all-lowercase patterns
- Negative tests: verify rejection when hump chars, bigrams, trigrams, and lengths
  don't match
- CrossHumpSubstring: documents NonLowercaseSubstring as intentionally not guaranteed
- Multiple symbols: document-level filter matches any symbol
- Fuzzy/non-fuzzy split: verify allowFuzzyMatching output signal
- Individual filter checks via TestAccessor (hump, DP, trigram, length)
- Container matching tests

Co-authored-by: Cursor <cursoragent@cursor.com>
End-to-end tests verifying that the NavigateToSearchIndex pre-filter correctly allows
matches through the full NavigateTo pipeline:

- CamelCase hump bigram (GB -> GooBar)
- All-lowercase DP hump prefix (goo -> GooBar)
- Trigram substring (line -> Readline)
- Fuzzy match enabled by length check (ToEror -> ToError)
- No match when all checks fail (XyzXyzXyzXyz)

Co-authored-by: Cursor <cursoragent@cursor.com>
BenchmarkDotNet benchmarks for measuring NavigateToSearchIndex pre-filter performance
across various pattern types (CamelCase, all-lowercase, trigram, fuzzy, container-qualified).

Co-authored-by: Cursor <cursoragent@cursor.com>
…th thresholds

The existing fuzzy pre-filter used only a symbol-name-length bitset with a
fixed ±2 delta, which was too coarse — in large files, many symbols share
similar lengths, causing false positives that force expensive full-scan
fuzzy matching.

This commit improves the pre-filter in three ways:

1. Fix LengthCheckPasses to use WordSimilarityChecker.GetThreshold (±1 for
   pattern lengths 3–4, ±2 for 5+) and reject patterns < MinFuzzyLength (3),
   matching the actual fuzzy matching behavior.

2. Add a 37×37 exact bigram bitset (176 bytes per document) storing all
   lowercased 2-character sliding windows of symbol names. At query time,
   use Ukkonen's q-gram count lemma to compute a minimum shared bigram
   count: min_shared = |pattern| - 1 - 2k. If fewer pattern bigrams match,
   fuzzy matching is skipped for that document.

3. Extract WordSimilarityChecker.GetThreshold(int) overload and MinFuzzyLength
   constant so both the pre-filter and the actual fuzzy matcher share the
   same threshold logic.

Reference: Ukkonen, E. (1992). "Approximate string-matching with q-grams
and maximal matches." Theoretical Computer Science, 92(1), 191–211.
https://doi.org/10.1016/0304-3975(92)90143-4

Co-authored-by: Cursor <cursoragent@cursor.com>
@dotnet-policy-service dotnet-policy-service bot added Community The pull request was submitted by a contributor who is not a Microsoft employee. VSCode labels Feb 19, 2026
@CyrusNajmabadi CyrusNajmabadi force-pushed the features/tighten-fuzzy-threshold branch 2 times, most recently from 5293f5b to 978b7f2 Compare February 19, 2026 13:20
Change GetThreshold from `length <= 4 ? 1 : 2` to `length <= 5 ? 1 : 2`,
aligning with Elasticsearch/Lucene's fuzziness:AUTO tiers (0 edits for 1-2
chars, 1 edit for 3-5 chars, 2 edits for 6+). This eliminates the bigram
pre-filter blind spot at length 5 where min_shared was 0.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area-IDE Community The pull request was submitted by a contributor who is not a Microsoft employee. VSCode

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant

Comments