Separate fuzzy pattern matching into dedicated FuzzyPatternMatcher #82458
Draft
CyrusNajmabadi wants to merge 14 commits into dotnet:main
…e container matching always non-fuzzy Move _allowFuzzyMatching from the base PatternMatcher class to SimplePatternMatcher, since container matching is always non-fuzzy. Refactor MatchPatternChunk to try non-fuzzy first then fuzzy as a fallback, removing the two-pass pattern from AddMatches. Remove CreateContainerPatternMatcher (replaced by CreateDotSeparatedContainerMatcher which now accepts nullable pattern). This is prerequisite for the NavigateTo prefilter work where the prefilter tells the caller whether fuzzy matching is worth attempting. Co-authored-by: Cursor <cursoragent@cursor.com>
Mechanical update of all call sites following the PatternMatcher refactor:
- DeclarationFinder: explicit allowFuzzyMatching: false, includeMatchedSpans: false
- DocumentOutlineViewModel: add missing using on PatternMatcher
- PatternMatcherTests: explicit allowFuzzyMatching: false
Co-authored-by: Cursor <cursoragent@cursor.com>
Change StringBreaker's public API from string to ReadOnlySpan<char> parameters (AddWordParts, AddCharacterParts, AddParts, GenerateSpan) and all internal helpers. This enables span-based processing in the new NavigateTo prefilter code without allocating substrings. Co-authored-by: Cursor <cursoragent@cursor.com>
Add ComputeHash(ReadOnlySpan<char>) and GetCharacter(ReadOnlySpan<char>) overloads. Refactor ComputeHash(string) to delegate to the span overload. Simplify Add(char) and ProbablyContains(char) to delegate to the span overloads. Add doc remark on ProbablyContains(ReadOnlySpan<char>) explaining why it cannot share the caching optimization from the string overload. This enables allocation-free bloom filter operations in the NavigateTo prefilter. Co-authored-by: Cursor <cursoragent@cursor.com>
Move the NavigateTo pre-filter data out of TopLevelSyntaxTreeIndex into its own NavigateToSearchIndex (a new AbstractSyntaxIndex<NavigateToSearchIndex> subclass). This allows the lightweight filter data to be loaded independently of the heavyweight TopLevelSyntaxTreeIndex containing all declared symbols. Documents rejected by the filter never need to load the full index.
Bump serialization format checksum from "51" to "52" to invalidate stale indices.
- New: NavigateToSearchIndex.cs, _Create.cs, _Persistence.cs
- TopLevelSyntaxTreeIndex: remove _navigateToSearchInfo field and related code
Co-authored-by: Cursor <cursoragent@cursor.com>
…grams, and length bitset
The core pre-filtering logic for NavigateTo, stored per-document in the NavigateToSearchIndex. Contains five bloom filters and a length bitset for fast document-level rejection:
- _humpCharFilter: individual uppercased hump-initial characters (e.g. 'G','B' for "GooBar")
- _humpBigramFilter: all C(k,2) ordered pairs of hump initials (e.g. "GB","GQ","BQ" for "GooBarQuux"), enabling non-contiguous CamelCase matching
- _humpPrefixFilter: lowercased prefixes of each hump (e.g. "g","go","goo","b","ba","bar" for "GooBar"), used by a DP algorithm for all-lowercase patterns
- _trigramFilter: 3-char sliding windows for LowercaseSubstring matching (e.g. "line" in "Readline")
- _containerFilter: hump chars from fully-qualified container names
- _symbolNameLengthBitset: 64-bit bitset for fuzzy match length pre-filtering
For all-lowercase patterns, a DP algorithm splits the pattern into segments that each match a stored hump prefix, avoiding the exponential capitalization enumeration.
Includes extensive documentation with examples throughout.
Co-authored-by: Cursor <cursoragent@cursor.com>
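As a rough illustration of the hump-derived filter inputs and the all-lowercase DP described in this commit, here is a minimal Python sketch (not the PR's C# code). All function names are hypothetical, the hump splitter is a deliberate simplification of what StringBreaker does, and plain sets stand in for the bloom filters, so membership here is exact rather than probabilistic.

```python
from itertools import combinations


def split_humps(name):
    """Split a CamelCase identifier into humps, e.g. 'GooBar' -> ['Goo', 'Bar'].

    Simplified assumption: a hump starts at an uppercase letter that follows
    a non-uppercase character. Digits, underscores and acronyms are ignored.
    """
    humps, start = [], 0
    for i in range(1, len(name)):
        if name[i].isupper() and not name[i - 1].isupper():
            humps.append(name[start:i])
            start = i
    if name:
        humps.append(name[start:])
    return humps


def hump_initials(name):
    """Uppercased first character of each hump, in order."""
    return [h[0].upper() for h in split_humps(name)]


def hump_bigrams(name):
    """All C(k,2) ordered pairs of hump initials."""
    return {a + b for a, b in combinations(hump_initials(name), 2)}


def hump_prefixes(name):
    """Every lowercased non-empty prefix of every hump."""
    return {h[:n].lower() for h in split_humps(name) for n in range(1, len(h) + 1)}


def could_match_lowercase(pattern, prefixes):
    """DP over split points: reachable[i] is True when pattern[:i] can be cut
    into segments that are each a stored hump prefix."""
    reachable = [False] * (len(pattern) + 1)
    reachable[0] = True
    for end in range(1, len(pattern) + 1):
        reachable[end] = any(
            reachable[start] and pattern[start:end] in prefixes
            for start in range(end)
        )
    return reachable[len(pattern)]


prefixes = hump_prefixes("GooBar")
print(sorted(hump_bigrams("GooBarQuux")))
print(could_match_lowercase("goba", prefixes), could_match_lowercase("gx", prefixes))
```

The DP does O(n²) set lookups for a pattern of length n, instead of trying every possible capitalization of the pattern. In this sketch the segments may match stored prefixes in any order, so it can accept patterns that a full matcher would later reject; for a pre-filter that is a tolerable false positive.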
Integrate the lightweight NavigateToSearchIndex pre-filter into both the in-process and cached document search paths:
- InProcess: load the filter index first via NavigateToSearchIndex.GetRequiredIndexAsync, call CouldContainNavigateToMatch to decide whether to load the full TopLevelSyntaxTreeIndex. Pass the allowFuzzyMatching signal from the filter to the PatternMatcher.
- CachedDocumentSearch: add s_cachedFilterIndexMap for the filter index, load it before the full index in the parallel ForEachAsync loop.
- Add LowercaseSubstring -> Fuzzy mapping to s_kindPairs.
Co-authored-by: Cursor <cursoragent@cursor.com>
~1030 lines of declarative theory-based tests covering all match kinds and pre-filter behaviors. Organized into regions:
- Positive tests: verify CouldContainNavigateToMatch returns true for all supported PatternMatchKinds (Exact, Prefix, CamelCaseExact/Prefix/Substring, NonContiguous variants, StartOfWordSubstring, LowercaseSubstring, Fuzzy) with both mixed-case and all-lowercase patterns
- Negative tests: verify rejection when hump chars, bigrams, trigrams, and lengths don't match
- CrossHumpSubstring: documents NonLowercaseSubstring as intentionally not guaranteed
- Multiple symbols: document-level filter matches any symbol
- Fuzzy/non-fuzzy split: verify allowFuzzyMatching output signal
- Individual filter checks via TestAccessor (hump, DP, trigram, length)
- Container matching tests
Co-authored-by: Cursor <cursoragent@cursor.com>
End-to-end tests verifying that the NavigateToSearchIndex pre-filter correctly allows matches through the full NavigateTo pipeline:
- CamelCase hump bigram (GB -> GooBar)
- All-lowercase DP hump prefix (goo -> GooBar)
- Trigram substring (line -> Readline)
- Fuzzy match enabled by length check (ToEror -> ToError)
- No match when all checks fail (XyzXyzXyzXyz)
Co-authored-by: Cursor <cursoragent@cursor.com>
BenchmarkDotNet benchmarks for measuring NavigateToSearchIndex pre-filter performance across various pattern types (CamelCase, all-lowercase, trigram, fuzzy, container-qualified). Co-authored-by: Cursor <cursoragent@cursor.com>
…th thresholds
The existing fuzzy pre-filter used only a symbol-name-length bitset with a fixed ±2 delta, which was too coarse: in large files, many symbols share similar lengths, causing false positives that force expensive full-scan fuzzy matching.
This commit improves the pre-filter in three ways:
1. Fix LengthCheckPasses to use WordSimilarityChecker.GetThreshold (±1 for pattern lengths 3–4, ±2 for 5+) and reject patterns < MinFuzzyLength (3), matching the actual fuzzy matching behavior.
2. Add a 37×37 exact bigram bitset (176 bytes per document) storing all lowercased 2-character sliding windows of symbol names. At query time, use Ukkonen's q-gram count lemma to compute a minimum shared bigram count: min_shared = |pattern| - 1 - 2k, where k is the edit-distance threshold. If fewer pattern bigrams match, fuzzy matching is skipped for that document.
3. Extract WordSimilarityChecker.GetThreshold(int) overload and MinFuzzyLength constant so both the pre-filter and the actual fuzzy matcher share the same threshold logic.
Reference: Ukkonen, E. (1992). "Approximate string-matching with q-grams and maximal matches." Theoretical Computer Science, 92(1), 191–211. https://doi.org/10.1016/0304-3975(92)90143-4
Co-authored-by: Cursor <cursoragent@cursor.com>
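The bigram check in this commit can be illustrated with a small Python sketch (again, not the PR's C# code). The function names are hypothetical, a Python set stands in for the 37×37 bitset, and the threshold function simply restates the "±1 for lengths 3–4, ±2 for 5+" rule from the commit message. The intuition behind the lemma for q = 2: a pattern of length n has n - 1 bigrams and each edit can disturb at most 2 of them, so a string within k edits must still share at least n - 1 - 2k of them.

```python
MIN_FUZZY_LENGTH = 3  # patterns shorter than this are never fuzzy-matched


def get_threshold(pattern_length):
    """Allowed edit distance k: 1 for lengths 3-4, 2 for 5 and above."""
    return 1 if pattern_length <= 4 else 2


def bigrams(text):
    """Lowercased 2-character sliding windows, in order, duplicates kept."""
    lowered = text.lower()
    return [lowered[i:i + 2] for i in range(len(lowered) - 1)]


def build_bigram_set(symbol_names):
    """Per-document set of every bigram that occurs in any symbol name."""
    return {b for name in symbol_names for b in bigrams(name)}


def min_shared_bigrams(pattern_length, k):
    """q-gram count lemma for q = 2, clamped at zero."""
    return max(0, pattern_length - 1 - 2 * k)


def fuzzy_worth_trying(pattern, document_bigrams):
    """Return False when the document cannot contain a fuzzy match."""
    if len(pattern) < MIN_FUZZY_LENGTH:
        return False
    k = get_threshold(len(pattern))
    shared = sum(1 for b in bigrams(pattern) if b in document_bigrams)
    return shared >= min_shared_bigrams(len(pattern), k)


doc = build_bigram_set(["ToError"])
print(fuzzy_worth_trying("ToEror", doc))        # 5 shared bigrams, 1 required
print(fuzzy_worth_trying("XyzXyzXyzXyz", doc))  # 0 shared bigrams, 7 required
```

For very short patterns the bound is weak: with length 3 and k = 1 the required count is 0, so only the length check can reject. The bound starts to filter effectively as patterns get longer.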
Extract edit-distance (fuzzy) matching from the shared PatternMatcher pipeline into a standalone FuzzyPatternMatcher class. Previously the allowFuzzyMatching bool threaded through 7 locations (factory, constructor, PatternSegment, TextChunk, AddMatches, MatchPatternSegment, MatchPatternChunk). Now each matcher has a single responsibility:
- SimplePatternMatcher: exact, prefix, camelCase, substring (non-fuzzy)
- FuzzyPatternMatcher: edit-distance only (WordSimilarityChecker)
- ContainerPatternMatcher: dot-separated container matching (non-fuzzy)
Callers compose them: try non-fuzzy first, fuzzy only as fallback, with no redundant work. NavigateToSearchInfo now returns two independent signals (couldNonFuzzyMatch, couldFuzzyMatch) instead of a single bool.
Also refactors base PatternMatcher: AddMatches is now non-abstract and handles SkipMatch centrally, delegating to abstract AddMatchesWorker.
Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the two-boolean prefilter signal (couldNonFuzzyMatch, couldFuzzyMatch)
and manual matcher composition with a [Flags] enum PatternMatcherKind
{ None, Standard, Fuzzy } that flows from the prefilter through the factory.
CreatePatternMatcher now accepts a PatternMatcherKind parameter:
- Standard only -> SimplePatternMatcher
- Fuzzy only -> FuzzyPatternMatcher
- Standard | Fuzzy -> CompoundPatternMatcher (tries each in order,
short-circuits on first match)
CompoundPatternMatcher takes ReadOnlySpan<PatternMatcher> and manages an
internal ArrayBuilder, freed on dispose. Callers no longer need to manually
compose matchers — they pass the enum and get the right thing back.
NavigateToSearchIndex.CouldContainNavigateToMatch now returns
PatternMatcherKind instead of bool + two out params.
Co-authored-by: Cursor <cursoragent@cursor.com>
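The flags-to-matcher mapping described in this commit can be sketched as follows, in Python rather than the PR's C#. Everything here is illustrative: the matcher classes are toy stand-ins (substring test and plain Levenshtein distance), the names are hypothetical, and returning None for the empty flag set is an assumption, since the commit message does not say what the factory does in that case.

```python
from enum import Flag


class PatternMatcherKind(Flag):
    NONE = 0
    STANDARD = 1
    FUZZY = 2


def edit_distance(a, b):
    """Plain Levenshtein distance using a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


class SimpleMatcher:
    """Toy non-fuzzy matcher: case-insensitive substring test."""
    def __init__(self, pattern):
        self.pattern = pattern.lower()

    def match(self, candidate):
        return "standard" if self.pattern in candidate.lower() else None


class FuzzyMatcher:
    """Toy fuzzy matcher: edit distance within a length-based threshold."""
    def __init__(self, pattern):
        self.pattern = pattern.lower()

    def match(self, candidate):
        k = 1 if len(self.pattern) <= 4 else 2
        within = edit_distance(self.pattern, candidate.lower()) <= k
        return "fuzzy" if within else None


class CompoundMatcher:
    """Tries each sub-matcher in order and stops at the first match."""
    def __init__(self, matchers):
        self.matchers = list(matchers)

    def match(self, candidate):
        for matcher in self.matchers:
            result = matcher.match(candidate)
            if result is not None:
                return result
        return None


def create_pattern_matcher(pattern, kind):
    """Map the flag set to a single matcher, a compound, or None (assumed)."""
    matchers = []
    if PatternMatcherKind.STANDARD in kind:
        matchers.append(SimpleMatcher(pattern))
    if PatternMatcherKind.FUZZY in kind:
        matchers.append(FuzzyMatcher(pattern))
    if not matchers:
        return None
    return matchers[0] if len(matchers) == 1 else CompoundMatcher(matchers)


both = PatternMatcherKind.STANDARD | PatternMatcherKind.FUZZY
matcher = create_pattern_matcher("ToEror", both)
print(matcher.match("ToError"), matcher.match("ToErorHandler"))
```

Because the standard matcher is listed first, the fuzzy matcher only runs for candidates the standard one rejects, which mirrors the "non-fuzzy first, fuzzy as fallback" ordering described in the earlier commits.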
6df5a50 to cb3134a
Followup to #82457
Motivation
The allowFuzzyMatching boolean was threaded through 7 locations in the pattern matching pipeline, yet only 2 of 6 callers ever enabled it. ContainerPatternMatcher hardcoded it to false everywhere. The interleaving of fuzzy and non-fuzzy logic made the code harder to follow than necessary.
Design
Each matcher now has a single responsibility:
- SimplePatternMatcher: exact, prefix, camelCase, substring (non-fuzzy)
- FuzzyPatternMatcher (new): edit-distance only via WordSimilarityChecker
- CompoundPatternMatcher (new): composes sub-matchers, tries each in order, short-circuits on first match
- ContainerPatternMatcher: unchanged
A new [Flags] enum PatternMatcherKind { None, Standard, Fuzzy } controls which strategies to use. The factory returns the appropriate matcher (or compound) based on the flags. Callers just pass the enum.
NavigateToSearchIndex.CouldContainNavigateToMatch now returns PatternMatcherKind (instead of bool + two out-params), which flows directly into the factory.
The base PatternMatcher.AddMatches is now non-abstract and handles SkipMatch centrally; subclasses implement AddMatchesWorker.
Test plan
- NavigateToSearchIndexTests pass