Skip to content

Skip finding subcaptures when position is known from top level match in RegexOptions.NonBacktracking #68783

Open
@olsaarik

Description

@olsaarik

The match generation algorithm for RegexOptions.NonBacktracking uses a third pass over the input whenever the pattern has subcaptures and the user call need to know them. This third pass is an NFA simulation with additional bookkeeping to record subcapture starts and ends, and as such is quite a bit slower than the first two passes.

For a patterns like (fo+)|(bar) the third pass could be skipped by detecting which subpattern, fo+ or bar, actually matched, since in both cases only one subcapture matches and also coincides with the start and end of the whole pattern (which are found by the second and first pass, respectively). This optimization could also apply to patterns with fixed-length prefixes like abc((fo+)|(bar)).

The likely way to implement this would be to add something similar to the markers appended to fixed-length paths in patterns to skip the second pass (e.g. in abc((fo+)|(bar)) the bar option would be marked to be length 6). The new markers would indicate "if you end the match in this state, then all relevant subcapture positions are known and they occur at these offsets from the top level start and end positions".

We know at least one important use case for this optimization where the customer has a large corpus of patterns and they want to find if any of them match and if so which ones. Using a big alternation of named captures, like (?<name1>R1)|(?<name2>R2)|(?<name3>R3)|... together with RegexOptions.ExplicitCapture they can match all patterns together (getting great performance with the automata based engine of RegexOptions.NonBacktracking) and still find which pattern is the matching one for further processing.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions