-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Description
Hi RE2 maintainers and community,
I've been following the lookaround feature request (#156) and understand the design principles behind RE2, particularly regarding constructs that require backtracking.
I'd like to present a proof-of-concept implementation that attempts to address the core concerns while providing bounded lookaround functionality.
Goals
- Maintain O(n) time complexity - No exponential worst cases
- Respect RE2's design philosophy - No backtracking required
- Provide practical utility - Cover most common lookaround use cases
- Stay true to RE2's safety guarantees - Bounded memory, predictable performance
Implementation Approach
Bounded Lookaround:
- Lookbehind: Limited to 255 characters backward (configurable at compile time)
- Lookahead: Naturally bounded by remaining text length
- Each lookaround compiles to a separate subprogram
- Execution integrated into NFA engine (no DFA support)
Time Complexity: O(n × m × k) where:
- n = input length
- m = number of lookaround assertions
- k = bounded subpattern complexity (max 255 chars)
Still linear in input length, but with higher constant factors.
Architecture:
Pattern: (?<!pre)main(?=post)
↓
Compilation: MainProg + SubProg[0](pre) + SubProg[1](post)
↓
Execution: NFA checks subprogs at lookaround instructions
Implementation Details
Fork: https://github.com/furkankoykiran/re2
Documentation: EXPERIMENTAL_FEATURES.md
Key Changes:
re2/parse.cc: +96 lines for parsing(?<!...),(?<=...),(?!...),(?=...)re2/prog.h: New instruction typeskInstLookBehind,kInstLookAheadre2/compile.cc: Subprogram compilation with 255-char boundre2/nfa.cc: +67 lines for lookaround execution in NFAre2/regexp.h: +4 new RegexpOp enum values
Commit: furkankoykiran@6eab659
What Works
// Basic patterns
RE2("(?<!\\d)test"); // Match "test" not after digit
RE2("(?<=hello )world"); // Match "world" after "hello "
RE2("test(?!\\d)"); // Match "test" not before digit
RE2("test(?=\\d)"); // Match "test" before digit
// Complex patterns
RE2("(?=.*[0-9])(?=.*[a-z]).{8,}"); // Password validation
RE2("(?<!\\w)foo(?!\\w)"); // Word boundariesLimitations
- Lookbehind bounded to 255 chars - Fundamental for O(n) guarantee
- No DFA support - Patterns with lookaround always use NFA
- Higher constant factors - 2-5x slower than equivalent non-lookaround patterns
- No OnePass optimization - Submatch extraction may be slower
Questions for Maintainers
-
Is bounded lookaround philosophically acceptable?
- It maintains O(n) time complexity
- The bound (255 chars) covers 99% of practical cases
- Trade-off: Limited but safe vs. not available at all
-
Would upstream consider this approach?
- Even as an opt-in experimental feature?
- With clear documentation of limitations?
- Perhaps with a different name (e.g., "bounded-lookaround")?
-
Technical feedback welcome:
- Is the 255-char bound reasonable?
- Are there edge cases I'm missing?
- Better implementation approaches?
Community Interest
Looking at #156:
- 61 👍 reactions on the feature request
- 96 👎 reactions on rejection comment
- Multiple users citing real use cases
- Several "+1" comments over the years
This suggests significant community interest, though I understand that popularity doesn't override design principles.
My Position
I'm not trying to argue that RE2 maintainers were wrong to reject unbounded lookaround. The concerns about exponential complexity are valid and important.
Instead, I'm asking: Could bounded lookaround be a middle ground?
- Maintains safety guarantees ✅
- Maintains O(n) complexity ✅
- Covers most practical use cases ✅
- Higher constant factors (documented trade-off)
- Limited lookbehind range (255 chars)
Looking for Feedback
This is a proof of concept. I'm seeking:
- Technical review: Is the implementation sound?
- Design feedback: Better approaches to achieve this?
- Philosophical input: Does bounded lookaround fit RE2's goals?
- Community input: Would this be useful despite limitations?
References
- Original issue: Please support negative lookahead (?!) #156
- My fork: https://github.com/furkankoykiran/re2
- Full docs: EXPERIMENTAL_FEATURES.md
- Russ Cox's regex articles: https://swtch.com/~rsc/regexp/
Thank You
Thank you for creating and maintaining RE2. It's an amazing library and the design principles are admirable. Even if this approach isn't suitable for upstream, I hope the discussion is valuable.
Note: This is posted as an issue to gather feedback, not as a pull request. I understand this may not be suitable for upstream, and that's okay. The fork will continue as an experimental branch either way.