Skip to content

Bounded Lookaround Implementation - Proof of Concept #585

@furkankoykiran

Description

@furkankoykiran

Hi RE2 maintainers and community,

I've been following the lookaround feature request (#156) and understand the design principles behind RE2, particularly regarding constructs that require backtracking.

I'd like to present a proof-of-concept implementation that attempts to address the core concerns while providing bounded lookaround functionality.

Goals

  1. Maintain O(n) time complexity - No exponential worst cases
  2. Respect RE2's design philosophy - No backtracking required
  3. Provide practical utility - Cover most common lookaround use cases
  4. Stay true to RE2's safety guarantees - Bounded memory, predictable performance

Implementation Approach

Bounded Lookaround:

  • Lookbehind: Limited to 255 characters backward (configurable at compile time)
  • Lookahead: Naturally bounded by remaining text length
  • Each lookaround compiles to a separate subprogram
  • Execution integrated into NFA engine (no DFA support)

Time Complexity: O(n × m × k) where:

  • n = input length
  • m = number of lookaround assertions
  • k = bounded subpattern complexity (max 255 chars)

Still linear in input length, but with higher constant factors.

Architecture:

Pattern: (?<!pre)main(?=post)
         ↓
Compilation: MainProg + SubProg[0](pre) + SubProg[1](post)
         ↓
Execution: NFA checks subprogs at lookaround instructions

Implementation Details

Fork: https://github.com/furkankoykiran/re2

Documentation: EXPERIMENTAL_FEATURES.md

Key Changes:

  • re2/parse.cc: +96 lines for parsing (?<!...), (?<=...), (?!...), (?=...)
  • re2/prog.h: New instruction types kInstLookBehind, kInstLookAhead
  • re2/compile.cc: Subprogram compilation with 255-char bound
  • re2/nfa.cc: +67 lines for lookaround execution in NFA
  • re2/regexp.h: +4 new RegexpOp enum values

Commit: furkankoykiran@6eab659

What Works

// Basic patterns
RE2("(?<!\\d)test");           // Match "test" not after digit
RE2("(?<=hello )world");       // Match "world" after "hello "
RE2("test(?!\\d)");            // Match "test" not before digit
RE2("test(?=\\d)");            // Match "test" before digit

// Complex patterns
RE2("(?=.*[0-9])(?=.*[a-z]).{8,}");  // Password validation
RE2("(?<!\\w)foo(?!\\w)");           // Word boundaries

Limitations

  1. Lookbehind bounded to 255 chars - Fundamental for O(n) guarantee
  2. No DFA support - Patterns with lookaround always use NFA
  3. Higher constant factors - 2-5x slower than equivalent non-lookaround patterns
  4. No OnePass optimization - Submatch extraction may be slower

Questions for Maintainers

  1. Is bounded lookaround philosophically acceptable?

    • It maintains O(n) time complexity
    • The bound (255 chars) covers 99% of practical cases
    • Trade-off: Limited but safe vs. not available at all
  2. Would upstream consider this approach?

    • Even as an opt-in experimental feature?
    • With clear documentation of limitations?
    • Perhaps with a different name (e.g., "bounded-lookaround")?
  3. Technical feedback welcome:

    • Is the 255-char bound reasonable?
    • Are there edge cases I'm missing?
    • Better implementation approaches?

Community Interest

Looking at #156:

  • 61 👍 reactions on the feature request
  • 96 👎 reactions on rejection comment
  • Multiple users citing real use cases
  • Several "+1" comments over the years

This suggests significant community interest, though I understand that popularity doesn't override design principles.

My Position

I'm not trying to argue that RE2 maintainers were wrong to reject unbounded lookaround. The concerns about exponential complexity are valid and important.

Instead, I'm asking: Could bounded lookaround be a middle ground?

  • Maintains safety guarantees ✅
  • Maintains O(n) complexity ✅
  • Covers most practical use cases ✅
  • Higher constant factors (documented trade-off)
  • Limited lookbehind range (255 chars)

Looking for Feedback

This is a proof of concept. I'm seeking:

  1. Technical review: Is the implementation sound?
  2. Design feedback: Better approaches to achieve this?
  3. Philosophical input: Does bounded lookaround fit RE2's goals?
  4. Community input: Would this be useful despite limitations?

References

Thank You

Thank you for creating and maintaining RE2. It's an amazing library and the design principles are admirable. Even if this approach isn't suitable for upstream, I hope the discussion is valuable.


Note: This is posted as an issue to gather feedback, not as a pull request. I understand this may not be suitable for upstream, and that's okay. The fork will continue as an experimental branch either way.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions