Consider transforming regexes for better finite automaton compatibility #91

@masklinn

I understand that Go's regexp package uses finite automata. Working on uap-rust, I learned that uap-core's extensive use of large bounded repetitions (`.{x,y}` with a large y) causes a state explosion in FA engines (which makes sense). You can see some of the discussion at rust-lang/regex#1206 (reply in thread), where I asked the rust-regex author for assistance while investigating the large memory use of my rust-regex-based implementation compared to a C++ re2 implementation.

Assuming Go's regexp is inspired by or derived from re2, it should be much less sensitive to this than rust-regex, and should not have the performance traps / cliffs of a JITted DFA, but I would still expect the bounded repetitions to significantly increase memory use for no benefit.
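
To make that concrete on the Go side, the compiled program size already shows the effect, since counted repetitions get unrolled during simplification. This is only a rough sketch (`progSize` is a throwaway helper, not anything from this repo), but it illustrates the growth:

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// progSize parses and compiles a pattern with regexp/syntax and returns
// the number of instructions in the compiled program — a rough proxy for
// how much memory the finite-automaton engine needs for that pattern.
func progSize(pattern string) int {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		panic(err)
	}
	prog, err := syntax.Compile(re.Simplify())
	if err != nil {
		panic(err)
	}
	return len(prog.Inst)
}

func main() {
	// Simplify unrolls the counted repetition, so the program size
	// grows roughly linearly with the upper bound.
	fmt.Println(progSize(`foo.{0,100}bar`)) // hundreds of instructions
	fmt.Println(progSize(`foo.*bar`))       // a handful of instructions
}
```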

Experimentally, in uap-rust I found that converting repetitions with an upper bound of 100 or more (so 3+ digits) had no functional impact; all of those were added to mitigate catastrophic backtracking. Below 100 things are riskier: I found at least one regex that uses a functionally meaningful repetition with an upper bound of 50.
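
So a conservative transformation would be to relax only the upper bounds with 3+ digits, rewriting `{x,y}` into the open-ended `{x,}`. A naive sketch of what that could look like (`boundedRep` and `relaxLargeBounds` are hypothetical names, and a real pass would also need to skip braces inside character classes and escaped braces):

```go
package main

import (
	"fmt"
	"regexp"
)

// boundedRep matches counted repetitions {x,y} whose upper bound has
// three or more digits, e.g. {0,100} or {1,300}.
var boundedRep = regexp.MustCompile(`\{(\d+),\d{3,}\}`)

// relaxLargeBounds rewrites {x,y} with y >= 100 into the open-ended {x,},
// on the assumption (per the observation above) that such bounds only
// exist to mitigate catastrophic backtracking in backtracking engines.
func relaxLargeBounds(pattern string) string {
	return boundedRep.ReplaceAllString(pattern, `{$1,}`)
}

func main() {
	fmt.Println(relaxLargeBounds(`foo.{0,200}bar;?baz{1,50}`))
	// foo.{0,}bar;?baz{1,50}   (the 2-digit bound is left alone)
}
```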
