Hello!
I am someone interested in regexes, so pomsky caught my eye. This project has a lot of good ideas (I especially like ranges), but there is one thing that really bugs me: How portable is pomsky?
Pomsky advertises itself as "a portable, modern regular expression language", but I don't see it as very portable.
Examples:
- Predefined character classes are often defined very differently among languages. E.g. `\s` is defined as `[ \t\n\x0B\f\r]` or `[\p{IsWhite_Space}]` = `[\x09-\x0d \x85\xa0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000]` in Java and as `[ \f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]` in JavaScript. However, pomsky doesn't address those differences. `[s]` compiles to `\s` for both JavaScript and Java.
- `%` compiles to `\b` in JavaScript, which is a huge issue because it is not Unicode-aware. Documenting this doesn't make it more portable :(
- Forward references behave differently in different languages. Here is how the same regex behaves in Python and JavaScript:

      >>> r = re.compile(r"^(?:(a)|\1\1){2}$")
      >>> r.fullmatch("a") is not None
      False
      >>> r.fullmatch("aa") is not None
      True
      >>> r.fullmatch("aaa") is not None
      True

      > r = /^(?:(a)|\1\1){2}$/
      > r.test("a")
      true
      > r.test("aa")
      true
      > r.test("aaa")
      false

  This is because Python and JavaScript have different semantics for how capturing groups capture text. In JavaScript, entering a group (capturing or not) resets the captured text of all capturing groups in it. So when `(?:(a)|\1\1)` is entered in the second iteration, the captured text of `(a)` is reset, which is why it does not accept `"aaa"`. Python, on the other hand, only resets the captured text of a capturing group after the capturing group has captured a different part of the string.

  Python and JavaScript also differ in the semantics of backreferences in one edge case. If a backreference references a group that has no captured text (either because it has never captured text or because its captured text was reset), then the backreference will always reject (being equivalent to `(?!)`) in Python, and it will always accept (being equivalent to `(?:)`) in JavaScript.
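The backreference edge case can be checked directly on the Python side. This is a minimal sketch: the pattern `(?:(a)|b)\1` is a made-up example (not from pomsky), chosen so that group 1 has no captured text whenever the `b` branch is taken.

```python
import re

# Hypothetical pattern: when the "b" branch matches, group 1 never
# participates, so the trailing \1 refers to a group with no captured text.
pattern = re.compile(r"(?:(a)|b)\1")

# Python: a backreference to an unset group always rejects.
unset_backref_matches = pattern.fullmatch("b") is not None

# Ordinary case for comparison: group 1 captured "a", so \1 must match "a".
set_backref_matches = pattern.fullmatch("aa") is not None

print(unset_backref_matches, set_backref_matches)  # False True
```

In JavaScript, `/^(?:(a)|b)\1$/.test("b")` returns `true`, because the unset backreference matches the empty string there.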
Since pomsky doesn't seem to guarantee that regexes behave the same in different languages, what does portability mean for pomsky?
I am concerned about portability because semantic differences can lead to security vulnerabilities.
E.g. `( [s]{2} | [ U+feff ] )+ $` will be vulnerable to exponential backtracking in some target languages but not others: where `\s` also matches U+FEFF (as in JavaScript), the two alternatives overlap, and where it does not (as in Java), they are disjoint. So if a developer uses a static analyzer or fuzzer to verify absence of exponential backtracking in one target language, they might assume that other languages are safe too.
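To make the overlap concrete, here is a rough sketch that counts how many different ways the repeated group can consume a run of U+FEFF characters. It does not run any regex engine. The function name `count_parses` and the boolean flag are illustrative assumptions, and the count is only a proxy for the work a backtracking engine may do when the final `$` fails.

```python
def count_parses(run_length: int, s_matches_feff: bool) -> int:
    """Count ways the group ([s]{2} | [U+feff])+ can consume a run of U+FEFF."""
    # ways[i] = number of ways to consume exactly i characters of the run
    ways = [0] * (run_length + 1)
    ways[0] = 1
    for i in range(1, run_length + 1):
        ways[i] = ways[i - 1]          # [ U+feff ] consumes one character
        if s_matches_feff and i >= 2:
            ways[i] += ways[i - 2]     # [s]{2} consumes two characters
    return ways[run_length]

# Assumption: "JS-like" means \s matches U+FEFF, "Java-like" means it does not.
js_like = [count_parses(n, True) for n in (10, 20, 30)]
java_like = [count_parses(n, False) for n in (10, 20, 30)]

print(js_like)    # [89, 10946, 1346269]
print(java_like)  # [1, 1, 1]
```

The JS-like counts grow like the Fibonacci numbers, while the Java-like count stays at one, which is why an analysis done against one target says little about the other.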
Of course, semantic differences can cause problems in other ways too. E.g. if pomsky-generated regexes are used for input validation, then the JavaScript frontend might correctly filter out bad inputs while the Java backend lets them through.
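A small sketch of such a mismatch, emulated in Python with explicit character classes standing in for each engine's `\s` (based on the definitions quoted earlier). The "no whitespace allowed" rule and the names `JAVA_DEFAULT_S`, `JS_S`, and `no_whitespace_validator` are hypothetical, chosen only for illustration.

```python
import re

# Explicit stand-ins for each engine's \s, based on the definitions quoted above.
JAVA_DEFAULT_S = r" \t\n\x0b\f\r"
JS_S = r" \f\n\r\t\x0b\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff"

def no_whitespace_validator(space_class_body: str) -> re.Pattern:
    # Hypothetical validation rule: accept only inputs free of whitespace.
    return re.compile(f"[^{space_class_body}]+")

js_like = no_whitespace_validator(JS_S)
java_like = no_whitespace_validator(JAVA_DEFAULT_S)

suspicious = "admin\ufeff"  # trailing U+FEFF (zero-width no-break space)
frontend_accepts = js_like.fullmatch(suspicious) is not None
backend_accepts = java_like.fullmatch(suspicious) is not None

print(frontend_accepts, backend_accepts)  # False True
```

The same source expression rejects the input under the JS-like class and accepts it under the Java-like one.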
Assuming that portability means "the generated regexes behave the same in all target languages", the above examples could be addressed as follows:
- Languages where predefined character classes don't match the semantics of pomsky need to be "polyfilled". E.g. `\s` can be emulated with the right character class across languages. If people actually want the `\s` of their respective target language, then they can use `regex \s`.
- `%` should just behave the same in all target languages. It's unfortunate that JS regexes will be quite long, but long regexes are better than wrong regexes.
- This is a really difficult problem. Making capturing groups and backreferences behave consistently across languages would likely require adding restrictions to how they can be used. This can be done via static analysis (example), but the additional restrictions might annoy some users.
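The polyfill idea from the first bullet could look roughly like the sketch below. It is not how pomsky works today. `WHITE_SPACE_RANGES` and `explicit_class` are hypothetical names, and the code point list is the Unicode White_Space set quoted earlier for Java's `\p{IsWhite_Space}`. The output uses only `\uXXXX` escapes, which Java, JavaScript, and Python all accept inside a character class.

```python
import re

# Code points with the Unicode White_Space property, as listed earlier
# in this issue for Java's \p{IsWhite_Space}.
WHITE_SPACE_RANGES = [
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085), (0x00A0, 0x00A0),
    (0x1680, 0x1680), (0x2000, 0x200A), (0x2028, 0x2029), (0x202F, 0x202F),
    (0x205F, 0x205F), (0x3000, 0x3000),
]

def explicit_class(ranges):
    """Render code point ranges as a character class built from 4-digit hex escapes."""
    parts = []
    for lo, hi in ranges:
        parts.append(f"\\u{lo:04x}")
        if hi != lo:
            parts.append(f"-\\u{hi:04x}")
    return "[" + "".join(parts) + "]"

portable_space = explicit_class(WHITE_SPACE_RANGES)
print(portable_space)

# Sanity check in Python: matches a no-break space, rejects U+FEFF.
matches_nbsp = re.fullmatch(portable_space, "\u00a0") is not None
matches_feff = re.fullmatch(portable_space, "\ufeff") is not None
```

Emitting the same explicit class for every target trades shorter output for identical behavior, which is the "long regexes are better than wrong regexes" trade-off described above.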