
fix: backtracking in CSS tokenizer rules #3626

Merged
flavorjones merged 3 commits into main from regex-backtracking-redos on Apr 27, 2026

@flavorjones
Member

What problem is this PR intended to solve?

Addresses three regular expression backtracking (ReDoS) issues in the CSS tokenizer rules.

ref: GHSA-c4rq-3m3g-8wgx

Have you included adequate test coverage?

Yes, added benchmark suite tests to assert linearity of regex performance.

Does this change affect the behavior of either the C or the Java implementations?

N/A

@flavorjones flavorjones force-pushed the regex-backtracking-redos branch from d9c535d to 82f06ea Compare April 27, 2026 15:04
The STRING rule had two ambiguities that caused exponential
backtracking on unterminated quoted-string input:

1. The body's negated class `[^\n\r\f"]` matched a literal `\`,
   overlapping with the {escape} branch. Input like
   `[foo="\a\a\a...` had 2**N parses for N pairs.

2. {unicode}'s `[0-9A-Fa-f]{1,6}` admitted six match lengths
   per escape position. Input like `\aaaaaa\aaaaaa...` had
   6**N parses.

When the closing quote was missing the engine enumerated every
parse before failing, so a sub-100-byte payload could hang the
process indefinitely.
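
The parse counts above can be checked with a small dynamic-programming sketch. The patterns below are simplified stand-ins, not the tokenizer's actual rules, and `count_parses` is a hypothetical helper. Each ambiguity is isolated on its own so the counts line up with the 2**N and 6**N figures.

```ruby
# Simplified stand-ins for the body alternatives (assumptions, not the
# real tokenizer rules). Each regex describes one way to consume input.
LOOSE_CLASS  = /\A[^\n\r\f"]/    # pre-fix style class: also accepts a backslash
STRICT_CLASS = /\A[^\n\r\f\\"]/  # class with the backslash excluded
HEX_ESCAPES  = (1..6).map { |k| /\A\\[0-9A-Fa-f]{#{k}}/ } # one regex per hex length

# Count the distinct ways the alternatives can consume the whole input.
def count_parses(input, alternatives)
  ways = Array.new(input.length + 1, 0)
  ways[0] = 1
  input.length.times do |i|
    next if ways[i].zero?

    rest = input[i..]
    alternatives.each do |alt|
      matched = rest[alt]
      ways[i + matched.length] += ways[i] if matched
    end
  end
  ways[input.length]
end

# Ambiguity 1: "\a" is either one escape or two class characters.
pair_parses = count_parses('\a' * 10, [LOOSE_CLASS, *HEX_ESCAPES])      # => 1024

# Ambiguity 2: the escape may take 1 to 6 hex digits, the class takes the rest.
run_parses = count_parses('\aaaaaa' * 3, [STRICT_CLASS, *HEX_ESCAPES])  # => 216
```

A backtracking engine that cannot find the closing quote has to rule out every one of these parses, which is where the exponential cost comes from.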

The fix:

- Excludes `\` from the body's negated class, so backslashes
  can only enter via {escape}, removing the cross-branch
  ambiguity.
- Wraps the body alternation in an atomic group `(?>...)*` to
  lock each iteration's match decision, removing the
  within-escape length ambiguity.
- Adds `\\?{nl}` for CSS line continuation, previously absorbed
  by the loose negated class.
- Drops the `(?<!\\)(?:\\{2})*` bookkeeping that existed only
  to recover from the original ambiguity.
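
A minimal sketch of the first three changes, using hypothetical simplified patterns rather than the tokenizer's real macros (the constant names and the exact escape grammar are assumptions):

```ruby
# Hypothetical, simplified STRING patterns; the real tokenizer macros differ.

# Before: the negated class also accepts "\", overlapping with the escape branch.
AMBIGUOUS_STRING =
  /"(?:[^\n\r\f"]|\\(?:[0-9A-Fa-f]{1,6}[ \t\r\n\f]?|[^\n\r\f0-9A-Fa-f]))*"/

# After: "\" excluded from the class, an explicit line-continuation branch,
# and an atomic group so each iteration's choice is final.
HARDENED_STRING =
  /"(?>[^\n\r\f\\"]|\\(?:\r\n|[\n\r\f])|\\(?:[0-9A-Fa-f]{1,6}[ \t\r\n\f]?|[^\n\r\f0-9A-Fa-f]))*"/

escaped_quote = '"a\"b"' # six characters: " a \ " b "

# In this simplified form the loose class swallows the backslash and the
# string appears to end at the escaped quote.
loose_match = AMBIGUOUS_STRING.match(escaped_quote)[0]   # => "\"a\\\""

# With "\" only reachable through the escape branch, the whole string matches.
strict_match = HARDENED_STRING.match(escaped_quote)[0]   # => "\"a\\\"b\""

# Line continuation: a backslash followed by a newline inside the string.
continued = HARDENED_STRING.match?("\"a\\\nb\"")         # => true

# Unterminated input shaped like the payload above fails without blowup.
unterminated = '[foo="' + '\a' * 5_000
rejected = !HARDENED_STRING.match?(unterminated)         # => true
```

The ambiguous pattern is deliberately not run against the long payload; on engines without mitigation it is the case that hangs.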

Adds two performance benchmarks asserting linear parse time
for both ambiguity classes.

ref: GHSA-c4rq-3m3g-8wgx
A second instance of the same backtracking pattern: `{unicode}`'s
`[0-9A-Fa-f]{1,6}` admits six match lengths per escape position,
and {nmchar} appears under `*` in {name}. When the `{ident}\({w}`
rule fails (no `(` after an identifier-shaped prefix), the engine
backtracks through `{nmchar}*` for 6**N parses. Payload
`\aaaaaa\aaaaaa...X` triggers it: at n=8 it takes 330ms, at n=10
it takes 11.4s.

Wrap the body alternations of {nmchar} and {nmstart} in atomic
groups, mirroring the prior STRING-rule fix. Each nmchar/nmstart
match is locked once committed, so the outer `{nmchar}*` can
release whole iterations but cannot try alternative inner
consumption of the {1,6} hex run.
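
The locking behaviour can be seen in isolation with a simplified `{nmchar}`. This is a sketch under stated assumptions: the patterns below are hypothetical reductions with only two alternatives, not the tokenizer's real macros.

```ruby
# Hypothetical, simplified {nmchar}; the real macro has more alternatives.
NMCHAR_LOOSE  = /(?:[_a-zA-Z0-9-]|\\[0-9A-Fa-f]{1,6})/
NMCHAR_ATOMIC = /(?>[_a-zA-Z0-9-]|\\[0-9A-Fa-f]{1,6})/

# Stand-ins for the `{ident}\({w}` rule: name characters, then "(".
FUNCTION_LOOSE  = /\A#{NMCHAR_LOOSE}*\(/
FUNCTION_ATOMIC = /\A#{NMCHAR_ATOMIC}*\(/

one_escape = '\aaaaaa' # a backslash followed by six hex digits

# The plain group can hand one hex digit back so the trailing "a" matches.
gives_back = /\A#{NMCHAR_LOOSE}a\z/.match?(one_escape)   # => true

# The atomic group keeps all six digits, so nothing is left for the "a".
stays_locked = !/\A#{NMCHAR_ATOMIC}a\z/.match?(one_escape) # => true

# Well-formed input is still accepted by both variants.
accepts = FUNCTION_ATOMIC.match?('\aaaaaa' * 3 + '(')    # => true

# The failing payload shape is rejected by releasing whole iterations only.
payload = '\aaaaaa' * 2_000 + 'X'
rejects = !FUNCTION_ATOMIC.match?(payload)               # => true
```

Only the atomic variant is run against the long payload; the loose variant is the one the commit message reports taking 11.4s at n=10.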

Add a benchmark test asserting linear time, similar to the ones added for the STRING rule.

ref: GHSA-c4rq-3m3g-8wgx
@flavorjones flavorjones force-pushed the regex-backtracking-redos branch from 82f06ea to 9bada21 Compare April 27, 2026 15:05
JRuby's JIT warmup variance makes per-call timings too noisy for
the R**2 >= 0.99 linearity assertion. Observed CI failures with
R**2 around 0.94-0.97 even though the regex itself is unchanged
between engines.

The ReDoS property is determined by the regex, not the engine
(Joni and Onigmo implement the same matching semantics), so MRI
coverage is sufficient evidence the fixes hold.
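
For readers unfamiliar with the R**2 check, here is a rough sketch of how a linearity assertion of this kind can be computed. It is not the benchmark code from this PR: `r_squared`, `seconds_to_match`, and `HARDENED_BODY` are hypothetical names, and the regex is a reduced stand-in.

```ruby
# Coefficient of determination for a straight-line least-squares fit.
def r_squared(xs, ys)
  n = xs.length.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n
  sxy = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  sxx = xs.sum { |x| (x - mean_x)**2 }
  syy = ys.sum { |y| (y - mean_y)**2 }
  (sxy**2) / (sxx * syy)
end

# Wall-clock time for a single match attempt (noisy by nature).
def seconds_to_match(regex, input)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  regex.match?(input)
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

# Synthetic data: a straight line fits perfectly, 6**n growth does not.
steps = [1, 2, 3, 4, 5, 6, 7, 8]
linear_fit      = r_squared(steps, steps.map { |n| 3.0 * n + 1 })
exponential_fit = r_squared(steps, steps.map { |n| 6.0**n })

# Real timings against a reduced, hypothetical hardened pattern. The value
# varies run to run, which is the noise problem described above.
HARDENED_BODY = /"(?>[^\n\r\f\\"]|\\[0-9A-Fa-f]{1,6})*"/
input_sizes = [2_000, 4_000, 8_000, 16_000]
timings = input_sizes.map { |n| seconds_to_match(HARDENED_BODY, '"' + '\a' * n) }
measured_fit = r_squared(input_sizes, timings)
```

A benchmark would then assert `measured_fit >= 0.99`. Per-call timing jitter pulls that value down even when the underlying growth is linear, so a noisy runtime can fail the threshold without any regression in the regex.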
@flavorjones flavorjones merged commit 52feb61 into main Apr 27, 2026
169 checks passed
@flavorjones flavorjones deleted the regex-backtracking-redos branch April 27, 2026 17:35
flavorjones added a commit that referenced this pull request Apr 27, 2026
**What problem is this PR intended to solve?**

Backport of #3626 to v1.19.x