fix: backtracking in CSS tokenizer rules (v1.19.x backport) by flavorjones · Pull Request #3627 · sparklemotion/nokogiri

flavorjones · 2026-04-27T16:53:36Z

What problem is this PR intended to solve?

Backport of #3626 to v1.19.x

The STRING rule had two ambiguities that caused exponential backtracking on unterminated quoted-string input:

1. The body's negated class `[^\n\r\f"]` matched a literal `\`, overlapping with the {escape} branch. Input like `[foo="\a\a\a...` had 2**N parses for N pairs.
2. {unicode}'s `[0-9A-Fa-f]{1,6}` admitted six match lengths per escape position. Input like `\aaaaaa\aaaaaa...` had 6**N parses.

When the closing quote was missing the engine enumerated every parse before failing, so a sub-100-byte payload could hang the process indefinitely.

The fix:

- Excludes `\` from the body's negated class, so backslashes can only enter via {escape}, removing the cross-branch ambiguity.
- Wraps the body alternation in an atomic group `(?>...)*` to lock each iteration's match decision, removing the within-escape length ambiguity.
- Adds `\\?{nl}` for CSS line continuation, previously absorbed by the loose negated class.
- Drops the `(?<!\\)(?:\\{2})*` bookkeeping that existed only to recover from the original ambiguity.

Adds two performance benchmarks asserting linear parse time for both ambiguity classes.

ref: GHSA-c4rq-3m3g-8wgx

(cherry picked from commit 807f6ee)
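The shape of the STRING change described above can be sketched in plain Ruby. This is a minimal illustration only: the module name `StringRuleSketch` and the `NL`, `UNICODE`, and `ESCAPE` fragments are simplified stand-ins assumed for this example, not the tokenizer's actual rules.

```ruby
# Simplified stand-ins for the tokenizer macros named in the description.
# Assumption: these approximate {nl}, {unicode}, and {escape}; they are
# NOT copied from Nokogiri's tokenizer.
module StringRuleSketch
  NL      = /\r\n|[\n\r\f]/
  UNICODE = /\\[0-9A-Fa-f]{1,6}(?:\r\n|[ \n\r\t\f])?/
  ESCAPE  = /#{UNICODE}|\\[^\n\r\f0-9A-Fa-f]/

  # Before: the negated class also accepts "\", so it overlaps with ESCAPE,
  # and the {1,6} hex run can be re-split on backtracking. The trailing
  # lookbehind bookkeeping stops the match from ending on an escaped quote.
  AMBIGUOUS = /\A"(?:[^\n\r\f"]|#{ESCAPE})*(?<!\\)(?:\\{2})*"/

  # After: "\" is excluded from the class, and the atomic group (?>...)
  # commits to each iteration's choice once that iteration has matched.
  FIXED = /\A"(?>[^\n\r\f\\"]|\\?#{NL}|#{ESCAPE})*"/
end

terminated   = '"caf\e9 \"ok\""'   # single quotes keep the backslashes literal
unterminated = '"' + '\a' * 3      # small N keeps the ambiguous pattern cheap

puts StringRuleSketch::AMBIGUOUS.match?(terminated)
puts StringRuleSketch::FIXED.match?(terminated)
puts StringRuleSketch::AMBIGUOUS.match?(unterminated)
puts StringRuleSketch::FIXED.match?(unterminated)
```

Both patterns accept the terminated string and reject the unterminated one. The difference is in how much work the rejection takes as N grows: in the second pattern each iteration has a single way to consume a given escape, so the engine can only release whole iterations before failing.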

A second instance of the same backtracking pattern: `{unicode}`'s `[0-9A-Fa-f]{1,6}` admits six match lengths per escape position, and {nmchar} appears under `*` in {name}. When the `{ident}\({w}` rule fails (no `(` after an identifier-shaped prefix), the engine backtracks through `{nmchar}*` for 6**N parses.

Payload `\aaaaaa\aaaaaa...X` triggers it: at n=8 it takes 330ms, at n=10 it takes 11.4s.

Wrap the body alternations of {nmchar} and {nmstart} in atomic groups, mirroring the prior STRING-rule fix. Each nmchar/nmstart match is locked once committed, so the outer `{nmchar}*` can release whole iterations but cannot try alternative inner consumption of the {1,6} hex run.

Add a benchmark test asserting linear time, similar to previous.

ref: GHSA-c4rq-3m3g-8wgx

(cherry picked from commit 9bada21)
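The atomic-group wrapping described above might look roughly like the following in Ruby. The module name `NameRuleSketch` and every fragment in it are assumptions made for illustration; they approximate {nmstart}, {nmchar}, {ident}, and {w} rather than reproduce the tokenizer's definitions.

```ruby
# Hypothetical, simplified fragments; NOT Nokogiri's actual tokenizer rules.
module NameRuleSketch
  W        = /[ \t\r\n\f]*/
  NONASCII = /[^\x00-\x7F]/
  UNICODE  = /\\[0-9A-Fa-f]{1,6}(?:\r\n|[ \n\r\t\f])?/
  ESCAPE   = /#{UNICODE}|\\[^\n\r\f0-9A-Fa-f]/

  # Each alternation is wrapped in (?>...), so once a single name character
  # (including a whole escape) has matched, its length is never revisited.
  NMSTART = /(?>[_A-Za-z]|#{NONASCII}|#{ESCAPE})/
  NMCHAR  = /(?>[_A-Za-z0-9-]|#{NONASCII}|#{ESCAPE})/

  IDENT    = /-?#{NMSTART}#{NMCHAR}*/
  FUNCTION = /\A#{IDENT}\(#{W}/   # stand-in for the {ident}\({w} rule
end

puts NameRuleSketch::FUNCTION.match?('nth-child(')
puts NameRuleSketch::FUNCTION.match?('caf\e9 (')

# Identifier-shaped prefix with no "(": the outer NMCHAR* can only hand back
# whole iterations, so rejection does not explore alternative hex-run splits.
payload = '\aaaaaa' * 50 + 'X'
puts NameRuleSketch::FUNCTION.match?(payload)
```

The first two calls succeed and the payload is rejected. Without the atomic groups, the hex digits left over from a shorter `{1,6}` match could be re-consumed as ordinary name characters, which is the source of the six choices per escape.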

JRuby's JIT warmup variance makes per-call timings too noisy for the R**2 >= 0.99 linearity assertion. Observed CI failures with R**2 around 0.94-0.97 even though the regex itself is unchanged between engines.

The ReDoS property is determined by the regex, not the engine (Joni and Onigmo implement the same matching semantics), so MRI coverage is sufficient evidence the fixes hold.

(cherry picked from commit 760bde0)
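For readers unfamiliar with the R**2 threshold mentioned above, here is a minimal sketch of how a coefficient of determination for a straight-line fit can be computed. The helper name `r_squared` and the sample data are hypothetical and synthetic; this is not the test suite's code and the numbers are not measurements.

```ruby
# Coefficient of determination for a least-squares line through (xs, ys).
# For simple linear regression this equals the squared Pearson correlation.
# Returns NaN when either series is constant (zero variance).
def r_squared(xs, ys)
  n      = xs.size.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n
  sxy = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  sxx = xs.sum { |x| (x - mean_x)**2 }
  syy = ys.sum { |y| (y - mean_y)**2 }
  (sxy**2) / (sxx * syy)
end

sizes       = (1..10).to_a
linear      = sizes.map { |n| 10 * n }   # synthetic "time grows with n"
exponential = sizes.map { |n| 6**n }     # synthetic "time grows as 6**n"

puts r_squared(sizes, linear) >= 0.99
puts r_squared(sizes, exponential) >= 0.99
```

Perfectly linear data passes the threshold and exponential growth falls well below it. Timing noise lowers the score even for linear behaviour, which is the failure mode reported for JRuby.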

flavorjones added 3 commits April 27, 2026 12:51

flavorjones added the topic/security, topic/css, and backport (Backport of a PR to the current release branch) labels Apr 27, 2026

flavorjones merged commit 7501a63 into v1.19.x Apr 27, 2026
162 checks passed

flavorjones deleted the regex-backtracking-redos_v1.19.x branch April 27, 2026 18:35
