Parser hangs on Perl POD-embedded language sections (zero-width escape + embed) #650

@stefanobaghino

Summary

The parser hangs indefinitely on testdata/Packages/Perl/syntax_test_perl.pl (5899 lines) from the upstream sublimehq/Packages submodule bumped in #630. Even the first ~50 lines are enough to wedge a --release build for 3+ minutes with no CPU-use change and no forward progress. The full file was observed to run 40+ minutes locally before being killed; on CI the "Build and test" job consistently ran past the 1-hour GitHub Actions cutoff.

Earlier hypotheses — large warning volume, eprintln! overhead for expired branch points, general parsing slowness — cannot account for that runtime. The real cause is a single-file parser hang.

The file is currently skipped in examples/syntest.rs (should_skip, around line 756) as a temporary CI unblocker. The skip is logged when --summary is not set and is invisible in the known-failures diff (which uses --summary), so the baseline files stay clean. This issue tracks the underlying parser fix.

Reproduction

cargo build --release --example syntest

mkdir -p /tmp/perl-isolated
cp testdata/Packages/Perl/syntax_test_perl.pl /tmp/perl-isolated/

# Hangs indefinitely.
./target/release/examples/syntest /tmp/perl-isolated testdata/Packages --summary

Running the same binary against any other folder (Haskell, Markdown, CSS, Bash, …) completes in under a second once the syntax set has loaded.

Likely trigger

The bumped Perl syntax (testdata/Packages/Perl/Perl.sublime-syntax:257-290) declares six POD-embedded language sections using this pattern:

- match: \bjson\b
  scope: constant.other.language-name.perl
  embed: scope:source.json
  embed_scope: source.json.embedded.perl
  escape: (?=^{{pod}})

where {{pod}} resolves to =[_[:alpha:]]\w* (matches =begin, =end, =cut, etc. at start of line).

Three things conspire:

  1. Zero-width escape: (?=…) is a lookahead — the match is real but consumes no characters.
  2. Embed + escape: the embed pushes source.json onto the stack; when the escape fires it pops back without advancing position.
  3. Branch points in the embedded syntax: JSON (and SQL, HTML) use branch_point / branch / fail to disambiguate structures.

When these combine at a line boundary, the parser re-enters the same scope at the same position after the escape — triggering either an effectively-infinite backtracking cascade or a literal loop that the existing loop-protection doesn't catch.
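The interaction can be illustrated with a toy model. This is only a sketch under stated assumptions, not the parser's actual control flow: `pod_escape_applies` and `simulate` are hypothetical names, the lookahead is approximated by hand rather than with a regex engine, and the model assumes the outer context re-enters the embed without consuming input (the hypothesis above, not a confirmed diagnosis).

```rust
/// Hypothetical stand-in for the lookahead `(?=^{{pod}})`: true when the
/// cursor is at the start of the line and the line begins with `=` followed
/// by a letter or underscore. Because it is a lookahead, it only reports
/// whether the escape applies; it never says how far to advance.
fn pod_escape_applies(line: &str, pos: usize) -> bool {
    if pos != 0 {
        return false; // `^` anchors the pattern to the start of the line
    }
    let mut chars = line.chars();
    chars.next() == Some('=')
        && chars.next().map_or(false, |c| c == '_' || c.is_alphabetic())
}

/// Toy step loop. Returns (steps taken, whether the step budget ran out).
fn simulate(line: &str, max_steps: usize) -> (usize, bool) {
    let mut pos = 0usize;
    let mut embedded = false;
    let mut steps = 0usize;
    while pos < line.len() {
        if steps == max_steps {
            return (steps, true); // no forward progress within the budget
        }
        steps += 1;
        if !embedded {
            // Assumption: the outer context re-enters the embed without
            // consuming any input.
            embedded = true;
        } else if pod_escape_applies(line, pos) {
            // Zero-width escape: pop the embed, position unchanged.
            embedded = false;
        } else {
            // Ordinary embedded token: consume one character.
            pos += line[pos..].chars().next().map_or(1, |c| c.len_utf8());
        }
    }
    (steps, false)
}
```

In this model a line such as `=end json` alternates between entering and escaping the embed at position 0 until the step budget is exhausted, while a line of ordinary JSON is consumed one character per step.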

@michaelblyons suggested in #630 (comment) that the culprit is likely the comment-pod-verbatim-body context (Perl.sublime-syntax:249) and specifically zero-length escape matches being rewound inside a branch — worth preserving as an investigation lead.

Loop-protection gap

Two guards exist, but neither fires:

  • src/parsing/parser.rs:479-500: would_loop handling for a non-consuming push immediately followed by a non-consuming pop. Advances the cursor by one character when triggered.
  • src/parsing/parser.rs:676-683: push_too_deep guard, bails on push/branch/embed once stack depth ≥ 100.

The zero-width-escape + embedded-branch pattern trips neither: the stack stays below 100, and the push/pop pair isn't detected as non-consuming because the escape itself is a match (just zero-width), not an empty push followed by an empty pop. Escape matches take the early-return path in find_best_match (parser.rs:716) with would_loop: false hard-coded — the loop guard never evaluates them.
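A simplified model of the two guards, as described above, makes the gap easier to see. The types and function signatures here (`Kind`, `Candidate`, `would_loop`, `push_too_deep`) are hypothetical simplifications written for this issue; they are not copied from parser.rs and the real code is structured differently.

```rust
/// Hypothetical, simplified view of a match candidate (not the real type).
#[derive(Clone, Copy, PartialEq, Debug)]
enum Kind {
    Pop,
    Escape,
}

struct Candidate {
    kind: Kind,
    start: usize, // byte offset where the match begins
    end: usize,   // byte offset where the match ends
}

/// Stack-depth threshold taken from the description above.
const MAX_DEPTH: usize = 100;

/// Models the first guard as described: only an empty pop that directly
/// follows an empty push is flagged. Escape matches are never evaluated.
fn would_loop(prev_push_was_empty: bool, c: &Candidate) -> bool {
    match c.kind {
        // Per the description, escapes return early with `would_loop: false`.
        Kind::Escape => false,
        Kind::Pop => prev_push_was_empty && c.start == c.end,
    }
}

/// Models the second guard: bail once the stack is at least MAX_DEPTH deep.
fn push_too_deep(stack_depth: usize) -> bool {
    stack_depth >= MAX_DEPTH
}
```

Under this model a zero-width escape at offset 0 with a shallow stack passes both checks, which matches the observed behavior of neither guard firing.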

Not introduced by the bump

Bisection on the #618 branch showed the hang is present from the first commit that bumps the Packages submodule:

| Commit | Description | Result |
| --- | --- | --- |
| f42c24a | Bump testdata/Packages + load-time fixes | hangs |
| ab5ca7a | pop + embed combination support | hangs |
| 4f30fb1 | pop + branch, escape regex re-resolve, nested escape dispatch | hangs |
| HEAD (at time of investigation) | + CI relaxations, metadata test fixes | hangs |

Later commits are correctness fixes that let the file load successfully so the parser has the chance to hang — they don't introduce the hang.

Prior discussion: #630 (comment) (initial problem statement) and #630 (comment) (skip rationale in the reader's guide).

Proposed fix (two candidates)

  1. Extend would_loop detection to include zero-width escape matches: detect that the escape match is zero-width and treat it the same as a non-consuming pop for loop-protection purposes.
  2. Short-circuit embedded-syntax re-entry at unchanged position: track whether the embed context was just exited via a zero-width escape and refuse to re-enter it at the same position without advancing.
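
Both candidates can be sketched in a few lines. This is an illustration of the ideas only: `escape_would_loop` and `ReentryGuard` are hypothetical names, context ids are plain integers, and positions are assumed to be absolute byte offsets so that the same column on different lines does not collide. How either idea would be wired into find_best_match is left open.

```rust
/// Candidate 1 (sketch): let a zero-width escape count as a non-consuming
/// pop, so it is flagged when it directly follows a non-consuming push.
fn escape_would_loop(prev_push_was_empty: bool, start: usize, end: usize) -> bool {
    prev_push_was_empty && start == end
}

/// Candidate 2 (sketch): remember where an embed was last left through a
/// zero-width escape and refuse to re-enter it at that same position.
#[derive(Default)]
struct ReentryGuard {
    /// (context id, absolute byte offset) of the last zero-width escape.
    last_exit: Option<(usize, usize)>,
}

impl ReentryGuard {
    /// Record an escape match spanning `start..end` for `context_id`.
    /// A consuming escape clears the record because progress was made.
    fn record_escape(&mut self, context_id: usize, start: usize, end: usize) {
        self.last_exit = if start == end {
            Some((context_id, start))
        } else {
            None
        };
    }

    /// True when entering `context_id` at `pos` cannot repeat the last
    /// zero-width exit.
    fn may_enter(&self, context_id: usize, pos: usize) -> bool {
        self.last_exit != Some((context_id, pos))
    }
}
```

Candidate 2 leaves escapes that consume input untouched, which is one way to keep embed + escape cases that do make progress unaffected.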

Either fix must preserve the correctness of legitimate embed + escape cases (HTML-in-Ruby, JS-in-HTML, …) that do progress. A minimal synthetic reproduction against the parser's own tests is the right starting point; the Perl file is too large and has too many interacting patterns to serve as a primary test case.
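
Because the failure mode is a hang rather than a wrong result, a synthetic regression test needs a time limit so it fails instead of blocking the test run. A minimal sketch using only the standard library follows; `run_with_timeout` is a hypothetical helper, not an existing function in this repository, and the closure passed to it would wrap whatever parse call the test exercises.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `work` on a helper thread and returns its result, or `None` if it
/// does not finish within `limit`. A timed-out thread is not cancelled; it
/// keeps running detached until the process exits.
fn run_with_timeout<T, F>(limit: Duration, work: F) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Ignore the send error that occurs if the receiver already gave up.
        let _ = tx.send(work());
    });
    rx.recv_timeout(limit).ok()
}
```

A test would then assert that the result is `Some(_)` for a small synthetic input. The limit should be generous enough for slow CI machines, since a limit that is too tight turns a slow parse into a false failure.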

Other candidates to re-check after the fix

Once the Perl hang is fixed, remove the skip in examples/syntest.rs and re-run make syntest. The following have not been observed to hang, but they share a similar embed + escape + branch-point shape and are worth re-checking:

  • Ruby heredocs with embedded SQL / JS.
  • Markdown fenced code blocks with embedded language syntaxes.
  • HTML (Rails) with ERB + embedded Ruby/JS/CSS.

Related

Refs: #631.
