Skip to content

Commit db4788d

Browse files
committed
fix(postprocess): keep the apply and skip-list walks in sync on tag interiors
Re-review of cbb6b2a surfaced two more instances of the same divergence class, one on each side. The nofollow path advanced only past "<a " after injecting rel, leaving the cursor inside the href value; a <!-- or raw-text tag name planted there (the comment form survives the default tagfilter) was probed as a region opener and the closerless region ran to end-of-document, stripping rel from every later link and id from every later heading. The cursor now jumps the whole anchor tag with scan_tag_close while the attribute bytes still flush verbatim. compute_skip_list treated scan_tag_closes html_len return (unterminated tag) as a jump to end-of-document and stopped recording skip regions, while apply_transforms advances one byte past the same tag and keeps skipping. The desynced skip list let a heading fingerprint resolve inside a region the apply pass then skipped, and the heading lost its id. The unterminated return now advances one byte there too, matching the apply pass. Both reproduced against the built extension before the change; the new tests fail on the parent commit and pass now, full suite 35/35. Found by a codesage re-review of the fix commits.
1 parent 1f62549 commit db4788d

4 files changed

Lines changed: 130 additions & 3 deletions

File tree

CHANGELOG.md

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -13,6 +13,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
1313

1414
### Fixed
1515

16+
- `nofollowLinks`: a skip-region opener (`<!--`, `<script>`, ...) inside an
17+
`href` value in raw HTML under `unsafe: true` fabricated a skip range to
18+
end-of-document, silently stripping `rel="nofollow"` from every later
19+
link and `id` from every later heading. Regression test:
20+
`tests/044_nofollow_attr_opener.phpt`.
21+
- `headingAnchors`: an unterminated raw tag truncated the precomputed
22+
skip list while the apply pass kept honoring later skip regions; the
23+
desync silently dropped the `id` of a heading whose fingerprint also
24+
appeared in such a region. Regression test:
25+
`tests/045_skiplist_unterminated_tag.phpt`.
1626
- Module lifecycle: a second MINIT after MSHUTDOWN in the same process
1727
(embedded SAPI cycle, a dlclose that doesn't unload) failed with
1828
E_CORE_ERROR because cmark's one-shot registration guards survived

mdparser_html_postprocess.c

Lines changed: 18 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -368,8 +368,14 @@ static bool compute_skip_list(const char *html, size_t html_len,
368368
i = (end > i) ? end : i + 1;
369369
continue;
370370
}
371+
/* html_len means an unterminated tag: advance one byte and keep
372+
* scanning, exactly like apply_transforms' literal-byte path.
373+
* Jumping to end-of-document instead would leave every later
374+
* skip region unrecorded while apply_transforms still honors
375+
* them, and the desynced skip list lets a heading fingerprint
376+
* resolve inside a region the apply pass then skips. */
371377
size_t tag_end = scan_tag_close(html, i, html_len);
372-
i = (tag_end != SIZE_MAX && tag_end > i) ? tag_end : i + 1;
378+
i = (tag_end != SIZE_MAX && tag_end < html_len) ? tag_end : i + 1;
373379
}
374380
return true;
375381
}
@@ -878,8 +884,17 @@ static zend_string *apply_transforms(const char *html, size_t html_len,
878884
}
879885
smart_str_appendl(&out, "<a ", 3);
880886
smart_str_appendl(&out, rel_inject, rel_inject_len);
881-
i += 3;
882-
run_start = i;
887+
/* Jump the rest of the anchor tag with the quote-aware
888+
* scanner. Advancing only past "<a " leaves the cursor
889+
* inside the href value, where an embedded region
890+
* opener (`<!--`, `<script>`) fabricates a skip region
891+
* running to end-of-document. The attribute bytes still
892+
* flush verbatim from run_start; an unterminated tag
893+
* (html_len) falls back to the old +3 advance and the
894+
* loop's literal-byte handling. */
895+
size_t a_end = scan_tag_close(html, i, html_len);
896+
run_start = i + 3;
897+
i = (a_end != SIZE_MAX && a_end < html_len) ? a_end : i + 3;
883898
continue;
884899
}
885900
}
Lines changed: 61 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,61 @@
1+
--TEST--
2+
nofollowLinks: skip-region opener inside an href must not eat later transforms
3+
--EXTENSIONS--
4+
mdparser
5+
--FILE--
6+
<?php
7+
8+
/* After injecting rel into `<a href="`, apply_transforms used to
9+
* advance only past the "<a " bytes, leaving the cursor inside the
10+
* href value. A `<!--` or raw-text tag name embedded there (reachable
11+
* under unsafe; the comment form survives the default tagfilter) was
12+
* then probed as a region opener, and the closerless region ran to
13+
* end-of-document, stripping rel from every later link and id from
14+
* every later heading. The cursor now jumps the whole anchor tag,
15+
* mirroring compute_skip_list's quote-aware walk. */
16+
17+
function check(string $label, bool $cond): void {
18+
echo ($cond ? "OK" : "FAIL"), ": $label\n";
19+
}
20+
21+
$p = new MdParser\Parser(new MdParser\Options(unsafe: true, nofollowLinks: true));
22+
23+
$h = $p->toHtml("<a href=\"x<!--y\">l</a>\n\n[z](http://example.com/)\n");
24+
check("comment opener in href: later link keeps rel",
25+
str_contains($h, '<a rel="nofollow noopener noreferrer" href="http://example.com/">z</a>'));
26+
check("comment opener in href: decoy anchor itself got rel",
27+
str_contains($h, '<a rel="nofollow noopener noreferrer" href="x<!--y">l</a>'));
28+
29+
// Control: same document without the embedded opener.
30+
$h = $p->toHtml("<a href=\"xy\">l</a>\n\n[z](http://example.com/)\n");
31+
check("control: later link keeps rel",
32+
str_contains($h, '<a rel="nofollow noopener noreferrer" href="http://example.com/">z</a>'));
33+
34+
/* Raw-text tag name in the href needs tagfilter off to survive. */
35+
$p = new MdParser\Parser(new MdParser\Options(unsafe: true, tagfilter: false, nofollowLinks: true));
36+
$h = $p->toHtml("<a href=\"x<script>y\">l</a>\n\n[z](http://example.com/)\n");
37+
check("script opener in href: later link keeps rel",
38+
str_contains($h, '<a rel="nofollow noopener noreferrer" href="http://example.com/">z</a>'));
39+
40+
/* Heading anchors after the decoy must survive too. */
41+
$p = new MdParser\Parser(new MdParser\Options(unsafe: true, nofollowLinks: true, headingAnchors: true));
42+
$h = $p->toHtml("<a href=\"x<!--y\">l</a>\n\n# foo\n");
43+
check("comment opener in href: later heading keeps id",
44+
str_contains($h, '<h1 id="foo">foo</h1>'));
45+
46+
/* A real region after the decoy anchor is still honored. */
47+
$h = $p->toHtml("<a href=\"x<!--y\">l</a>\n\n<!-- <a href=\"http://spam.example/\">s</a> -->\n\n[z](http://example.com/)\n");
48+
check("real comment after decoy: body untouched",
49+
str_contains($h, '<!-- <a href="http://spam.example/">s</a> -->'));
50+
check("real comment after decoy: later link keeps rel",
51+
str_contains($h, '<a rel="nofollow noopener noreferrer" href="http://example.com/">z</a>'));
52+
53+
?>
54+
--EXPECT--
55+
OK: comment opener in href: later link keeps rel
56+
OK: comment opener in href: decoy anchor itself got rel
57+
OK: control: later link keeps rel
58+
OK: script opener in href: later link keeps rel
59+
OK: comment opener in href: later heading keeps id
60+
OK: real comment after decoy: body untouched
61+
OK: real comment after decoy: later link keeps rel
Lines changed: 41 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,41 @@
1+
--TEST--
2+
headingAnchors: unterminated raw tag must not truncate the skip list
3+
--EXTENSIONS--
4+
mdparser
5+
--FILE--
6+
<?php
7+
8+
/* scan_tag_close returns html_len for a tag with an unclosed quote.
9+
* compute_skip_list used to take that as a jump-to-end and stopped
10+
* recording skip regions, while apply_transforms advances one byte
11+
* past the same tag and keeps skipping regions. The desync let
12+
* resolve_heading_offsets place a heading fingerprint inside a region
13+
* apply_transforms then skipped, so the real heading silently lost
14+
* its id. compute_skip_list now treats the unterminated tag the same
15+
* way: advance one byte, keep scanning. */
16+
17+
function check(string $label, bool $cond): void {
18+
echo ($cond ? "OK" : "FAIL"), ": $label\n";
19+
}
20+
21+
$p = new MdParser\Parser(new MdParser\Options(headingAnchors: true, unsafe: true, tagfilter: false));
22+
23+
$h = $p->toHtml("<div data=\"x\n\n<style>\n<h1>foo</h1>\n</style>\n\n# foo\n");
24+
check("unterminated tag + decoy region: heading keeps id",
25+
str_contains($h, '<h1 id="foo">foo</h1>'));
26+
27+
// Control: terminated tag, same decoy region.
28+
$h = $p->toHtml("<div data=\"x\">\n\n<style>\n<h1>foo</h1>\n</style>\n\n# foo\n");
29+
check("control: heading keeps id", str_contains($h, '<h1 id="foo">foo</h1>'));
30+
31+
// Comment-region variant works with the default tagfilter.
32+
$p2 = new MdParser\Parser(new MdParser\Options(headingAnchors: true, unsafe: true));
33+
$h = $p2->toHtml("<div data=\"x\n\n<!-- <h1>foo</h1> -->\n\n# foo\n");
34+
check("unterminated tag + decoy comment: heading keeps id",
35+
str_contains($h, '<h1 id="foo">foo</h1>'));
36+
37+
?>
38+
--EXPECT--
39+
OK: unterminated tag + decoy region: heading keeps id
40+
OK: control: heading keeps id
41+
OK: unterminated tag + decoy comment: heading keeps id

0 commit comments

Comments
 (0)