Release 0.3.0 · iliaal/mdparser

Added

MdParser\Options::headingAnchors: when true, every rendered <hN> gets an id attribute holding a GitHub-style slug of the heading's text. Slug generation lowercases ASCII letters, replaces each whitespace run with a single -, drops other ASCII punctuation, preserves multibyte UTF-8 sequences byte-for-byte, and dedupes collisions with -1, -2, ... Headings whose text slugifies to nothing (pure punctuation) emit <hN> with no id rather than id="". Coexists with sourcepos: the id lands before data-sourcepos.
MdParser\Options::nofollowLinks: when true, every emitted <a href="..."> gets rel="nofollow noopener noreferrer" injected for inline links, reference links, and autolinks. Applies to toHtml() and toInlineHtml(). Anchors inside fenced or inline code are left untouched because cmark escapes them before reaching the postprocess step. In-document fragment anchors (href="#...", i.e. footnote references and backrefs) are intentionally skipped. Raw <script> / <style> regions under unsafe: true are emitted verbatim so anchor-shaped substrings inside JavaScript or CSS are not corrupted.
Linux and macOS prebuilt binaries are now attached to every GitHub release (x86_64 + arm64 glibc Linux, x86_64 + arm64 macOS, PHP 8.4 and 8.5, NTS). PIE picks the matching .so first and only falls back to a source build for combinations not covered by an asset (e.g. PHP 8.3, Alpine/musl, ZTS). composer.json declares download-url-method: ["pre-packaged-binary", "composer-default"] to opt into the prebuilt path.

Both new HTML-postprocess flags default to false. They are pure HTML post-passes; XML and AST output are unaffected. The static Parser::html() / Parser::xml() shortcuts use the module defaults and so do not apply either transform.

Heading anchors are positioned by rendering each AST heading standalone and locating its exact byte sequence in the document HTML, rather than by counting line-start <hN> tags. Under unsafe: true, raw HTML headings written directly in the markdown source are normally left alone and do not consume slugs intended for real headings. One documented limitation: if a raw HTML heading produces bytes identical to a later Markdown heading (e.g. <h1>same</h1> followed by # same), the byte-fingerprint search hits the raw heading first, the raw heading absorbs the id, and the real Markdown heading is left without one. A durable fix needs renderer-level heading-id support; until then, unsafe: true callers should not rely on heading-id stability when raw HTML headings can collide with real ones. Pinned in tests/030_anchor_unsafe_collision.phpt.
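
The slug rules listed for headingAnchors can be sketched in C roughly as follows. This is an illustration under stated assumptions, not the extension's code: slugify, slug_dedupe, and struct slug_set are hypothetical names, keeping - and _ is an assumption borrowed from GitHub's slugger, leading whitespace is assumed to produce no dash, and the fixed-size buffers exist only to keep the sketch short.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define SLUG_CAP 128            /* illustrative capacity; longer slugs are truncated */
#define ID_CAP (SLUG_CAP + 16)  /* slug plus room for a "-N" dedupe suffix */
#define MAX_SEEN 64             /* illustrative limit on tracked ids */

/* Hypothetical helper: writes a slug of `text` into `out`, returns its length.
   A return value of 0 means "emit the heading with no id attribute". */
size_t slugify(const char *text, char *out, size_t cap)
{
    assert(text != NULL && out != NULL && cap > 0);
    size_t n = 0;
    int pending_dash = 0;

    for (const unsigned char *p = (const unsigned char *)text; *p; p++) {
        unsigned char c = *p;
        if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
            pending_dash = (n > 0);  /* a whole whitespace run becomes one dash */
            continue;
        }
        /* Assumption: '-' and '_' survive; all other ASCII punctuation is dropped.
           Bytes >= 0x80 (multibyte UTF-8) pass through unchanged. */
        int keep = c >= 0x80 || isalnum(c) || c == '-' || c == '_';
        if (!keep)
            continue;
        if (pending_dash && n + 1 < cap)
            out[n++] = '-';
        pending_dash = 0;
        if (n + 1 < cap)
            out[n++] = (c < 0x80) ? (char)tolower(c) : (char)c;
    }
    out[n] = '\0';
    return n;
}

struct slug_set {
    char seen[MAX_SEEN][ID_CAP];
    size_t count;
};

static int slug_seen(const struct slug_set *s, const char *id)
{
    for (size_t i = 0; i < s->count; i++)
        if (strcmp(s->seen[i], id) == 0)
            return 1;
    return 0;
}

/* Hypothetical helper: makes `slug` unique by appending -1, -2, ... and records it.
   `slug` must point at a buffer of at least ID_CAP bytes. */
void slug_dedupe(struct slug_set *s, char *slug)
{
    char candidate[ID_CAP];
    snprintf(candidate, sizeof candidate, "%s", slug);
    for (unsigned k = 1; slug_seen(s, candidate); k++)
        snprintf(candidate, sizeof candidate, "%s-%u", slug, k);
    if (s->count < MAX_SEEN)
        snprintf(s->seen[s->count++], ID_CAP, "%s", candidate);
    snprintf(slug, ID_CAP, "%s", candidate);
}
```

With these helpers, "Hello,  World!" slugifies to hello-world, "!!!" slugifies to an empty string (so no id is emitted), and three headings that all slugify to same end up as same, same-1, and same-2.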

Changed

Parser now caches a single cmark_parser per instance and reuses it across toHtml / toXml / toAst / toInlineHtml calls. cmark_parser_finish resets the parser internally on every successful render, so the cached parser holds no state from prior input: no link reference definitions, no inline subject leftovers, no buffered partial input. After a render that did not complete cleanly the parser is rebuilt rather than reused. Pinned in tests/033_parser_reuse_isolation.phpt.
cmark allocations now route through a Zend MM-backed cmark_mem (ecalloc / erealloc / efree). cmark-side memory is now accounted by memory_limit, surfaced by memory_get_usage(), and cleaned up by Zend MM on bailout. Out-of-memory under hostile or oversized input goes through PHP's standard Allowed memory size exhausted fatal instead of cmark's default-allocator abort().
AST node-type values, list type / delim values, and table alignment values are now permanent interned strings created at MINIT, eliminating ~1 emalloc + memcpy per AST node on toAst().
AST key strings (type, children, literal, level, ...) are now permanent interned strings created at MINIT via zend_string_init_interned(..., true) instead of persistent non-interned zend_strings lazy-initialized on the first toAst() call. Permanent interned strings skip refcount mutation during zend_hash_add_new, so concurrent toAst() calls on a ZTS build no longer race the (non-atomic) shared refcount that the previous persistent strings carried.
AST node array preallocation bumped from array_init_size(out, 8) to 16. The worst-case node (a list with sourcepos: true) carries 10 keys, so 8 forced a rehash on every list. 16 lands on the next power-of-two HT bucket size and avoids the rehash for every supported node shape.
HTML postprocess failure messages now distinguish the AST depth cap (heading text exceeded MDPARSER_MAX_AST_DEPTH) from cmark iterator allocation failure and cmark render allocation failure, instead of collapsing all three reasons into the generic "HTML postprocess allocation failure" string.
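
The allocator routing described above amounts to handing cmark a table of three callbacks. The sketch below shows that pattern in plain C, assuming a table shaped like cmark's cmark_mem (calloc, realloc, free). The demo_mem type and the demo_* functions are hypothetical stand-ins: the real wrapper forwards to ecalloc / erealloc / efree, while this version forwards to the C library and keeps a live-allocation counter in place of Zend MM's accounting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an allocator table with three callbacks. */
typedef struct demo_mem {
    void *(*calloc)(size_t, size_t);
    void *(*realloc)(void *, size_t);
    void (*free)(void *);
} demo_mem;

/* Stands in for the host's memory accounting (memory_limit in the real case). */
static size_t demo_live_allocs = 0;

static void *demo_calloc(size_t n, size_t size)
{
    void *p = calloc(n, size);
    if (p != NULL)
        demo_live_allocs++;
    return p;
}

static void *demo_realloc(void *ptr, size_t size)
{
    assert(size > 0);  /* the sketch does not model realloc(ptr, 0) */
    void *p = realloc(ptr, size);
    if (p != NULL && ptr == NULL)
        demo_live_allocs++;  /* realloc(NULL, n) acts as a fresh allocation */
    return p;
}

static void demo_free(void *ptr)
{
    if (ptr != NULL) {
        assert(demo_live_allocs > 0);
        demo_live_allocs--;
    }
    free(ptr);
}

/* Every allocation made through this table is visible to the counter. */
demo_mem demo_allocator = { demo_calloc, demo_realloc, demo_free };
```

Because the library only ever sees the table, swapping the three function pointers is enough to move all of its allocations under the host's accounting and cleanup.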

Fixed

Parser::toInlineHtml() no longer lets block-level markers (#, -, >, 1., four-space indent, fenced/HTML blocks, thematic breaks) fire on lines after the first. The source-rewrite step now normalizes \r\n and lone \r to \n, collapses runs of newlines, drops leading/trailing newlines, and inserts a U+200B sentinel at the start of every physical line; the output stripper removes the wrapper plus every per-line sentinel. Multi-line input is therefore guaranteed to render as inline content.
PHP 8.6 compatibility: replaced XtOffsetOf with offsetof throughout the wrapper. php-src master removed the XtOffsetOf portability macro from zend_portability.h; offsetof from <stddef.h> is the documented replacement and works on every PHP version mdparser supports.
config.w32 now lists mdparser_html_postprocess.c so Windows builds link successfully.
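
The toInlineHtml() source-rewrite step can be sketched as below. This is an illustration only: normalize_inline is a hypothetical name, the wrapper the real step adds around the input is omitted, and the caller-supplied buffer replaces the extension's own buffer management.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SENTINEL "\xE2\x80\x8B"  /* U+200B ZERO WIDTH SPACE encoded as UTF-8 */
#define SENTINEL_LEN 3

/* Hypothetical helper: normalizes \r\n and lone \r to \n, collapses newline
   runs, drops leading/trailing newlines, and prefixes every physical line
   with the sentinel. Returns bytes written (excluding the NUL terminator),
   or (size_t)-1 if `out` is too small. */
size_t normalize_inline(const char *src, char *out, size_t cap)
{
    assert(src != NULL && out != NULL);
    if (cap == 0)
        return (size_t)-1;

    size_t n = 0;
    int at_line_start = 1;    /* the next content byte begins a physical line */
    int pending_newline = 0;  /* a newline run followed earlier content */

    for (const char *p = src; *p; p++) {
        if (*p == '\r' || *p == '\n') {
            /* Any mix of \r and \n counts as one run; leading runs are dropped. */
            if (n > 0)
                pending_newline = 1;
            at_line_start = 1;
            continue;
        }
        size_t need = (pending_newline ? 1 : 0)
                    + (at_line_start ? SENTINEL_LEN : 0) + 1;
        if (n + need + 1 > cap)
            return (size_t)-1;
        if (pending_newline)
            out[n++] = '\n';
        if (at_line_start) {
            memcpy(out + n, SENTINEL, SENTINEL_LEN);
            n += SENTINEL_LEN;
        }
        out[n++] = *p;
        pending_newline = 0;
        at_line_start = 0;
    }
    /* A trailing newline run is never flushed, so it is dropped. */
    out[n] = '\0';
    return n;
}
```

Since no line in the normalized text starts with #, -, >, a digit, or whitespace, block-level markers have nothing to match at a line start. Input made up only of newlines normalizes to an empty string.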

Security

HTML postprocess no longer splices into raw-HTML attribute values, HTML comments, CDATA, or escapable-raw-text element bodies. Under unsafe: true, tagfilter: false, nofollowLinks: true, attacker-authored bytes inside <title>, <textarea>, <iframe>, <noscript>, <xmp>, <noembed>, <noframes>, <plaintext>, <!-- … -->, <![CDATA[ … ]]>, or quoted attribute values like <div title='<a href="x">…'> previously matched the postprocessor's <a href=" pattern, and the postprocessor rewrote bytes inside those regions, producing malformed HTML that could splice attributes onto the surrounding tag. The skip-region scanner now covers all HTML5 raw-text / escapable-raw-text elements + comments + CDATA, and apply_transforms walks tag-by-tag (with quoted-attribute awareness) so positions inside attribute values are never visited as tag-starts. The same logic applies to the heading-anchor fingerprint search in resolve_heading_offsets, closing the comment / CDATA / textarea slug-hijack vector. Pinned in tests/031_postprocess_attribute_safety.phpt.
Heading slugs now percent-encode invalid UTF-8 byte sequences (lone continuation bytes, overlong leads, truncated multi-byte sequences) instead of letting them land verbatim in id="…". Valid UTF-8 multi-byte sequences (e.g. 日本語) still pass through. Reachable when callers turn off validateUtf8.
Parser::toInlineHtml() no longer pre-allocates 4 * src_len + 3 for the normalized scratch buffer. Newline-heavy input well below the documented 256 MB cap previously fataled on the scratch allocation under tight memory_limit (40 MB of \n allocated ~168 MB even though the normalized buffer was empty). The scratch buffer now grows on demand via smart_str and tracks the actual normalized size. Pinned in tests/037_toinlinehtml_memory_limit.phpt.
Options objects built via ReflectionClass::newInstanceWithoutConstructor() are now rejected at Parser::__construct() with MdParser\Exception. Previously, silent reads of the uninitialized typed properties returned IS_NULL, so the parser cached an all-false mask (notably validateUtf8: false and tagfilter: false) while $parser->options still threw on any property access. The constructor now bails before publishing $options, so a half-built Options can never reach cached parser state. Regression test in tests/029_regressions.phpt.
The Linux build is now compiled with -fvisibility=hidden. Vendored cmark symbols (cmark_parser_new, cmark_release_plugins, CMARK_DEFAULT_MEM_ALLOCATOR, ...) and wrapper internals no longer appear in mdparser.so's dynamic symbol table; only PHP's required get_module is exported. This prevents symbol collisions with other extensions that vendor or link cmark.
Windows release workflow pins php/php-windows-builder/* references to a commit SHA instead of the mutable @v1 tag, so a moved or compromised tag cannot push DLLs into a release with contents: write.
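
The percent-encoding of invalid UTF-8 in heading slugs can be sketched as follows. This is an illustration under assumptions, not the extension's code: the function names are hypothetical, uppercase hex digits are an assumption (the notes above do not state the case), and the real slugger applies this during slug generation rather than as a separate pass.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Length (1..4) of the well-formed UTF-8 sequence at s, or 0 if the bytes
   there are invalid: lone continuation, overlong form, surrogate,
   out-of-range code point, or a sequence truncated by the end of input. */
static size_t utf8_valid_len(const unsigned char *s, size_t avail)
{
    unsigned char c = s[0];
    size_t len;
    unsigned long cp;

    if (c < 0x80) return 1;
    if (c >= 0xC2 && c <= 0xDF)      { len = 2; cp = c & 0x1F; }
    else if (c >= 0xE0 && c <= 0xEF) { len = 3; cp = c & 0x0F; }
    else if (c >= 0xF0 && c <= 0xF4) { len = 4; cp = c & 0x07; }
    else return 0;               /* 0x80..0xC1 and 0xF5..0xFF never lead */
    if (avail < len) return 0;   /* truncated multi-byte sequence */

    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (len == 3 && (cp < 0x800 || (cp >= 0xD800 && cp <= 0xDFFF))) return 0;
    if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return 0;
    return len;
}

/* Hypothetical helper: copies valid sequences through and writes each invalid
   byte as %XX. Returns bytes written, or (size_t)-1 if `out` is too small. */
size_t escape_invalid_utf8(const char *in, size_t in_len, char *out, size_t cap)
{
    assert(in != NULL && out != NULL);
    const unsigned char *s = (const unsigned char *)in;
    size_t n = 0;

    if (cap == 0)
        return (size_t)-1;
    for (size_t i = 0; i < in_len; ) {
        size_t len = utf8_valid_len(s + i, in_len - i);
        if (len > 0) {
            if (n + len + 1 > cap) return (size_t)-1;
            memcpy(out + n, s + i, len);
            n += len;
            i += len;
        } else {
            if (n + 3 + 1 > cap) return (size_t)-1;
            snprintf(out + n, 4, "%%%02X", (unsigned)s[i]);
            n += 3;
            i += 1;
        }
    }
    out[n] = '\0';
    return n;
}
```

A lone continuation byte 0x80 between a and b comes out as a%80b, while the three bytes of 日 (E6 97 A5) pass through unchanged.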
