Markdown output emits invalid CommonMark when <em>/<strong> contain leading or trailing whitespace #843

@movermeyer

Summary

When an <em> or <strong> element contains leading or trailing whitespace inside the tag, trafilatura's output_format="markdown" serializer copies that whitespace verbatim between the emphasis delimiters. The resulting markdown is not valid CommonMark: the delimiter runs fail the flanking rules in https://spec.commonmark.org/0.31.2/#emphasis-and-strong-emphasis, so a downstream parser will either fail to parse the emphasis at all or, worse, mis-pair the delimiters across unrelated emphasis spans.
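To make the flanking failure concrete, here is a minimal sketch that classifies each run of * in a line as left-flanking (can open emphasis) and/or right-flanking (can close emphasis), following the definitions in the linked spec section. The function name classify_star_runs is hypothetical and not part of trafilatura or any parser. The sketch ignores backslash escapes, code spans and _ delimiters, and it approximates "Unicode whitespace" with str.isspace().

```python
import re
import unicodedata


def _is_space(ch: str) -> bool:
    # The spec treats the start and end of the line as whitespace.
    return ch == "" or ch.isspace()


def _is_punct(ch: str) -> bool:
    # CommonMark 0.31.2: Unicode punctuation = general categories P and S.
    return ch != "" and unicodedata.category(ch)[0] in ("P", "S")


def classify_star_runs(text: str) -> list[tuple[str, bool, bool]]:
    """Return (run, left_flanking, right_flanking) for every run of '*'.

    Hypothetical helper for illustration only; escapes and code spans
    are not handled.
    """
    results = []
    for match in re.finditer(r"\*+", text):
        before = text[match.start() - 1] if match.start() > 0 else ""
        after = text[match.end()] if match.end() < len(text) else ""
        left = not _is_space(after) and (
            not _is_punct(after) or _is_space(before) or _is_punct(before)
        )
        right = not _is_space(before) and (
            not _is_punct(before) or _is_space(after) or _is_punct(after)
        )
        results.append((match.group(), left, right))
    return results


print(classify_star_runs("This is *really *hard to do."))
# [('*', True, False), ('*', True, False)]   two openers, no closer
print(classify_star_runs("This is *really* hard to do."))
# [('*', True, False), ('*', False, True)]   one opener, one closer
```

In the first line the second * is preceded by a space, so it cannot close the emphasis; in the second line it can.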

Reproducer

import trafilatura

html = """<html><body><article>
<p>This is <em>really </em>hard to do.</p>
<p>Then a <strong>bold </strong>word.</p>
<p>Now <em> spaced both sides </em> here.</p>
</article></body></html>"""

print(trafilatura.extract(html, output_format="markdown"))

Actual output

This is *really *hard to do.

Then a **bold **word.

Now * spaced both sides * here.

Expected output

Whitespace adjacent to the delimiter should be moved outside the emphasis run so the emphasis is preserved as valid CommonMark:

This is *really* hard to do.

Then a **bold** word.

Now *spaced both sides* here.

Incorrect parsing

>>> import mistune
>>> mistune.html("This is *really *hard to do, and *quite* tricky.")
'<p>This is <em>really *hard to do, and *quite</em> tricky.</p>'

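As the output above shows, mistune does not see two separate emphasis runs in *really *hard ... *quite*. The * before really is left-flanking and the * after quite is right-flanking, so mistune pairs them and swallows the whole span as one emphasis. A parser that follows the spec's delimiter algorithm strictly instead pairs only *quite* and leaves the two earlier asterisks as literal text, so the emphasis on "really" is lost either way.

In the mistune output, the literal asterisks before hard and before quite end up as text inside a single <em>. The document structure is corrupted, not just decorated with stray asterisks.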

The list-item case is similar: * spaced both sides * at the start of a paragraph is parsed as a <ul><li> because *<space> is the bullet marker.

Source

HTML like this is commonly produced by WYSIWYG editors. Browsers render it identically to the whitespace-outside form, so authors don't notice and CMSs don't normalize it.

Version

  • trafilatura 2.0.0 (also reproduced against master at ee1865b)
  • Python 3.13 / 3.14
