Summary
When an <em> or <strong> element contains leading or trailing whitespace inside the tag, trafilatura's output_format="markdown" serializer copies that whitespace verbatim between the emphasis delimiters. The resulting markdown is not valid CommonMark emphasis: the delimiter runs fail the flanking rules in https://spec.commonmark.org/0.31.2/#emphasis-and-strong-emphasis, so a downstream parser will either fail to parse the emphasis at all or, worse, mis-pair the delimiters across unrelated emphasis spans.
Reproducer
import trafilatura
html = """<html><body><article>
<p>This is <em>really </em>hard to do.</p>
<p>Then a <strong>bold </strong>word.</p>
<p>Now <em> spaced both sides </em> here.</p>
</article></body></html>"""
print(trafilatura.extract(html, output_format="markdown"))
Actual output
This is *really *hard to do.
Then a **bold **word.
Now * spaced both sides * here.
Expected output
Whitespace adjacent to the delimiter should be moved outside the emphasis run so the emphasis is preserved as valid CommonMark:
This is *really* hard to do.
Then a **bold** word.
Now *spaced both sides* here.
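One possible way to get this behaviour is sketched below. It is only an illustration of the idea, not trafilatura's actual serializer code: the helper name wrap_emphasis and its signature are hypothetical. It splits the element text into leading whitespace, core text, and trailing whitespace, and puts the delimiters around the core only.

```python
def wrap_emphasis(text: str, delimiter: str = "*") -> str:
    """Hypothetical helper: wrap text in emphasis delimiters,
    keeping any edge whitespace outside the delimiters."""
    core = text.strip()
    if not core:
        # Whitespace-only element: there is nothing to emphasize,
        # so return the text unchanged without delimiters.
        return text
    leading = text[: len(text) - len(text.lstrip())]
    trailing = text[len(text.rstrip()):]
    return f"{leading}{delimiter}{core}{delimiter}{trailing}"


print("This is " + wrap_emphasis("really ") + "hard to do.")  # This is *really* hard to do.
print("Then a " + wrap_emphasis("bold ", "**") + "word.")     # Then a **bold** word.
print(repr(wrap_emphasis(" spaced both sides ")))              # ' *spaced both sides* '
```

In the third case the relocated whitespace sits next to the spaces that already surround the element, so a real fix would presumably also need to collapse adjacent whitespace to reach the expected output above.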
Incorrect parsing
A parser such as mistune does not see two separate emphasis runs in *really *hard ... *quite*. Because *really is left-flanking and the * after quite is right-flanking, it pairs them, swallowing the whole span as one emphasis:
>>> import mistune
>>> mistune.html("This is *really *hard to do, and *quite* tricky.")
'<p>This is <em>really *hard to do, and *quite</em> tricky.</p>'
The literal * before hard and the one before quite end up as text inside the <em>. Document structure is corrupted, not just decorated with stray asterisks.
The list-item case is similar: * spaced both sides * at the start of a paragraph is parsed as a <ul><li> because *<space> is the bullet marker.
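The flanking classification behind this section can be checked with a small sketch. It is a simplified reading of the left-flanking and right-flanking definitions in the CommonMark 0.31.2 spec, not code from trafilatura or mistune. The function name flanking and its helpers are hypothetical, and the extra rules for _ delimiters are not covered.

```python
import re
import unicodedata


def _is_space(ch: str) -> bool:
    # Start and end of line (represented here as "") count as whitespace.
    return ch == "" or ch in ("\t", "\n", "\f", "\r") or unicodedata.category(ch) == "Zs"


def _is_punct(ch: str) -> bool:
    # CommonMark 0.31.2 counts Unicode categories P and S as punctuation.
    return ch != "" and unicodedata.category(ch)[0] in "PS"


def flanking(line: str, start: int, end: int) -> tuple[bool, bool]:
    """Return (left_flanking, right_flanking) for the run line[start:end]."""
    before = line[start - 1] if start > 0 else ""
    after = line[end] if end < len(line) else ""
    left = not _is_space(after) and (
        not _is_punct(after) or _is_space(before) or _is_punct(before)
    )
    right = not _is_space(before) and (
        not _is_punct(before) or _is_space(after) or _is_punct(after)
    )
    return left, right


for line in ("This is *really *hard to do.", "This is *really* hard to do."):
    for m in re.finditer(r"\*+", line):
        print(line, m.start(), flanking(line, m.start(), m.end()))
```

For the actual output line, both runs come out left-flanking only, so neither can close the emphasis. For the expected line, the second run is right-flanking and can close it.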
Source
This HTML is commonly emitted by WYSIWYG editors. It renders identically to the whitespace-outside form, so authors don't notice and CMSs don't normalize it.
Version
- trafilatura 2.0.0 (also reproduced against master at ee1865b)
- Python 3.13 / 3.14