Sanitizer Bypass (in Markdown)

Summary

to_markdown() does not sufficiently escape text content that looks like HTML. As a result, untrusted input that is safe in to_html() can become raw HTML in Markdown output.

This is not specific to tokenizer raw-text states like <title>, <noscript>, or <plaintext>, although those states can trigger the behavior. The root cause is broader: Markdown text serialization leaves angle brackets unescaped in text nodes.

Details

When converting a parsed document to Markdown, text nodes are escaped for a small set of Markdown metacharacters, but HTML-significant characters such as < and > are preserved. That means content parsed as text, including entity-decoded text or text produced by RCDATA/RAWTEXT-style parsing, can be emitted into Markdown as raw HTML.
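The gap described above can be illustrated with a minimal, self-contained sketch. The helper names `escape_markdown_only` and `escape_markdown_and_html` are hypothetical and are not justhtml's actual functions; the set of Markdown metacharacters is also an assumption chosen for illustration. The first helper mirrors the described behavior (Markdown metacharacters escaped, angle brackets preserved). The second shows one possible way to neutralize HTML-significant characters as well.

```python
import re

# Hypothetical helpers for illustration only; not justhtml's real implementation.
# Assumed "small set" of Markdown metacharacters: \ ` * _ [ ]
_MD_META = re.compile(r"([\\`*_\[\]])")


def escape_markdown_only(text: str) -> str:
    """Backslash-escape Markdown metacharacters, leaving < and > untouched."""
    return _MD_META.sub(r"\\\1", text)


def escape_markdown_and_html(text: str) -> str:
    """Also replace HTML-significant characters with entities before escaping."""
    # Replace & first so the entities produced below are not double-escaped.
    text = text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    return _MD_META.sub(r"\\\1", text)


# Text node content as it looks after the parser has decoded entities.
payload = "<img src=x onerror=alert(1)>"

unsafe = escape_markdown_only(payload)
safe = escape_markdown_and_html(payload)

print(unsafe)  # angle brackets survive, so a Markdown renderer sees raw HTML
print(safe)    # angle brackets become entities and render as literal text
```

Entity replacement is only one option; backslash-escaping `<` is another approach that CommonMark-style renderers accept. Which one fits depends on the Markdown flavor the output is meant for.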

Examples of affected input include:

Text produced by entity decoding, for example input written as &lt;script&gt;...&lt;/script&gt;, which parses to the literal text <script>...</script>
Text inside elements like <title>, <textarea>, <noscript> (when parsed as raw text), and <plaintext>

This is distinct from actual <script> or <style> elements in the DOM. Those are already dropped by default in to_markdown() unless html_passthrough=True.

Proof of Concept

General case

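from justhtml import JustHTML

# The payload is entity-encoded, so the parser yields a text node, not an <img> element.
doc = JustHTML("<p>&lt;img src=x onerror=alert(1)&gt;</p>", fragment=True)

# HTML output: the text is expected to stay escaped and inert.
print(doc.to_html())
print()
# Markdown output: the angle brackets are emitted unescaped, yielding a raw <img> tag.
print(doc.to_markdown())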

Weaknesses

Improper Neutralization of Input During Web Page Generation ('Cross-site Scripting')
