### Summary
to_markdown() does not sufficiently escape text content that looks like HTML. As a result, untrusted input that is safe in to_html() can become raw HTML in Markdown output.
This is not specific to tokenizer raw-text states like <title>, <noscript>, or <plaintext>, although those states can trigger the behavior. The root cause is broader: Markdown text serialization leaves angle brackets unescaped in text nodes.
### Details
When converting a parsed document to Markdown, text nodes are escaped for a small set of Markdown metacharacters, but HTML-significant characters such as < and > are preserved. That means content parsed as text, including entity-decoded text or text produced by RCDATA/RAWTEXT-style parsing, can be emitted into Markdown as raw HTML.
Examples of affected input include:
- Text produced from entity-decoded input such as &lt;script&gt;...&lt;/script&gt;
- Text inside elements like <title>, <textarea>, <noscript> (when parsed as raw text), and <plaintext>
This is distinct from actual <script> or <style> elements in the DOM. Those are already dropped by default in to_markdown() unless html_passthrough=True.
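The gap described above can be illustrated with a minimal, self-contained sketch. Both functions are hypothetical and are not justhtml's actual implementation; the set of Markdown metacharacters is likewise an assumption chosen for illustration. The first function escapes only Markdown metacharacters, while the second also replaces &, <, and > with character references so that a Markdown renderer cannot interpret the text as raw HTML.

```python
import re

# Hypothetical metacharacter set, chosen for illustration only.
_MD_META = re.compile(r"([\\`*_\[\]])")


def escape_markdown_only(text: str) -> str:
    """Backslash-escape Markdown metacharacters; leaves < and > untouched."""
    return _MD_META.sub(r"\\\1", text)


def escape_markdown_and_html(text: str) -> str:
    """Also neutralise HTML-significant characters in text nodes."""
    # Replace & first so the entities added afterwards are not double-escaped.
    text = text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    return _MD_META.sub(r"\\\1", text)


payload = "<img src=x onerror=alert(1)> *bold*"
print(escape_markdown_only(payload))      # angle brackets survive
print(escape_markdown_and_html(payload))  # angle brackets become entities
```

Whether a real fix should use character references or backslash escapes for angle brackets is a design choice; this sketch only shows that text nodes need some form of HTML-aware escaping in addition to Markdown escaping.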
### Proof of Concept
General case
from justhtml import JustHTML
doc = JustHTML("<p>&lt;img src=x onerror=alert(1)&gt;</p>", fragment=True)
print(doc.to_html())
print()
print(doc.to_markdown())
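Without asserting what justhtml itself prints, the mechanism behind this proof of concept can be reproduced with the Python standard library alone. The sketch below is an illustration under stated assumptions, not justhtml code: html.unescape stands in for the parser's entity decoding, html.escape stands in for an HTML serializer, and the unescaped pass-through stands in for a Markdown serializer that leaves angle brackets alone.

```python
import html

source = "&lt;img src=x onerror=alert(1)&gt;"

# Parsing decodes character references, so the text node holds real angle brackets.
text_node = html.unescape(source)

# An HTML serializer re-escapes text nodes, so the markup stays inert.
as_html = html.escape(text_node, quote=False)

# A Markdown serializer that does not escape < and > emits the text verbatim,
# which a Markdown renderer may then treat as raw HTML.
as_markdown = text_node

print(as_html)      # &lt;img src=x onerror=alert(1)&gt;
print(as_markdown)  # <img src=x onerror=alert(1)>
```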
### References
- https://github.com/EmilStenstrom/justhtml/security/advisories/GHSA-3rcm-vjrc-p45j