JustHTML can parse both Unicode strings (str) and raw byte streams (bytes, bytearray, memoryview).
If you pass bytes, JustHTML will sniff and decode the input using the HTML Standard’s encoding rules.
- If html is a str: no sniffing/decoding happens (it’s already decoded).
- If html is bytes-like: JustHTML decodes it into a str before tokenization.
The chosen encoding is exposed as doc.encoding when you use JustHTML(...).
If no encoding information is found, HTML parsing defaults to Windows-1252 (often called “cp1252”). This can be surprising if you expect UTF-8 everywhere, but it’s important for legacy HTML:
- Many older documents were authored as “Latin-1” without an explicit encoding.
- Browsers historically treated this as Windows-1252, not ISO-8859-1.
- Using the same default makes JustHTML behave like browsers on real-world old documents.
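The difference between the two legacy encodings is easy to see with plain Python, independent of JustHTML. The snippet below is a small illustration that uses Python's built-in cp1252 codec as a stand-in for the Windows-1252 default; the sample bytes are made up for the example.

```python
# Bytes 0x93 and 0x94 are curly quotes in Windows-1252,
# but invisible C1 control characters in ISO-8859-1.
raw = b"\x93legacy\x94"

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("iso-8859-1")

print(as_cp1252)        # “legacy”
print(repr(as_latin1))  # '\x93legacy\x94'

# The flip side: UTF-8 bytes with no encoding declaration,
# decoded with the Windows-1252 fallback, turn into mojibake.
utf8_bytes = "café".encode("utf-8")
mojibake = utf8_bytes.decode("cp1252")

print(mojibake)         # cafÃ©
```

The second half shows why UTF-8 documents should declare their encoding (or be parsed with an explicit override) rather than rely on the fallback.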
For byte input, JustHTML follows the standard precedence, highest priority first:
- Transport encoding override (what you pass as encoding=)
- BOM (byte order mark)
- <meta charset=...> / <meta http-equiv=... content=...> in the initial bytes
- Fallback to windows-1252
JustHTML also treats utf-7 labels as unsafe and falls back to windows-1252.
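The precedence above can be sketched in a few lines of standard-library Python. This is a simplified illustration, not JustHTML's implementation: the names pick_encoding and candidate_label, the regular expression standing in for the HTML Standard's meta prescan, the 1024-byte scan window, and the single-entry unsafe-label set are all assumptions made for the example.

```python
import codecs
import re
from typing import Optional

FALLBACK = "windows-1252"
UNSAFE_LABELS = {"utf-7"}  # simplified: only the plain "utf-7" label

# BOMs recognised by this sketch (UTF-8 and UTF-16 only).
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
)

# Rough stand-in for the HTML Standard's <meta> prescan.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)", re.IGNORECASE
)


def candidate_label(data: bytes, override: Optional[str]) -> Optional[str]:
    """Return the first encoding label found, in precedence order."""
    if override:  # 1. transport encoding override
        return override.strip().lower()
    for bom, name in BOMS:  # 2. byte order mark
        if data.startswith(bom):
            return name
    match = META_CHARSET.search(data[:1024])  # 3. <meta> in the initial bytes
    if match:
        return match.group(1).decode("ascii").lower()
    return None


def pick_encoding(data: bytes, override: Optional[str] = None) -> str:
    """Apply the fallback when nothing is found or the label is unsafe."""
    label = candidate_label(data, override)
    if label is None or label in UNSAFE_LABELS:
        return FALLBACK  # 4. fallback
    return label


print(pick_encoding(b"<p>no hints</p>"))                      # windows-1252
print(pick_encoding(codecs.BOM_UTF8 + b"<p>bom</p>"))         # utf-8
print(pick_encoding(b'<meta charset="ISO-8859-2">'))          # iso-8859-2
print(pick_encoding(b'<meta charset="utf-7">'))               # windows-1252
print(pick_encoding(b'<meta charset="koi8-r">', "UTF-8"))     # utf-8
```

A real implementation also has to normalize encoding labels and handle the edge cases of the prescan algorithm, which this sketch leaves out.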
from justhtml import JustHTML
from pathlib import Path
data = Path("page.html").read_bytes()
doc = JustHTML(data)
print(doc.encoding)

If you already know the correct encoding (e.g. from HTTP headers, file metadata, or your application protocol), pass it as encoding=.
from justhtml import JustHTML
from pathlib import Path
data = Path("page.html").read_bytes()
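doc = JustHTML(data, encoding="utf-8")

Alternatively, decode the bytes yourself and pass the resulting str, in which case no sniffing happens:

from justhtml import JustHTML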
from pathlib import Path
data = Path("page.html").read_bytes()
html = data.decode("utf-8", errors="replace")
doc = JustHTML(html)

The streaming API supports the same byte-input behavior:
from justhtml import stream
from pathlib import Path
for event, data in stream(Path("page.html").read_bytes()):
    ...

To override the encoding:
from justhtml import stream
from pathlib import Path
for event, data in stream(Path("page.html").read_bytes(), encoding="utf-8"):
    ...
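The value passed as encoding= usually comes from outside the document. As a minimal sketch, assuming the page arrived over HTTP and you still have the raw Content-Type header value, the standard library can extract the charset parameter. The helper name charset_from_content_type is hypothetical and not part of JustHTML.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # None
```

When the helper returns a value, pass it as encoding=; when it returns None, omit the argument and let the sniffing described above run.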