Commit bbd8d38

benchmarks: Add MarkupEver, https://awolverp.github.io/markupever/

MarkupEver is based on the Rust html5ever library; it seems reasonably correct and very fast, so well worth adding to the comparison.

Benchmark results on my system:

Parser         Total (s)   Mean (ms)   Peak (MB)   Delta (MB)
----------------------------------------------------------------------------------------------------
justhtml           4.161       8.323       146.6        101.7
html5lib           6.377      12.753       171.1        117.2   (1.53x slower)
lxml               0.346       0.692        65.0         21.3   (12.03x faster)
bs4                4.325       8.651       135.7         85.3   (1.04x slower)
html.parser        1.565       3.131        52.6          8.2   (2.66x faster)
selectolax         0.219       0.437        68.0         10.5   (19.04x faster)
gumbo              1.194       2.387        70.6         25.4   (3.49x faster)
markupever         0.435       0.870        64.9         21.0   (9.56x faster)

Signed-off-by: Anders Kaseorg <andersk@mit.edu>

1 parent d4c2257 commit bbd8d38
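
The "faster"/"slower" figures in the commit message are ratios of total parse time against the justhtml baseline. A minimal sketch of that arithmetic follows, using the rounded totals from the table; `relative_speed` is a hypothetical helper, not a function from `benchmarks/performance.py`. Because the inputs are already rounded to three decimals, a few results can differ from the reported figures in the last digit.

```python
# Total parse times in seconds, copied from the commit message table.
totals = {
    "justhtml": 4.161,
    "html5lib": 6.377,
    "lxml": 0.346,
    "bs4": 4.325,
    "html.parser": 1.565,
    "selectolax": 0.219,
    "gumbo": 1.194,
    "markupever": 0.435,
}


def relative_speed(total, baseline):
    """Describe `total` relative to `baseline` (hypothetical helper)."""
    if total > baseline:
        return f"{total / baseline:.2f}x slower"
    return f"{baseline / total:.2f}x faster"


baseline = totals["justhtml"]
labels = {
    name: relative_speed(total, baseline)
    for name, total in totals.items()
    if name != "justhtml"
}
for name, label in labels.items():
    print(f"{name:<12} {label}")
```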

5 files changed: 138 additions & 7 deletions

README.md

Lines changed: 3 additions & 2 deletions
@@ -80,12 +80,13 @@ A pure Python HTML5 parser that just works. No C extensions to compile. No syste
 | **Chromium**<br>browser engine |**99%** | 🚀&nbsp;Very&nbsp;Fast ||||
 | **WebKit**<br>browser engine |**98%** | 🚀 Very Fast ||||
 | **Firefox**<br>browser engine |**97%** | 🚀 Very Fast ||||
+| **`markupever`**<br>Python wrapper of Rust-based html5ever |**95%** | 🚀 Very Fast | ✅ CSS selectors | ❌ Needs sanitization | Fast and correct. |
 | **`html5lib`**<br>Pure Python | 🟡 88% | 🐢 Slow | 🟡 XPath (lxml) | 🔴 [Deprecated](https://github.com/html5lib/html5lib-python/issues/443) | Unmaintained. Reference implementation; Correct but quite slow. |
 | **`html5_parser`**<br>Python wrapper of C-based Gumbo | 🟡 84% | 🚀 Very Fast | 🟡 XPath (lxml) | ❌ Needs sanitization | Fast and mostly correct. |
 | **`selectolax`**<br>Python wrapper of C-based Lexbor | 🟡 68% | 🚀 Very Fast | ✅ CSS selectors | ❌ Needs sanitization | Very fast but less compliant. |
+| **`BeautifulSoup`**<br>Pure Python | 🔴 5% (default) | 🐢 Slow | 🟡 Custom API | ❌ Needs sanitization | Wraps `html.parser` (default). Can use lxml or html5lib. |
 | **`html.parser`**<br>Python stdlib | 🔴 4% | ⚡ Fast | ❌ None | ❌ Needs sanitization | Standard library. Chokes on malformed HTML. |
-| **`BeautifulSoup`**<br>Pure Python | 🔴 4% (default) | 🐢 Slow | 🟡 Custom API | ❌ Needs sanitization | Wraps `html.parser` (default). Can use lxml or html5lib. |
-| **`lxml`**<br>Python wrapper of C-based libxml2 | 🔴 1% | 🚀 Very Fast | 🟡 XPath | ❌ Needs sanitization | Fast but not HTML5 compliant. Don't use the old lxml.html.clean module! |
+| **`lxml`**<br>Python wrapper of C-based libxml2 | 🔴 3% | 🚀 Very Fast | 🟡 XPath | ❌ Needs sanitization | Fast but not HTML5 compliant. Don't use the old lxml.html.clean module! |
 
 [1]: Parser compliance scores are from a strict run of the [html5lib-tests](https://github.com/html5lib/html5lib-tests) tree-construction fixtures (1,743 non-script tests). See [docs/correctness.md](docs/correctness.md) for details.

benchmarks/correctness.py

Lines changed: 85 additions & 1 deletion
@@ -17,7 +17,7 @@
 from justhtml.context import FragmentContext
 
 # Available parsers
-PARSERS = ["justhtml", "html5lib", "html5_parser", "lxml", "bs4", "html.parser", "selectolax"]
+PARSERS = ["justhtml", "html5lib", "html5_parser", "lxml", "bs4", "html.parser", "selectolax", "markupever"]
 
 
 def check_parser_available(parser_name):
@@ -58,6 +58,13 @@ def check_parser_available(parser_name):
         try:
             import html5_parser  # noqa: F401
 
+            return True
+        except ImportError:
+            return False
+    if parser_name == "markupever":
+        try:
+            import markupever  # noqa: F401
+
             return True
         except ImportError:
             return False
@@ -409,6 +416,22 @@ def run_test_html5_parser(html, fragment_context, expected, xml_coercion=False,
         return False, "", str(e)
 
 
+def run_test_markupever(html, fragment_context, expected, xml_coercion=False, iframe_srcdoc=False):
+    """Run a single test with MarkupEver."""
+    import markupever
+
+    try:
+        if fragment_context:
+            nodes = markupever.parse(html, markupever.HtmlOptions(full_document=False)).root().first_child.children()
+        else:
+            nodes = [markupever.parse(html).root()]
+        actual = _markupever_to_test_format(nodes)
+        passed = compare_outputs(expected, actual)
+        return passed, actual, None
+    except Exception as e:
+        return False, "", str(e)
+
+
 # =============================================================================
 # Test format conversion helpers
 # =============================================================================
@@ -794,6 +817,66 @@ def walk(node, indent):
     return "\n".join(walk(root, 0))
 
 
+def _markupever_to_test_format(nodes):
+    """Convert MarkupEver DOM to test format."""
+    import markupever
+    import markupever.dom
+
+    def process(node, indent):
+        prefix = " " * indent
+        match node:
+            case markupever.dom.Document():
+                for child in node.children():
+                    yield from process(child, indent)
+            case markupever.dom.Doctype():
+                if node.public_id or node.system_id:
+                    yield f'| <!DOCTYPE {node.name} "{node.public_id}" "{node.system_id}">\n'
+                else:
+                    yield f"| <!DOCTYPE {node.name}>\n"
+            case markupever.dom.Element():
+                if node.name.ns == NS_SVG:
+                    tag_name = f"svg {node.name.local}"
+                elif node.name.ns == NS_MATHML:
+                    tag_name = f"math {node.name.local}"
+                elif node.name.ns == NS_HTML:
+                    tag_name = node.name.local
+                else:
+                    tag_name = f"{node.name.ns} {node.name.local}"
+                yield f"| {prefix}<{tag_name}>\n"
+
+                attrs = []
+                for qual_name, value in zip(node.attrs.keys(), node.attrs.values(), strict=True):
+                    if qual_name.ns == NS_XLINK:
+                        attr_name = f"xlink {qual_name.local}"
+                    elif qual_name.ns == NS_XML:
+                        attr_name = f"xml {qual_name.local}"
+                    elif qual_name.ns == NS_XMLNS:
+                        attr_name = f"xmlns {qual_name.local}"
+                    elif qual_name.ns == "":
+                        attr_name = qual_name.local
+                    else:
+                        attr_name = f"{qual_name.ns} {qual_name.local}"
+                    attrs.append((attr_name, value))
+                for attr_name, value in sorted(attrs):
+                    yield f'| {prefix}  {attr_name}="{value}"\n'
+
+                if node.name.ns == NS_HTML and node.name.local == "template":
+                    yield f"| {prefix}  content\n"
+                    for child in node.children():
+                        yield from process(child, indent + 4)
+                else:
+                    for child in node.children():
+                        yield from process(child, indent + 2)
+            case markupever.dom.Text():
+                yield f'| {prefix}"{node.content}"\n'
+            case markupever.dom.Comment():
+                yield f"| {prefix}<!-- {node.content} -->\n"
+            case _:
+                raise ValueError(f"Unknown node type {type(node)}")
+
+    return "".join(line for node in nodes for line in process(node, 0))
+
+
 # Parser dispatch
 PARSER_RUNNERS = {
     "justhtml": run_test_justhtml,
@@ -803,6 +886,7 @@ def walk(node, indent):
     "bs4": run_test_bs4,
     "html.parser": run_test_html_parser,
     "selectolax": run_test_selectolax,
+    "markupever": run_test_markupever,
 }
 

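The `_markupever_to_test_format` helper above serializes a DOM into the html5lib-tests tree format: each line starts with `| `, nesting adds two spaces per level, and attributes are listed one level below their element in sorted order. The sketch below shows that layout on its own. The `Text` and `Element` dataclasses are stand-ins invented for illustration and are not MarkupEver classes; namespaces, doctypes, comments, and templates are left out.

```python
from dataclasses import dataclass, field


# Stand-in node types for illustration only; these are NOT markupever classes.
@dataclass
class Text:
    content: str


@dataclass
class Element:
    name: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def to_test_format(node, indent=0):
    """Yield html5lib-tests style lines: '| ' plus `indent` spaces of nesting."""
    prefix = " " * indent
    if isinstance(node, Text):
        yield f'| {prefix}"{node.content}"'
        return
    yield f"| {prefix}<{node.name}>"
    # Attributes sit one level (two spaces) below the element, sorted by name.
    for key, value in sorted(node.attrs.items()):
        yield f'| {prefix}  {key}="{value}"'
    for child in node.children:
        yield from to_test_format(child, indent + 2)


tree = Element(
    "p",
    {"id": "x", "class": "a"},
    [Text("hi"), Element("b", children=[Text("!")])],
)
output = "\n".join(to_test_format(tree))
print(output)
```

Sorting the attributes keeps the comparison independent of the order in which a parser happens to store them.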
benchmarks/performance.py

Lines changed: 45 additions & 1 deletion
@@ -567,6 +567,47 @@ def benchmark_gumbo(html_source, iterations=1):
     }
 
 
+def benchmark_markupever(html_source, iterations=1):
+    """Benchmark markupever parser."""
+    try:
+        from markupever import parse
+    except ImportError:
+        return {"error": "markupever not installed (pip install markupever)"}
+    times = []
+    errors = 0
+    total_bytes = 0
+    file_count = 0
+    warmup_done = False
+    for _, html in html_source:
+        if not warmup_done:
+            try:
+                parse(html)
+            except Exception:
+                pass
+            warmup_done = True
+        total_bytes += len(html)
+        file_count += 1
+        for _ in range(iterations):
+            try:
+                start = time.perf_counter()
+                result = parse(html)
+                elapsed = time.perf_counter() - start
+                times.append(elapsed)
+                _ = result.root()
+            except Exception:
+                errors += 1
+    return {
+        "total_time": sum(times),
+        "mean_time": sum(times) / len(times) if times else 0,
+        "min_time": min(times) if times else 0,
+        "max_time": max(times) if times else 0,
+        "errors": errors,
+        "success_count": len(times),
+        "file_count": file_count,
+        "total_bytes": total_bytes,
+    }
+
+
 def _benchmark_worker(bench_fn, html_files, iterations, queue):
     """Worker function to run benchmark in a separate process."""
     try:
try:
@@ -630,6 +671,7 @@ def print_results(results, file_count, iterations=1):
         "html.parser",
         "selectolax",
         "gumbo",
+        "markupever",
     ]
 
     # Combined header
@@ -726,8 +768,9 @@ def main():
             "html.parser",
             "selectolax",
             "gumbo",
+            "markupever",
         ],
-        default=["justhtml", "html5lib", "lxml", "bs4", "html.parser", "selectolax", "gumbo"],
+        default=["justhtml", "html5lib", "lxml", "bs4", "html.parser", "selectolax", "gumbo", "markupever"],
         help="Parsers to benchmark (default: all)",
     )
     # MEMORY: options
@@ -785,6 +828,7 @@ def run_with_memory(bench_fn, html_source_factory, iterations):
         "html.parser": benchmark_html_parser,
         "selectolax": benchmark_selectolax,
         "gumbo": benchmark_gumbo,
+        "markupever": benchmark_markupever,
     }
 
     file_count = 0

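The timing pattern used by `benchmark_markupever` (one untimed warm-up parse, then `time.perf_counter()` around each timed call, with failures counted rather than raised) can be tried without any parser installed. The sketch below is a simplified stand-alone version; `run_benchmark` and `fake_parse` are hypothetical names chosen for illustration and do not appear in `benchmarks/performance.py`.

```python
import time


def fake_parse(html):
    """Stand-in parser (an assumption for this sketch): rejects empty input."""
    if not html:
        raise ValueError("empty document")
    return html.upper()


def run_benchmark(parse, documents, iterations=1):
    """Time `parse` over `documents` after one untimed warm-up call."""
    times = []
    errors = 0
    for index, html in enumerate(documents):
        if index == 0:
            try:
                parse(html)  # warm-up: exclude one-off startup costs from timings
            except Exception:
                pass
        for _ in range(iterations):
            try:
                start = time.perf_counter()
                parse(html)
                times.append(time.perf_counter() - start)
            except Exception:
                errors += 1  # failed parses are counted, never timed
    return {
        "total_time": sum(times),
        "mean_time": sum(times) / len(times) if times else 0,
        "errors": errors,
        "success_count": len(times),
    }


stats = run_benchmark(fake_parse, ["<p>a</p>", "", "<b>b</b>"], iterations=3)
print(stats)
```

Only successful parses contribute to `times`, so the mean reflects completed work and `errors` reports how many attempts were excluded.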
docs/correctness.md

Lines changed: 4 additions & 3 deletions
@@ -58,12 +58,13 @@ We run the same test suite against other Python parsers to compare compliance:
 | Parser | Tests Passed | Compliance | Notes |
 |--------|-------------|------------|-------|
 | **JustHTML** | 1743/1743 | **100%** | Full spec compliance |
+| markupever | 1652/1743 | 95% | Rust-based (html5ever), correct |
 | html5lib | 1538/1743 | 88% | Reference implementation, but incomplete |
 | html5_parser | 1462/1743 | 84% | C-based (Gumbo), mostly correct |
 | selectolax | 1187/1743 | 68% | C-based (Lexbor), fast but less compliant |
-| BeautifulSoup | 78/1743 | 4% | Uses html.parser, not HTML5 compliant |
-| html.parser | 77/1743 | 4% | Python stdlib, basic error recovery only |
-| lxml | 13/1743 | 1% | XML-based, not HTML5 compliant |
+| BeautifulSoup | 79/1743 | 5% | Uses html.parser, not HTML5 compliant |
+| html.parser | 78/1743 | 4% | Python stdlib, basic error recovery only |
+| lxml | 44/1743 | 3% | XML-based, not HTML5 compliant |
 
 *Run `python benchmarks/correctness.py` to reproduce these results.*

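The compliance column in the updated table is the pass count divided by the 1,743 tests, rounded to a whole percent. A short check of that arithmetic, using only the counts from the new rows above (the variable names are illustrative):

```python
TOTAL_TESTS = 1743

# Pass counts from the updated docs/correctness.md table.
passed = {
    "JustHTML": 1743,
    "markupever": 1652,
    "html5lib": 1538,
    "html5_parser": 1462,
    "selectolax": 1187,
    "BeautifulSoup": 79,
    "html.parser": 78,
    "lxml": 44,
}

# Round each pass rate to the nearest whole percent.
compliance = {name: round(100 * count / TOTAL_TESTS) for name, count in passed.items()}
for name, percent in compliance.items():
    print(f"{name:<14} {passed[name]:>4}/{TOTAL_TESTS}  {percent}%")
```

This also shows why BeautifulSoup and html.parser land on different percentages despite differing by one test: 79/1743 is about 4.53% and 78/1743 is about 4.48%, which fall on opposite sides of the rounding boundary.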
pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -21,6 +21,7 @@ benchmark = [
     "beautifulsoup4",
     "selectolax",
     "html5-parser",
+    "markupever",
 ]
 dev = [
     "ruff==0.14.7",
