CSS selectors meet streaming XML/HTML parsing. Filter StAX events and build targeted DOMs without loading the entire document.
scah is a high-performance parsing library that bridges the gap between SAX/StAX streaming efficiency and DOM convenience. Instead of loading an entire document into memory or manually tracking parser state, you declare what you want with CSS selectors; the library handles the streaming complexity and builds a targeted DOM containing only your selections.
- Streaming core: Built on StAX; constant memory regardless of document size
- Familiar API: CSS selectors (including combinators like
>,,+(coming soon),~(coming soon)) - Multi-language: Rust core with Python and TypeScript/JavaScript bindings
- Composable queries: Chain selections and nest them with closures for structured querying; not only more efficient than flat filtering, but a fundamentally better pattern for extracting hierarchical data relationships
# Cargo.toml
[dependencies]
scah = "0.0.17"use scah::{Query, Save, parse};
let html = r#"<ul><li><a href="/one">One</a></li><li><a href="/two">Two</a></li></ul>"#;
let queries = &[
Query::all("a[href]", Save::all())
.expect("valid selector")
.build()
];
let store = parse(html, queries);
for a in store.get("a[href]").unwrap() {
let href = a.attribute(&store, "href").unwrap();
let text = a.text_content(&store).unwrap_or_default();
println!("{text}: {href}");
}
// Output:
// One: /one
// Two: /twoInstead of flat filtering, nest queries with closures. Child queries only run within the context of their parent match:
use scah::{Query, Save, parse};
let query = Query::all("main > section", Save::all())
.expect("valid selector")
.then(|section| {
Ok([
section.all("> a[href]", Save::all())?,
section.all("div a", Save::all())?,
])
})
.expect("valid child selectors")
.build();
let queries = [query];
let store = parse(html, &queries);
// Access nested results through parent elements
for section in store.get("main > section").unwrap() {
println!("Section: {}", section.inner_html.unwrap_or(""));
if let Some(links) = section.get(&store, "> a[href]") {
for link in links {
println!("\tDirect link: {}", link.attribute(&store, "href").unwrap());
}
}
}If selectors come from user input, Query::all(...) and Query::first(...) return Result, so malformed selectors surface as SelectorParseError. For fixed selectors in examples or tests, use .expect(...) explicitly if you want panic-on-invalid-selector behavior.
For selectors that are known at compile time, prefer the query! macro. It validates the selector tree during compilation and emits a StaticQuery backed by inline arrays instead of heap-allocated query storage.
use scah::{Save, parse, query};
let html = r#"
<article>
<h1>Title</h1>
<a href="/one">One</a>
<a href="/two">Two</a>
</article>
"#;
let query = query! {
all("article", Save::none()) => {
first("h1", Save::only_text_content()),
all("a[href]", Save::all()),
}
};
let queries = [query];
let store = parse(html, &queries);
let articles = store.get("article").unwrap();
assert_eq!(articles.len(), 1);
for article in articles {
assert_eq!(article.get("a[href]").unwrap().count(), 2);
}Use the runtime builder when you have dynamic sources. Use query! when the selector tree is authored in Rust code and should fail at compile time if it becomes invalid.
Control what data is captured per selector:
| Constructor | inner_html |
text_content |
Use case |
|---|---|---|---|
Save::all() |
Yes | Yes | Full extraction |
Save::only_inner_html() |
Yes | No | Raw markup only |
Save::only_text_content() |
No | Yes | Lightweight text scraping |
Save::none() |
No | No | Structure-only (attributes still saved) |
| Syntax | Example | Status |
|---|---|---|
| Tag name | a, div |
Working |
| ID | #my-id |
Working |
| Class | .my-class |
Working |
| Descendant | main section a |
Working |
| Child | main > section |
Working |
| Attribute presence | a[href] |
Working |
| Attribute exact | a[href="url"] |
Working |
| Attribute prefix | a[href^="https"] |
Working |
| Attribute suffix | a[href$=".com"] |
Working |
| Attribute substring | a[href*="example"] |
Working |
| Adjacent sibling | h1 + p |
Coming soon |
| General sibling | h1 ~ p |
Coming soon |
Full API documentation: docs.rs/scah
The repository includes two Rust benchmark tracks:
- Cross-library comparisons for simple
allandfirstselectors. - Runtime-builder vs
query!macro comparisons to measure query-construction overhead separately from execution.
pip install -U scahfrom scah import Query, Save, parse
query = Query.all("main > section", Save.all())
.then(lambda section: [
section.all("> a[href]", Save.all()),
section.all("div a", Save.all()),
])
.build()
store = parse(html, [query])Real Html BenchMark (html.spec.whatwg.org) (select all a tags):
npm install scah@npm:@zacharymm/scahimport { Query, parse } from 'scah';
const query = Query.all('main > section', { innerHtml: true, textContent: true })
.then((p) => [
p.all('> a[href]', { innerHtml: true, textContent: true }),
p.all('div a', { innerHtml: true, textContent: true }),
])
.build();
const store = parse(html, [query]);









