Skip to content

Latest commit

 

History

History
582 lines (428 loc) · 20.4 KB

File metadata and controls

582 lines (428 loc) · 20.4 KB

RustyXML Architecture

A purpose-built Rust NIF for ultra-fast XML parsing in Elixir. Not a wrapper around an existing library—custom-built from the ground up for optimal BEAM integration with full XPath 1.0 support. Drop-in replacement for both SweetXml and Saxy.

Key Innovations

Purpose-Built, Not Wrapped

Unlike projects that wrap existing Rust crates (like quick-xml or roxmltree), RustyXML is designed specifically for Elixir:

  • Direct BEAM term construction — Results go straight to Erlang terms, no intermediate serialization
  • ResourceArc integration — Documents and streaming parser state managed by BEAM's garbage collector
  • Dirty scheduler awareness — All raw-XML parse NIFs run on dirty CPU schedulers
  • Zero-copy where possible — Span-based references into original input, only allocates for entity decoding
  • Structural index — Cache-friendly storage with compact span structs and flat arrays

Unified Architecture

RustyXML v0.2.0 consolidated multiple parsing strategies into a single optimized path: the structural index. A single UnifiedScanner tokenizes input once, dispatching to a ScanHandler trait that builds the appropriate representation:

Path Description Best For
parse/1 + xpath/2 Structural index with XPath General XML processing
stream_tags/3 Bounded-memory streaming Large files (GB+)
sax_parse/1 SAX event collection Event-driven processing

All three paths share the same SIMD-accelerated scanner and well-formedness validation.

Memory Efficiency

  • Structural index — Elements stored as compact span structs (32 bytes each) referencing the original input
  • Zero-copy strings — Tag names, attribute values, and text stored as (offset, length) spans
  • Sub-binary returns — BEAM sub-binaries share memory with the original input
  • Streaming bounded memory — Process 10GB+ files with ~128 KB combined NIF + BEAM peak via zero-copy tokenization and direct BEAM binary encoding
  • mimalloc allocator — High-performance allocator for reduced fragmentation
  • Optional memory tracking — Opt-in profiling with zero overhead when disabled

Validated Correctness

  • 100% W3C/OASIS XML Conformance — All 1089 applicable tests pass (218 valid + 871 not-well-formed rejections), verified individually against the official xmlconf suite
  • 1296+ tests including the full conformance suite, batch accessor clamping, and lazy XPath coverage
  • Cross-path validation — All paths produce consistent output
  • SweetXml compatibility — Verified identical behavior for common API patterns

Quick Start

import RustyXML

xml = """
<catalog>
  <book id="1"><title>Elixir in Action</title><price>45.00</price></book>
  <book id="2"><title>Programming Phoenix</title><price>50.00</price></book>
</catalog>
"""

# Get all books
RustyXML.xpath(xml, ~x"//book"l)

# Get text content
RustyXML.xpath(xml, ~x"//title/text()"s)

# Extract multiple values
RustyXML.xmap(xml, [
  titles: ~x"//title/text()"sl,
  prices: ~x"//price/text()"sl
])

Core Architecture

UnifiedScanner and ScanHandler

The UnifiedScanner is the single entry point for all XML tokenization. It uses memchr-based SIMD scanning to find delimiters, then dispatches events through the ScanHandler trait:

XML Input
   |
   v
UnifiedScanner (memchr SIMD tokenization)
   |
   +---> IndexBuilder (ScanHandler) ---> StructuralIndex ---> XPath
   |
   +---> SaxCollector (ScanHandler) ---> SAX Events
   |
   +---> StreamingParser ---> Complete Elements

The ScanHandler trait:

trait ScanHandler {
    fn start_element(&mut self, name: Span, attrs: &[(Span, Span)], is_empty: bool);
    fn end_element(&mut self, name: Span);
    fn text(&mut self, span: Span, needs_entity_decode: bool);
    fn cdata(&mut self, span: Span);
    fn comment(&mut self, span: Span);
    fn processing_instruction(&mut self, target: Span, data: Option<Span>);
}

Adding a new processing mode requires only implementing the trait—no changes to the scanner.

Structural Index

The structural index is the core document representation. Instead of building a DOM tree with string copies, it stores compact structs that reference byte offsets into the original input:

struct Span {
    offset: u32,
    len: u16,     // 6 bytes total
}

struct IndexElement {      // 32 bytes
    name: Span,
    ns_prefix: Option<Span>,
    parent: u32,
    children: Range<u32>,  // into flat children_data array
    attrs: Range<u32>,     // into flat attrs array
}

struct IndexText {         // 16 bytes
    span: Span,
    parent: u32,
    needs_entity_decode: bool,
}

struct IndexAttribute {    // 12 bytes
    name: Span,
    value: Span,
}

Memory profile for 2.93 MB document:

  • Structural index: 12.8 MB (4.4x input size)
  • Old DOM approach: 30.2 MB (10.3x input size)
  • SweetXml/xmerl: allocated entirely on BEAM heap

The IndexedDocumentView implements the DocumentAccess trait, allowing the XPath engine to evaluate queries on the structural index without any conversion step.

SIMD-Accelerated Scanning

Tag and content boundary detection uses memchr for hardware-accelerated scanning:

use memchr::{memchr, memchr2, memchr3};

// Find next tag start — SIMD accelerated
fn find_tag_start(input: &[u8], pos: usize) -> Option<usize> {
    memchr(b'<', &input[pos..]).map(|i| pos + i)
}

// Content scanning for entities and markup
fn find_content_break(input: &[u8], pos: usize) -> Option<usize> {
    memchr3(b'<', b'&', b']', &input[pos..])
}

SIMD support: SSE2 (x86_64 default), AVX2 (runtime detect), NEON (aarch64), simd128 (wasm)


Parsing

Standard Parse (parse/1)

All parsing flows through the structural index:

doc = RustyXML.parse("<root><item id=\"1\"/></root>")
RustyXML.xpath(doc, ~x"//item/@id"s)
#=> "1"

Best for: Multiple XPath queries on the same document.

Architecture:

  • UnifiedScanner tokenizes input with SIMD-accelerated scanning
  • IndexBuilder collects spans into a StructuralIndex
  • Document wrapped in ResourceArc for BEAM garbage collection
  • XPath queries operate on the structural index via DocumentAccess trait

Direct XPath (xpath/2 with raw XML)

Parse and query in a single call:

RustyXML.xpath("<root><item/></root>", ~x"//item"l)

Best for: Single-query scenarios, avoids persistent document reference.

Streaming Parser (stream_tags/3)

Bounded-memory streaming for large files:

# High-level API
"large_file.xml"
|> RustyXML.stream_tags(:item)
|> Stream.each(fn {:item, item_xml} ->
  name = RustyXML.xpath(item_xml, ~x"./name/text()"s)
  IO.puts("Processing: #{name}")
end)
|> Stream.run()

# Works with Stream.take (no hanging like SweetXml issue #97)
"large_file.xml"
|> RustyXML.stream_tags(:item)
|> Stream.take(10)
|> Enum.to_list()

Best for: Large files (GB+), network streams, memory-constrained environments.

Features:

  • Returns {tag_atom, xml_string} tuples compatible with SweetXml
  • Complete XML elements that can be queried with xpath/2
  • Handles elements split across chunk boundaries
  • Tag filtering emits only matching elements and their children
  • Does NOT hang with Stream.take (fixes SweetXml issue #97)

SAX Parser (sax_parse/1)

Event-based parsing for custom processing:

events = RustyXML.Native.sax_parse(xml)
# Returns list of SAX events: start_element, end_element, text, etc.

Best for: Event-driven processing, custom document handling.

Lazy XPath (xpath_lazy/2)

Keep XPath results in Rust memory, access on-demand:

doc = RustyXML.parse(large_xml)

# Execute query — returns reference, not data
result = RustyXML.Native.xpath_lazy(doc, "//item")

# Access count without building terms (3x faster than regular XPath)
count = RustyXML.Native.result_count(result)

# Batch accessors for multiple items
texts = RustyXML.Native.result_texts(result, 0, 10)
ids = RustyXML.Native.result_attrs(result, "id", 0, 10)

# Extract multiple fields at once
data = RustyXML.Native.result_extract(result, 0, 10, ["id", "category"], true)
#=> [%{:name => "item", :text => "...", "id" => "1", "category" => "cat1"}, ...]

Best for: Large result sets, partial access, count-only queries.

Parallel XPath (xpath_parallel/2)

Execute multiple XPath queries concurrently using Rayon:

doc = RustyXML.parse(large_xml)
results = RustyXML.Native.xpath_parallel(doc, ["//item", "//price", "//title"])

Best for: Batch queries, xmap with many keys.


XPath 1.0 Engine

Full XPath 1.0 implementation with recursive descent parsing:

  • All 13 axes: child, parent, self, attribute, descendant, descendant-or-self, ancestor, ancestor-or-self, following, following-sibling, preceding, preceding-sibling, namespace
  • 27+ functions: position, last, count, local-name, namespace-uri, name, string, concat, starts-with, contains, substring, substring-before, substring-after, string-length, normalize-space, translate, boolean, not, true, false, lang, number, sum, floor, ceiling, round
  • Predicates: Full predicate support with position, boolean, and comparison expressions
  • Operators: Arithmetic (+, -, *, div, mod), comparison (=, !=, <, >, <=, >=), logical (and, or)

Expression Caching

Compiled XPath expressions are cached in an LRU cache (256 entries). Repeated queries skip parsing and compilation entirely.

Fast-Path Predicates

Common predicate patterns are optimized:

  • [@attr='value']PredicateAttrEq (direct attribute lookup)
  • [n]PredicatePosition (index access, no iteration)

Text Extraction Fast Path

For text extraction queries, xpath_text_list extracts text directly from NodeSets without building recursive BEAM element tuples—eliminating the double-walk where tuples were built then discarded.


Project Structure

native/rustyxml/src/
├── lib.rs                 # NIF entry points, memory tracking, mimalloc
├── core/
│   ├── mod.rs             # Re-exports
│   ├── scanner.rs         # SIMD byte scanning (memchr)
│   ├── unified_scanner.rs # UnifiedScanner + ScanHandler trait
│   ├── tokenizer.rs       # State machine tokenizer
│   ├── entities.rs        # Entity decoding with Cow
│   └── attributes.rs      # Attribute parsing
├── index/
│   ├── mod.rs             # Module docs, re-exports
│   ├── structural.rs      # StructuralIndex (main data structure)
│   ├── span.rs            # Span struct (offset, length)
│   ├── element.rs         # IndexElement, IndexText, IndexAttribute
│   ├── builder.rs         # IndexBuilder (ScanHandler impl)
│   └── view.rs            # IndexedDocumentView (DocumentAccess impl)
├── dom/
│   ├── mod.rs             # DocumentAccess trait, validation
│   ├── document.rs        # Document types
│   ├── node.rs            # Node types
│   └── strings.rs         # String utilities
├── xpath/
│   ├── mod.rs             # XPath exports
│   ├── lexer.rs           # XPath tokenizer
│   ├── parser.rs          # Recursive descent parser
│   ├── compiler.rs        # Expression compiler
│   ├── eval.rs            # Evaluation engine
│   ├── axes.rs            # All 13 XPath axes
│   ├── functions.rs       # 27+ XPath 1.0 functions
│   └── value.rs           # XPath value types
├── sax/
│   ├── mod.rs             # SAX module docs
│   ├── events.rs          # CompactSaxEvent types
│   └── collector.rs       # SaxCollector (ScanHandler impl)
├── strategy/
│   ├── mod.rs             # Strategy exports
│   ├── streaming.rs       # Stateful streaming parser
│   └── parallel.rs        # Parallel XPath (DirtyCpu)
├── term.rs                # BEAM term building utilities
└── resource.rs            # ResourceArc wrappers

lib/
├── rusty_xml.ex           # Main module: xpath/2, xmap/2, stream_tags/3, parse_string/4,
│                          #   parse_stream/4, stream_events/2, encode!/2, ~x sigil
├── rusty_xml/
│   ├── native.ex          # NIF bindings (RustlerPrecompiled)
│   ├── streaming.ex       # High-level streaming interface
│   ├── handler.ex         # SAX handler behaviour (= Saxy.Handler)
│   ├── event_transformer.ex # Native event → Saxy event mapping
│   ├── partial.ex         # Incremental SAX parsing (= Saxy.Partial)
│   ├── simple_form.ex     # Tuple tree output (= Saxy.SimpleForm)
│   ├── xml.ex             # Builder DSL (= Saxy.XML)
│   ├── encoder.ex         # XML string encoding
│   └── builder.ex         # Struct→XML protocol (= Saxy.Builder)

Performance Optimizations

Optimization Impact
Structural index (zero-copy spans) 65-70% memory reduction vs old DOM
XPath text fast path 0.74x → 1.44x faster text extraction
XML string serialization 1.39x faster element queries
Complete elements streaming 3.87x faster streaming
Lazy XPath API 3x faster for partial access
XPath expression caching Skip re-parsing repeated queries
Fast-path predicates 23% faster for [@attr='value']
Compile-time atoms Eliminates per-call atom lookup
Direct binary encoding Faster string-to-term conversion
DocumentAccess trait O(1) pre-parsed access
HashSet deduplication O(n^2) → O(n) for node sets

Bypassing BEAM Term Construction

For element queries, building nested Elixir tuples ({:element, name, attrs, children}) is expensive. xpath_query_raw/2 bypasses this by serializing nodes to XML strings in Rust using an iterative approach with an explicit stack.

Lazy XPath

The regular XPath API builds BEAM terms for all results upfront. The lazy API keeps results in Rust memory as Vec<NodeId>:

# Regular API: builds 1000 BEAM tuples immediately
items = RustyXML.xpath(doc, "//item")  # 104ms

# Lazy API: keeps node IDs in Rust, builds terms on-demand
result = RustyXML.Native.xpath_lazy(doc, "//item")  # 31ms
count = RustyXML.Native.result_count(result)  # instant

Zero-Copy with Cow

Entity decoding uses Cow<[u8]> for optimal allocation:

pub fn decode_text(input: &[u8]) -> Cow<'_, [u8]> {
    if memchr(b'&', input).is_none() {
        return Cow::Borrowed(input);  // Zero-copy!
    }
    Cow::Owned(decode_entities(input))
}

Memory Management

mimalloc Allocator

RustyXML uses mimalloc as the default allocator:

#[cfg(feature = "mimalloc")]
#[global_allocator]
static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;

Benefits:

  • 10-20% faster allocation for many small objects
  • Reduced fragmentation
  • No tracking overhead in default configuration

Optional Memory Tracking

For profiling, enable the memory_tracking feature:

# In native/rustyxml/Cargo.toml
[features]
default = ["mimalloc", "memory_tracking"]

When enabled:

  • RustyXML.Native.get_rust_memory/0 — Current allocation
  • RustyXML.Native.get_rust_memory_peak/0 — Peak allocation
  • RustyXML.Native.reset_rust_memory_stats/0 — Reset and get stats

Pre-allocated Vectors

All parsing paths pre-allocate vectors with capacity estimates based on input size, reducing reallocation overhead during parsing.


NIF Safety

The 1ms Rule

NIFs should complete in under 1ms to avoid blocking schedulers.

Approach Used By Description
Dirty Schedulers parse, parse_strict, parse_and_xpath, xpath_with_subspecs, xpath_string_value, sax_parse Runs on dirty CPU scheduler
Chunked Processing streaming_* Returns control between chunks
Stateful Resource streaming_* Lets Elixir control iteration
Fast SIMD all paths Completes quickly via hardware acceleration

Memory Safety

  • Documents wrapped in ResourceArc with automatic cleanup
  • Streaming parsers use Mutex<StreamingParser> for thread safety
  • All allocations tracked when memory_tracking enabled

Panic Safety

RustyXML is designed to never crash the BEAM VM:

  • No .unwrap() in NIF code paths — All fallible operations use proper error handling
  • Pre-defined atoms — Common atoms (ok, error, nil, text, name) created at compile time
  • Graceful mutex handling — Poisoned mutexes return {:error, :mutex_poisoned} tuples

Atom Table Safety

BEAM's atom table has a fixed limit (~1M atoms) and atoms are never garbage collected. RustyXML uses binary keys for user-provided values:

# Safe: predefined atom keys + binary attribute keys
%{:name => "item", :text => "...", "id" => "1", "category" => "cat1"}
Key Type Implementation Safe?
:name, :text, :error Pre-defined atoms Fixed set
User attribute names Binary strings No atom table impact

The ~x Sigil

Modifier Effect Example
s Return as string ~x"//title/text()"s
l Return as list ~x"//item"l
e Decode entities ~x"//content"e
o Optional (nil on missing) ~x"//optional"o
i Cast to integer ~x"//count"i
f Cast to float ~x"//price"f
k Return as keyword list ~x"//item"k

Modifiers can be combined: ~x"//items"slo (string, list, optional)


API Compatibility

RustyXML is a drop-in replacement for both SweetXml and Saxy. Both APIs coexist with no conflicts (different arities and function names).

SweetXml-Compatible

Function Description Status
xpath/2,3 Execute XPath query Complete
xmap/2,3 Extract multiple values Complete
~x sigil XPath with modifiers Complete
stream_tags/2,3 Stream specific tags Complete

Saxy-Compatible

Function / Module Description Status
parse_string/4 SAX parsing with handler Complete
parse_stream/4 Streaming SAX with handler Complete
stream_events/2 Lazy stream of SAX events Complete
encode!/2 XML encoding Complete
RustyXML.Handler Handler behaviour (= Saxy.Handler) Complete
RustyXML.Partial Incremental parsing (= Saxy.Partial) Complete
RustyXML.SimpleForm Tuple tree (= Saxy.SimpleForm) Complete
RustyXML.XML Builder DSL (= Saxy.XML) Complete
RustyXML.Builder Struct→XML protocol (= Saxy.Builder) Complete

Migration

# From SweetXml — just change the import
import RustyXML  # was: import SweetXml

# From Saxy — just change the module name
RustyXML.parse_string(xml, MyHandler, [])  # was: Saxy.parse_string(...)
RustyXML.SimpleForm.parse_string(xml)      # was: Saxy.SimpleForm.parse_string(...)

Benchmark Results

See BENCHMARK.md for detailed performance comparisons.

vs Saxy (fairest comparison — both are properly bounded streaming parsers):

  • SAX parsing: ~1.3-1.8x faster
  • SimpleForm: ~1.3-1.5x faster
  • Streaming memory: comparable (~130 KB vs ~125 KB; varies between runs)

vs SweetXml/xmerl:

  • Parsing: 8-72x faster
  • XPath queries: 1.5-3.7x faster
  • Parse memory: significantly less (different measurement methods; see BENCHMARK.md)
  • Streaming: 16x faster (SweetXml streaming is unbounded due to xmerl accumulator)

Compliance & Validation

See COMPLIANCE.md for full details.

  • W3C/OASIS Conformance Suite — 100% compliance (1089/1089 tests pass)
  • W3C XML 1.0 (Fifth Edition) — Full strict mode validation
  • XPath 1.0 Specification — Full axis and function support (13 axes, 27+ functions)

References