usfm_onion is a Rust-first USFM engine built around one canonical working model: flat tokens.
It currently provides:
- parsing and exact token round-trip
- source-faithful CST projection
- token-first lint, format, and diff
- semantic exports to USJ, USX, HTML, and VREF
- a typed Rust facade
- a typed wasm-pack wrapper in crates/usfm_onion_wasm
- a shared marker catalog for both Rust and wasm consumers
The design goal is:
- parse once
- operate on tokens explicitly
- never silently normalize content on ingest
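The "never silently normalize" goal can be illustrated with a toy tokenizer. This is a minimal std-only sketch, not the crate's token model: `ToyToken`, `toy_tokenize`, and `toy_reconstruct` are hypothetical names invented for illustration. It only shows the property that concatenating flat tokens reproduces the source byte for byte, including odd whitespace and mixed line endings.

```rust
/// Hypothetical token kinds for illustration only (not the usfm_onion token type).
#[derive(Debug, Clone, PartialEq, Eq)]
enum ToyToken {
    Marker(String),     // backslash plus marker name, e.g. "\\id"
    Whitespace(String), // kept verbatim, never collapsed
    Text(String),
}

impl ToyToken {
    fn as_str(&self) -> &str {
        match self {
            ToyToken::Marker(s) | ToyToken::Whitespace(s) | ToyToken::Text(s) => s.as_str(),
        }
    }
}

/// Split the source into flat tokens without dropping or rewriting any byte.
fn toy_tokenize(source: &str) -> Vec<ToyToken> {
    let mut tokens = Vec::new();
    let mut rest = source;
    while let Some(first) = rest.chars().next() {
        let is_boundary = |c: char| c.is_whitespace() || c == '\\';
        let token_len = if first == '\\' {
            // Marker: the backslash plus everything up to the next boundary.
            rest[1..].find(is_boundary).map_or(rest.len(), |i| i + 1)
        } else if first.is_whitespace() {
            // Whitespace run: everything up to the next non-whitespace char.
            rest.find(|c: char| !c.is_whitespace()).unwrap_or(rest.len())
        } else {
            // Text: everything up to the next whitespace or backslash.
            rest.find(is_boundary).unwrap_or(rest.len())
        };
        let (piece, tail) = rest.split_at(token_len);
        tokens.push(if first == '\\' {
            ToyToken::Marker(piece.to_string())
        } else if first.is_whitespace() {
            ToyToken::Whitespace(piece.to_string())
        } else {
            ToyToken::Text(piece.to_string())
        });
        rest = tail;
    }
    tokens
}

/// Rebuild the exact source by concatenating token text in order.
fn toy_reconstruct(tokens: &[ToyToken]) -> String {
    tokens.iter().map(ToyToken::as_str).collect()
}
```

Because whitespace is a token of its own rather than something the tokenizer discards, any later normalization has to be an explicit operation on the token list.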
The engine overview, architecture notes, walker design, and performance snapshots live in docs/usfm-onion.html — open it in any browser. That document is the canonical reference; this README is a quick orientation only.
use usfm_onion::{FormatOptions, HtmlOptions, LintOptions, Usfm};
let doc = Usfm::from_str("\\id GEN\n\\c 1\n\\p\n\\v 1 In the beginning.");
let parsed = doc.parse();
let issues = doc.lint(LintOptions::default());
let usj = doc.to_usj()?;
let usx = doc.to_usx()?;
let html = doc.to_html(HtmlOptions::default());
let formatted = doc.format(FormatOptions::default());
# Ok::<(), Box<dyn std::error::Error>>(())
If you already have tokens, use the token facade directly:
use usfm_onion::{FormatOptions, TokenStream, parse::parse};
let parsed = parse("\\id GEN\n\\c 1\n\\p\n\\v 1 In the beginning.");
let mut stream = TokenStream::from_tokens(parsed.tokens);
let formatted_copy = stream.format(FormatOptions::default());
stream.format_mut(FormatOptions::default());
assert!(!formatted_copy.is_empty());
parse::parse(source) produces canonical flat tokens plus lightweight analysis.
Use this when you want the exact working representation for:
- lint
- format
- diff
- exact USFM reconstruction
- editor and wasm token flows
cst::parse_cst(source) builds a source-faithful tree over the canonical token stream.
Use this when you want:
- explicit structural nesting
- tree traversal without losing source fidelity
- a tree view that can always flatten back to canonical tokens
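The "flatten back to canonical tokens" property can be sketched with a toy tree. This is an assumption-laden illustration, not the output of cst::parse_cst: `ToyNode`, `marker_of`, `toy_build`, and `toy_flatten` are hypothetical, and the nesting rule (a `\p` token owns everything up to the next `\p`, `\c`, or `\id`) is deliberately simplistic.

```rust
/// Hypothetical tree node for illustration only (not the usfm_onion CST type).
#[derive(Debug, Clone, PartialEq, Eq)]
enum ToyNode {
    Leaf(String),
    Container { open: String, children: Vec<ToyNode> },
}

/// Marker name of a token such as "\\v 1 ", or None for plain text.
fn marker_of(token: &str) -> Option<&str> {
    token.strip_prefix('\\')?.split_whitespace().next()
}

/// Nest tokens under paragraph markers; every token is stored exactly once.
fn toy_build(tokens: &[String]) -> Vec<ToyNode> {
    let mut roots = Vec::new();
    let mut open_paragraph: Option<(String, Vec<ToyNode>)> = None;
    for tok in tokens {
        match marker_of(tok) {
            Some("p") | Some("c") | Some("id") => {
                // A structural marker closes any paragraph that is still open.
                if let Some((open, children)) = open_paragraph.take() {
                    roots.push(ToyNode::Container { open, children });
                }
                if marker_of(tok) == Some("p") {
                    open_paragraph = Some((tok.clone(), Vec::new()));
                } else {
                    roots.push(ToyNode::Leaf(tok.clone()));
                }
            }
            _ => match open_paragraph.as_mut() {
                Some((_, children)) => children.push(ToyNode::Leaf(tok.clone())),
                None => roots.push(ToyNode::Leaf(tok.clone())),
            },
        }
    }
    if let Some((open, children)) = open_paragraph {
        roots.push(ToyNode::Container { open, children });
    }
    roots
}

/// Pre-order walk that emits the tokens in their original order.
fn toy_flatten(nodes: &[ToyNode], out: &mut Vec<String>) {
    for node in nodes {
        match node {
            ToyNode::Leaf(tok) => out.push(tok.clone()),
            ToyNode::Container { open, children } => {
                out.push(open.clone());
                toy_flatten(children, out);
            }
        }
    }
}
```

Since the tree only regroups tokens and never rewrites them, flattening the result of `toy_build` yields the input token list unchanged.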
Lint is token-first and generic over the minimum lint token surface.
Main entrypoints:
use usfm_onion::lint::{lint_tokens, lint_usfm, LintOptions};
Machine-readable lint ids are exposed through LintCode.
Formatting is explicit and opt-in.
- format(...) is pure
- format_mut(...) is explicitly mutating
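The pure/mutating split can be sketched as a pair of functions over a token list. This is a std-only illustration under stated assumptions: `toy_format` and `toy_format_mut` are hypothetical names, and the single rule (a whitespace-only token containing a newline becomes exactly one "\n") is invented for the example and is not one of the crate's formatter rules.

```rust
/// Mutating variant: rewrites the caller's tokens in place.
/// Hypothetical rule: a whitespace-only token that contains a newline
/// is replaced by a single "\n".
fn toy_format_mut(tokens: &mut [String]) {
    for tok in tokens.iter_mut() {
        if tok.chars().all(char::is_whitespace) && tok.contains('\n') {
            *tok = "\n".to_string();
        }
    }
}

/// Pure variant: leaves the input untouched and returns a formatted copy.
fn toy_format(tokens: &[String]) -> Vec<String> {
    let mut copy = tokens.to_vec();
    toy_format_mut(&mut copy);
    copy
}
```

Keeping both forms means a caller that wants to preview changes never risks altering the working tokens, while an editor that has already decided to apply them avoids an extra copy.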
Main entrypoints:
use usfm_onion::format::{format, format_mut, format_usfm, FormatOptions};
Machine-readable formatter rule ids are exposed through FormatRule.
Diff is token-first and SID-block based.
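As a rough picture of block-based diffing, the sketch below groups token text by an id and compares the resulting blocks. It assumes that a SID is a scripture reference such as "GEN 1:2", which this README does not spell out. `ToyChange`, `toy_sid_blocks`, and `toy_diff` are hypothetical names; the crate's own block builder and diff output may look quite different.

```rust
use std::collections::BTreeMap;

/// Hypothetical change kinds, each carrying the SID of the affected block.
#[derive(Debug, PartialEq, Eq)]
enum ToyChange {
    Added(String),
    Removed(String),
    Modified(String),
}

/// Concatenate token text per SID. Input pairs are (sid, token text).
/// Note: BTreeMap orders keys lexicographically, not in document order.
fn toy_sid_blocks(tokens: &[(&str, &str)]) -> BTreeMap<String, String> {
    let mut blocks = BTreeMap::new();
    for (sid, text) in tokens {
        blocks
            .entry(sid.to_string())
            .or_insert_with(String::new)
            .push_str(text);
    }
    blocks
}

/// Compare two block maps and report which SIDs changed.
fn toy_diff(old: &BTreeMap<String, String>, new: &BTreeMap<String, String>) -> Vec<ToyChange> {
    let mut changes = Vec::new();
    for (sid, old_text) in old {
        match new.get(sid) {
            None => changes.push(ToyChange::Removed(sid.clone())),
            Some(new_text) if new_text != old_text => {
                changes.push(ToyChange::Modified(sid.clone()))
            }
            Some(_) => {}
        }
    }
    for sid in new.keys() {
        if !old.contains_key(sid) {
            changes.push(ToyChange::Added(sid.clone()));
        }
    }
    changes
}
```

Comparing per block keeps an edit in one verse from shifting the alignment of every verse after it, which is the usual motivation for keyed diffs over plain line diffs.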
Main entrypoints:
use usfm_onion::diff::{
diff_chapter_token_streams,
diff_usfm_sources,
diff_usfm_sources_by_chapter,
BuildSidBlocksOptions,
};
Available semantic output modules:
- usj
- usx
- html
- vref
Typical direct calls:
use usfm_onion::html::{HtmlOptions, usfm_to_html};
use usfm_onion::usj::usfm_to_usj;
use usfm_onion::usx::usfm_to_usx;
use usfm_onion::vref::usfm_to_vref_map;
The crate exposes a real marker metadata surface instead of only ad hoc helpers.
use usfm_onion::{marker_catalog, marker_info, is_known_marker};
let catalog = marker_catalog();
let p = marker_info("p");
assert!(catalog.contains("p"));
assert!(is_known_marker("p"));
assert_eq!(p.canonical.as_deref(), Some("p"));
Use this when downstream code needs to know:
- whether a marker is valid
- canonical marker identity
- marker category and kind
- note family and note subkind
- inline context
- allowed spec contexts
- default attributes and closing behavior
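To make the kinds of questions above concrete, here is a std-only toy catalog. It is not the crate's marker_catalog: `ToyCategory`, `ToyMarkerInfo`, `toy_catalog`, and `toy_marker_info` are hypothetical, the table holds only a handful of markers, and treating `q` as an alias of `q1` is this sketch's own choice rather than a statement about what the crate reports.

```rust
use std::collections::HashMap;

/// Hypothetical marker categories for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ToyCategory {
    Paragraph,
    Character,
    Note,
    ChapterVerse,
}

/// Hypothetical per-marker metadata record.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ToyMarkerInfo {
    canonical: &'static str, // identity the marker resolves to
    category: ToyCategory,
    needs_close: bool,       // true if a closing marker such as \f* is expected
}

/// Build a tiny lookup table keyed by marker name (without the backslash).
fn toy_catalog() -> HashMap<&'static str, ToyMarkerInfo> {
    let entry = |canonical, category, needs_close| ToyMarkerInfo {
        canonical,
        category,
        needs_close,
    };
    HashMap::from([
        ("p", entry("p", ToyCategory::Paragraph, false)),
        ("q", entry("q1", ToyCategory::Paragraph, false)), // alias, assumed for the sketch
        ("q1", entry("q1", ToyCategory::Paragraph, false)),
        ("v", entry("v", ToyCategory::ChapterVerse, false)),
        ("wj", entry("wj", ToyCategory::Character, true)),
        ("f", entry("f", ToyCategory::Note, true)),
    ])
}

/// Look up a marker; None means the marker is unknown to this toy table.
fn toy_marker_info(
    catalog: &HashMap<&'static str, ToyMarkerInfo>,
    marker: &str,
) -> Option<ToyMarkerInfo> {
    catalog.get(marker).copied()
}
```

A single table like this lets lint, format, and export code ask the same questions of the same data instead of each keeping its own list of marker names.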
The wasm wrapper is in crates/usfm_onion_wasm. All public types are tsify-derived — TypeScript declarations come straight from Rust.
The exposed surface is string-in only at construction; token-in entry points exist for the repeated editor operations (lint, format, diff):
- parse(source) → ParsedUsfm
- ParsedUsfm.tokens(), .lint(), .format(), .diff(), .toUsj(), .toUsx(), .toHtml(), .toVref()
- top-level lintTokens, formatTokens, formatTokensMut, diffTokens for the token-in fast path
- typed exports: LintCode, FormatRule, MarkerInfo, UsfmMarkerCatalog
Build it with the root npm scripts:
npm run build:wasm # bundler + web targets, release
npm run build:wasm:bundler:dev # dev build
npm run check:wasm:web # cargo check against wasm32 target
npm run test:wasm # scripts/test-web-package.mjs against both targets
Two criterion harnesses live in benches/:
- operations: string-vs-tokens matrix on a single book (Luke by default)
- parallelism: serial vs rayon over the full en_ulb corpus
cargo bench --bench operations
cargo bench --bench parallelism
cargo run --release --example bench_report > BENCH_RESULTS.md
Different corpora:
USFM_BENCH_CORPORA=examples.bsb cargo bench --bench operations
USFM_BENCH_CORPORA=all cargo bench --bench parallelism
Snapshots live at BENCH_RESULTS.md (native) and BENCH_RESULTS_WASM.md (browser).