Commit 54cb8a6

Added doc blocks and updated readme
1 parent bdea5f0 commit 54cb8a6

25 files changed

+3367
-283
lines changed

.eslintrc.js

+14-16
@@ -1,17 +1,15 @@
1-
module.exports = {
2-
root: true,
3-
parser: '@typescript-eslint/parser',
4-
plugins: [
5-
'@typescript-eslint'
6-
],
7-
extends: [
8-
'eslint:recommended',
9-
'plugin:@typescript-eslint/recommended'
10-
],
11-
rules: {
12-
'no-bitwise': 'off',
13-
'@typescript-eslint/no-this-alias': 'off',
14-
'@typescript-eslint/no-explicit-any': 'off',
15-
'@typescript-eslint/no-inferrable-types': 'warn'
16-
}
1+
export const root = true;
2+
export const parser = '@typescript-eslint/parser';
3+
export const plugins = [
4+
'@typescript-eslint'
5+
];
6+
export const extendsConfig = [
7+
'eslint:recommended',
8+
'plugin:@typescript-eslint/recommended'
9+
];
10+
export const rules = {
11+
'no-bitwise': 'off',
12+
'@typescript-eslint/no-this-alias': 'off',
13+
'@typescript-eslint/no-explicit-any': 'off',
14+
'@typescript-eslint/no-inferrable-types': 'warn'
1715
};

.github/workflows/ci.yaml

+1-1
@@ -43,7 +43,7 @@ jobs:
4343
run: npm install
4444

4545
- name: Build and test
46-
run: npm run build && npm run test
46+
run: npm run build && npm run coverage
4747

4848
- name: Cache Node.js modules
4949
uses: actions/cache@v2

ENCODING_SPEC.md

+122
@@ -0,0 +1,122 @@
1+
# Custom Encoding Specification for FFI Boundary Crossing
2+
## Overview
3+
This document describes the custom encoding format used for serializing the Tag struct in Rust to a `Vec<u8>` for crossing FFI (Foreign Function Interface) boundaries. This encoding format ensures that the data can be efficiently transferred and reconstructed on the other side of the FFI boundary.
4+
5+
## Tag Struct
6+
The Tag struct contains the following fields:
7+
8+
- open_start: `[u32; 2]`
9+
- open_end: `[u32; 2]`
10+
- close_start: `[u32; 2]`
11+
- close_end: `[u32; 2]`
12+
- self_closing: `bool`
13+
- name: `Vec<u8>`
14+
- attributes: `Vec<Attribute>`
15+
- text_nodes: `Vec<Text>`
16+
### Encoding Format
17+
The encoding format is a binary representation of the Tag struct, with the following layout:
18+
19+
1. ### Header (8 bytes):
20+
- attributes_start: u32 (4 bytes) - The starting byte offset of the attributes section.
21+
- text_nodes_start: u32 (4 bytes) - The starting byte offset of the text nodes section.
22+
23+
2. ### Tag Data:
24+
- open_start: `[u32; 2]` (8 bytes)
25+
- open_end: `[u32; 2]` (8 bytes)
26+
- close_start: `[u32; 2]` (8 bytes)
27+
- close_end: `[u32; 2]` (8 bytes)
28+
- self_closing: `u8` (1 byte)
29+
- name_length: `u32` (4 bytes) - The length of the name field.
30+
- name: `Vec<u8>` (variable length) - The UTF-8 encoded bytes of the tag name.
31+
32+
3. ### Attributes Section:
33+
- attributes_count: `u32` (4 bytes) - The number of attributes.
34+
- For each attribute:
35+
- attribute_length: `u32` (4 bytes) - The length of the encoded attribute.
36+
- attribute_data: `Vec<u8>`(variable length) - The encoded attribute data.
37+
38+
39+
4. ### Text Nodes Section:
40+
41+
- text_nodes_count: `u32` (4 bytes) - The number of text nodes.
42+
- For each text node:
43+
- text_length: `u32` (4 bytes) - The length of the encoded text node.
44+
- text_data: `Vec<u8>` (variable length) - The encoded text node data.
47+
48+
## Encoding Process
49+
The encoding process involves serializing each field of the Tag struct into a `Vec<u8>` in the specified order. The following Rust code demonstrates the encoding process:
50+
```rust
51+
impl Encode<Vec<u8>> for Tag {
52+
#[inline]
53+
fn encode(&self) -> Vec<u8> {
54+
let mut v = vec![0, 0, 0, 0, 0, 0, 0, 0];
55+
let name_bytes = self.name.as_slice();
56+
57+
v.reserve(name_bytes.len() + 37);
58+
// known byte length - 8 bytes per [u32; 2]
59+
v.extend_from_slice(u32_to_u8(&self.open_start));
60+
v.extend_from_slice(u32_to_u8(&self.open_end));
61+
62+
v.extend_from_slice(u32_to_u8(&self.close_start));
63+
v.extend_from_slice(u32_to_u8(&self.close_end));
64+
// bool - 1 byte
65+
v.push(self.self_closing as u8);
66+
// length of the name - 4 bytes
67+
v.extend_from_slice(u32_to_u8(&[self.name.len() as u32]));
68+
// name_bytes.len() bytes
69+
v.extend_from_slice(name_bytes);
70+
71+
// write the starting location for the attributes at bytes 0..4
72+
v.splice(0..4, u32_to_u8(&[v.len() as u32]).to_vec());
73+
// write the number of attributes
74+
v.extend_from_slice(u32_to_u8(&[self.attributes.len() as u32]));
75+
// Encode and write the attributes
76+
for a in &self.attributes {
77+
let mut attr = a.encode();
78+
let len = attr.len();
79+
v.reserve(len + 4);
80+
// write the length of this attribute
81+
v.extend_from_slice(u32_to_u8(&[len as u32]));
82+
v.append(&mut attr);
83+
}
84+
85+
// write the starting location for the text node at bytes 4..8
86+
v.splice(4..8, u32_to_u8(&[v.len() as u32]).to_vec());
87+
// write the number of text nodes
88+
v.extend_from_slice(u32_to_u8(&[self.text_nodes.len() as u32]));
89+
// encode and write the text nodes
90+
for t in &self.text_nodes {
91+
let mut text = t.encode();
92+
let len = text.len();
93+
v.reserve(len + 4);
94+
// write the length of this text node
95+
v.extend_from_slice(u32_to_u8(&[len as u32]));
96+
v.append(&mut text);
97+
}
98+
v
99+
}
100+
}
101+
```
102+
# Decoding Process
103+
The decoding process involves reconstructing the Tag struct from the binary representation. The following steps outline the decoding process:
104+
105+
1. Read the header to get the starting offsets for the attributes and text nodes sections.
106+
2. Read the tag data fields.
107+
3. Read the attributes section using the starting offset.
108+
4. Read the text nodes section using the starting offset.
109+
The decoding process should ensure that the data is read in the same order as it was written during encoding.
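
The four steps above can be sketched on the JavaScript side as follows. This is a minimal illustration, not the project's decoder: the names `DecodedTag`, `readPair`, `readSection` and `decodeTag` are hypothetical, and it assumes `u32` values are stored little-endian (the byte order of WebAssembly linear memory). Attribute and text node payloads are returned as still-encoded byte slices because their inner formats are not described in this document.

```typescript
// Hypothetical decoder sketch for the layout described in this spec.
// Assumption: every u32 is little-endian.
interface DecodedTag {
  openStart: [number, number];
  openEnd: [number, number];
  closeStart: [number, number];
  closeEnd: [number, number];
  selfClosing: boolean;
  name: string;
  attributes: Uint8Array[]; // still-encoded attribute payloads
  textNodes: Uint8Array[]; // still-encoded text node payloads
}

const HEADER_SIZE = 8; // attributes_start (4 bytes) + text_nodes_start (4 bytes)

// Reads one `[u32; 2]` field (8 bytes) starting at `offset`.
function readPair(view: DataView, offset: number): [number, number] {
  return [view.getUint32(offset, true), view.getUint32(offset + 4, true)];
}

// Reads a count-prefixed list of length-prefixed entries starting at `offset`.
function readSection(bytes: Uint8Array, view: DataView, offset: number): Uint8Array[] {
  const count = view.getUint32(offset, true);
  const entries: Uint8Array[] = [];
  let cursor = offset + 4;
  for (let i = 0; i < count; i++) {
    const length = view.getUint32(cursor, true);
    cursor += 4;
    entries.push(bytes.subarray(cursor, cursor + length));
    cursor += length;
  }
  return entries;
}

function decodeTag(bytes: Uint8Array): DecodedTag {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);

  // 1. Header: starting offsets of the two variable-length sections.
  const attributesStart = view.getUint32(0, true);
  const textNodesStart = view.getUint32(4, true);

  // 2. Tag data: four 8-byte positions, a 1-byte bool, then the name.
  const base = HEADER_SIZE;
  const nameLength = view.getUint32(base + 33, true);
  const nameStart = base + 37;
  const name = new TextDecoder().decode(bytes.subarray(nameStart, nameStart + nameLength));

  return {
    openStart: readPair(view, base),
    openEnd: readPair(view, base + 8),
    closeStart: readPair(view, base + 16),
    closeEnd: readPair(view, base + 24),
    selfClosing: bytes[base + 32] !== 0,
    name,
    // 3. and 4. Jump directly to each section using the header offsets.
    attributes: readSection(bytes, view, attributesStart),
    textNodes: readSection(bytes, view, textNodesStart),
  };
}
```

Because the header stores absolute offsets, the two sections can be read independently of each other; the name length never has to be re-derived to find them.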
110+
111+
# Why (serde-)wasm-bindgen Was Not Used
112+
While wasm-bindgen is a powerful tool that facilitates high-level interactions between Rust and JavaScript, it was not used in this project for the following performance-related reasons:
113+
114+
1. **Lazy read of individual fields from a Uint8Array**: The decoding strategy does not require crossing the JS-wasm boundary each time a field is read (which is expensive), nor does it need to construct all field values at once on the JS side. Instead, the encoded data is a fixed structure on both the Rust and JS side and resides in linear memory. The data for each field is at a known address within the Uint8Array and can be read lazily via getters. This means that if you received a Tag from the parser and only need to read the `tag.name`, the `name` is decoded at the time it is read while leaving all other fields encoded. Your CPU overhead is limited to decoding only the fields that are accessed and only at the time they are accessed.
115+
116+
1. **Performance Overhead**: wasm-bindgen introduces significant overhead due to automatic type conversion and memory management. For performance-critical applications, this overhead can impact the overall efficiency of the system. By using a custom encoding format, we can minimize this overhead and achieve better performance.
117+
118+
1. **Fine-Grained Control**: Custom encoding provides fine-grained control over the serialization and deserialization process. This allows for optimizations specific to the application's needs, such as minimizing the size of the encoded data and reducing the number of memory allocations.
119+
120+
1. **Compactness**: The custom encoding format is designed to be compact, reducing memory usage and transmission time. This is particularly important for applications that need to transfer large amounts of data across the FFI boundary.
121+
122+
1. **Avoiding Dependencies**: By not relying on wasm-bindgen, we avoid adding an additional dependency to the project. This can simplify the build process and reduce potential compatibility issues with other tools and libraries.
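
The lazy-read point in the first item can be illustrated with a small getter-based view over the encoded bytes. This is a sketch under stated assumptions, not the library's actual class: `LazyTag` and its members are hypothetical names, the fixed offsets follow the layout in this document (8-byte header, 32 bytes of positions, `self_closing` at byte 40, `name_length` at byte 41, `name` at byte 45), and `u32` values are assumed to be little-endian.

```typescript
// Hypothetical lazy view over an encoded Tag; nothing is decoded in the constructor.
class LazyTag {
  private readonly bytes: Uint8Array;
  private readonly view: DataView;
  private cachedName: string | undefined;

  constructor(bytes: Uint8Array) {
    this.bytes = bytes;
    this.view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  }

  // Single byte at a fixed offset: cheap enough to read on every access.
  get selfClosing(): boolean {
    return this.bytes[40] !== 0;
  }

  // The name is decoded on first access and cached for later reads.
  get name(): string {
    if (this.cachedName === undefined) {
      const length = this.view.getUint32(41, true);
      this.cachedName = new TextDecoder().decode(this.bytes.subarray(45, 45 + length));
    }
    return this.cachedName;
  }

  // Follows the header offset to the attributes section and reads only its count.
  get attributeCount(): number {
    const attributesStart = this.view.getUint32(0, true);
    return this.view.getUint32(attributesStart, true);
  }
}
```

A caller that only reads `tag.name` pays for one `u32` read and one UTF-8 decode; the positions, attributes and text nodes stay encoded until their own getters are used.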

README.md

+20-28
@@ -16,17 +16,16 @@ Suitable for [LSP](https://langserver.org/) implementations, sax-wasm provides l
1616
document for elements, attributes and text node which provides the raw building blocks for linting, transpilation and lexing.
1717

1818
## Benchmarks (Node v22.12.0 / 2.7 GHz Quad-Core Intel Core i7)
19-
All parsers are tested using a large XML document (1 MB) containing a variety of elements and is streamed when supported
20-
by the parser. This attempts to recreate the best real-world use case for parsing XML. Other libraries test benchmarks using a
21-
very small XML fragment such as `<foo bar="baz">quux</foo>` which does not hit all code branches responsible for processing the
22-
document and heavily skews the results in their favor.
23-
24-
| Parser with Advanced Features | time/ms (lower is better) | JS | Runs in browser |
25-
|--------------------------------------------------------------------------------------------|--------------------------:|:------:|:---------------:|
26-
| [sax-wasm](https://github.com/justinwilaby/sax-wasm) | 19.20 |||
27-
| [sax-js](https://github.com/isaacs/sax-js) | 64.23 ||* |
28-
| [ltx](https://github.com/xmppjs/ltx) | 21.54 |||
29-
| [node-xml](https://github.com/dylang/node-xml) | 87.06 |||
19+
All parsers are tested using a large XML document (3 MB) that contains a variety of elements and is streamed from memory, which removes variations in disk access latency and keeps the benchmark focused on the parser alone. Other libraries benchmark with a very small XML fragment such as `<foo bar="baz">quux</foo>`, which does not hit all code branches responsible for processing the document and heavily skews the results in their favor.
20+
21+
| Parser with Advanced Features | time/ms (lower is better)| JS | Runs in browser |
22+
|--------------------------------------------------------------------------------------------|-------------------------:|:------:|:---------------:|
23+
| [sax-wasm](https://github.com/justinwilaby/sax-wasm) | 0.466 |||
24+
| [saxes](https://github.com/lddubeau/saxes) | 0.868 |||
25+
| [ltx(using Saxes as the parser)](https://github.com/xmppjs/ltx) | 0.881 |||
26+
| [node-xml](https://github.com/dylang/node-xml) | 1.549 |||
27+
| [node-expat](https://github.com/xmppo/node-expat) | 1.551 |||
28+
| [sax-js](https://github.com/isaacs/sax-js) | 1.869 ||* |
3029
<sub>*built for node but *should* run in the browser</sub>
3130

3231
## Installation
@@ -40,25 +39,17 @@ import path from 'path';
4039
import { fileURLToPath } from 'url';
4140
import { SaxEventType, SAXParser } from 'sax-wasm';
4241

43-
// Get the path to the WebAssembly binary and load it
44-
const __filename = fileURLToPath(import.meta.url);
45-
const __dirname = path.dirname(__filename);
46-
const saxPath = path.resolve(__dirname, 'node_modules/sax-wasm/lib/sax-wasm.wasm');
47-
const saxWasmBuffer = fs.readFileSync(saxPath);
42+
const wasmUrl = new URL(import.meta.resolve('sax-wasm/lib/sax-wasm.wasm'));
43+
const saxWasm = await readFile(wasmUrl);
44+
const parser = new SAXParser(SaxEventType.Cdata | SaxEventType.OpenTag);
4845

49-
// Instantiate
50-
const parser = new SAXParser(SaxEventType.Attribute | SaxEventType.OpenTag);
51-
52-
// Instantiate and prepare the wasm for parsing
53-
const ready = await parser.prepareWasm(saxWasmBuffer);
54-
if (ready) {
55-
// stream from a file in the current directory
56-
const readable = fs.createReadStream(path.resolve(__dirname, 'path/to/document.xml'), options);
46+
if (await parser.prepareWasm(saxWasm)) {
47+
const xmlPath = import.meta.resolve('../src/xml.xml');
48+
const readable = createReadStream(new URL(xmlPath));
5749
const webReadable = Readable.toWeb(readable);
58-
5950
for await (const [event, detail] of parser.parse(webReadable.getReader())) {
60-
if (event === SaxEventType.Attribute) {
61-
// process attribute
51+
if (event === SaxEventType.Cdata) {
52+
// process Cdata
6253
} else {
6354
// process open tag
6455
}
@@ -75,7 +66,8 @@ under the hood to load the wasm.
7566
import { SaxEventType, SAXParser } from 'sax-wasm';
7667

7768
// Fetch the WebAssembly binary
78-
const response = fetch('path/to/sax-wasm.wasm');
69+
const wasmUrl = new URL(import.meta.resolve('sax-wasm/lib/sax-wasm.wasm'));
70+
const response = fetch(wasmUrl);
7971

8072
// Instantiate
8173
const parser = new SAXParser(SaxEventType.Attribute | SaxEventType.OpenTag);

jest.config.js

+5-7
@@ -1,7 +1,5 @@
1-
/** @type {import('ts-jest/dist/types').InitialOptionsTsJest} */
2-
module.exports = {
3-
preset: 'ts-jest',
4-
testEnvironment: 'node',
5-
coverageProvider: 'v8',
6-
collectCoverage: true
7-
};
1+
/** @type {import('ts-jest/dist/types').DefaultEsmPreset} */
2+
export const preset = 'ts-jest';
3+
export const testEnvironment = 'node';
4+
export const coverageProvider = 'v8';
5+
export const collectCoverage = true;
