|
| 1 | +# Custom Encoding Specification for FFI Boundary Crossing |
| 2 | +## Overview |
| 3 | +This document describes the custom encoding format used for serializing the Tag struct in Rust to a `Vec<u8>` for crossing FFI (Foreign Function Interface) boundaries. This encoding format ensures that the data can be efficiently transferred and reconstructed on the other side of the FFI boundary. |
| 4 | + |
| 5 | +## Tag Struct |
| 6 | +The Tag struct contains the following fields: |
| 7 | + |
| 8 | +- open_start: `[u32; 2]` |
| 9 | +- open_end: `[u32; 2]` |
| 10 | +- close_start: `[u32; 2]` |
| 11 | +- close_end: `[u32; 2]` |
| 12 | +- self_closing: `bool` |
| 13 | +- name: `Vec<u8>` |
| 14 | +- attributes: `Vec<Attribute>` |
| 15 | +- text_nodes: `Vec<Text>` |
| 16 | +## Encoding Format |
| 17 | +The encoding format is a binary representation of the Tag struct, with the following layout: |
| 18 | + |
| 19 | +1. ### Header (8 bytes): |
| 20 | +    - attributes_start: `u32` (4 bytes) - The byte offset of the attributes section, measured from the start of the encoded buffer. |
| 21 | +    - text_nodes_start: `u32` (4 bytes) - The byte offset of the text nodes section, measured from the start of the encoded buffer. |
| 22 | + |
| 23 | +2. ### Tag Data: |
| 24 | +    - open_start: `[u32; 2]` (8 bytes) |
| 25 | +    - open_end: `[u32; 2]` (8 bytes) |
| 26 | +    - close_start: `[u32; 2]` (8 bytes) |
| 27 | +    - close_end: `[u32; 2]` (8 bytes) |
| 28 | +    - self_closing: `u8` (1 byte) - `1` if the tag is self-closing, `0` otherwise. |
| 29 | +    - name_length: `u32` (4 bytes) - The length of the name field in bytes. |
| 30 | +    - name: `Vec<u8>` (variable length) - The UTF-8 encoded bytes of the tag name. |
| 31 | + |
| 32 | +3. ### Attributes Section: |
| 33 | +    - attributes_count: `u32` (4 bytes) - The number of attributes. |
| 34 | +    - For each attribute: |
| 35 | +        - attribute_length: `u32` (4 bytes) - The length of the encoded attribute in bytes. |
| 36 | +        - attribute_data: `Vec<u8>` (variable length) - The encoded attribute data. |
| 37 | + |
| 38 | + |
| 39 | +4. ### Text Nodes Section: |
| 41 | +    - text_nodes_count: `u32` (4 bytes) - The number of text nodes. |
| 42 | +    - For each text node: |
| 43 | +        - text_length: `u32` (4 bytes) - The length of the encoded text node in bytes. |
| 44 | +        - text_data: `Vec<u8>` (variable length) - The encoded text node data. |
| 45 | +## Encoding Process |
| 46 | +The encoding process serializes each field of the Tag struct into a `Vec<u8>` in the specified order. The following Rust code demonstrates the encoding process: |
| 50 | +```rust |
| 51 | +impl Encode<Vec<u8>> for Tag { |
| 52 | +    #[inline] |
| 53 | +    fn encode(&self) -> Vec<u8> { |
| 54 | +        let mut v = vec![0, 0, 0, 0, 0, 0, 0, 0]; |
| 55 | +        let name_bytes = self.name.as_slice(); |
| 56 | + |
| 57 | +        v.reserve(name_bytes.len() + 37); |
| 58 | +        // known byte length - 8 bytes per [u32; 2] |
| 59 | +        v.extend_from_slice(u32_to_u8(&self.open_start)); |
| 60 | +        v.extend_from_slice(u32_to_u8(&self.open_end)); |
| 61 | + |
| 62 | +        v.extend_from_slice(u32_to_u8(&self.close_start)); |
| 63 | +        v.extend_from_slice(u32_to_u8(&self.close_end)); |
| 64 | +        // bool - 1 byte |
| 65 | +        v.push(self.self_closing as u8); |
| 66 | +        // length of the name - 4 bytes |
| 67 | +        v.extend_from_slice(u32_to_u8(&[self.name.len() as u32])); |
| 68 | +        // name_bytes.len() bytes |
| 69 | +        v.extend_from_slice(name_bytes); |
| 70 | + |
| 71 | +        // write the starting location for the attributes at bytes 0..4 |
| 72 | +        v.splice(0..4, u32_to_u8(&[v.len() as u32]).to_vec()); |
| 73 | +        // write the number of attributes |
| 74 | +        v.extend_from_slice(u32_to_u8(&[self.attributes.len() as u32])); |
| 75 | +        // encode and write the attributes |
| 76 | +        for a in &self.attributes { |
| 77 | +            let mut attr = a.encode(); |
| 78 | +            let len = attr.len(); |
| 79 | +            v.reserve(len + 4); |
| 80 | +            // write the length of this attribute |
| 81 | +            v.extend_from_slice(u32_to_u8(&[len as u32])); |
| 82 | +            v.append(&mut attr); |
| 83 | +        } |
| 84 | + |
| 85 | +        // write the starting location for the text nodes at bytes 4..8 |
| 86 | +        v.splice(4..8, u32_to_u8(&[v.len() as u32]).to_vec()); |
| 87 | +        // write the number of text nodes |
| 88 | +        v.extend_from_slice(u32_to_u8(&[self.text_nodes.len() as u32])); |
| 89 | +        // encode and write the text nodes |
| 90 | +        for t in &self.text_nodes { |
| 91 | +            let mut text = t.encode(); |
| 92 | +            let len = text.len(); |
| 93 | +            v.reserve(len + 4); |
| 94 | +            // write the length of this text node |
| 95 | +            v.extend_from_slice(u32_to_u8(&[len as u32])); |
| 96 | +            v.append(&mut text); |
| 97 | +        } |
| 98 | +        v |
| 99 | +    } |
| 100 | +} |
| 101 | +``` |
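The `u32_to_u8` helper used above is not defined in this document. The sketch below is one plausible shape for it, offered purely as an assumption: it reinterprets a `u32` slice as bytes in the platform's native byte order. The second function, `push_u32_le`, is a hypothetical safe alternative that fixes the byte order to little-endian, which is the order WebAssembly linear memory uses.

```rust
/// Assumed shape of the helper: view a slice of u32 values as raw bytes.
/// The bytes appear in native byte order (little-endian on wasm32).
fn u32_to_u8(words: &[u32]) -> &[u8] {
    // SAFETY: u8 has alignment 1, every byte pattern is a valid u8, the
    // byte length equals the size of the input slice, and the returned
    // borrow shares the input's lifetime.
    unsafe {
        std::slice::from_raw_parts(
            words.as_ptr() as *const u8,
            std::mem::size_of_val(words),
        )
    }
}

/// Hypothetical safe alternative: append one u32 in little-endian order,
/// independent of the host platform.
fn push_u32_le(v: &mut Vec<u8>, n: u32) {
    v.extend_from_slice(&n.to_le_bytes());
}
```

If the real helper is native-endian like this sketch, a decoder on the JS side would need to read each `u32` as little-endian to match what a wasm32 build writes.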
| 102 | +## Decoding Process |
| 103 | +The decoding process involves reconstructing the Tag struct from the binary representation. The following steps outline the decoding process: |
| 104 | + |
| 105 | +1. Read the header to get the starting offsets for the attributes and text nodes sections. |
| 106 | +2. Read the tag data fields. |
| 107 | +3. Read the attributes section using the starting offset. |
| 108 | +4. Read the text nodes section using the starting offset. |
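| 109 | +Within each section, fields must be read in the same order in which they were written during encoding. The two header offsets make it possible to jump straight to the attributes or text nodes section without walking the tag data first. |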
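The steps above can be sketched in Rust as follows. This is a minimal illustration, not the project's decoder: `TagHead`, `read_u32`, `read_pair`, `decode_head`, and `read_section` are hypothetical names, the byte order is assumed to be little-endian, and attributes and text nodes are returned as raw byte slices because their inner encoding is not specified here.

```rust
/// Hypothetical container for the header and tag data fields.
#[derive(Debug, PartialEq)]
struct TagHead {
    attributes_start: u32,
    text_nodes_start: u32,
    open_start: [u32; 2],
    open_end: [u32; 2],
    close_start: [u32; 2],
    close_end: [u32; 2],
    self_closing: bool,
    name: Vec<u8>,
}

/// Read a little-endian u32 at `pos`; None if the buffer is too short.
fn read_u32(buf: &[u8], pos: usize) -> Option<u32> {
    let b = buf.get(pos..pos.checked_add(4)?)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Read two consecutive u32 values starting at `pos`.
fn read_pair(buf: &[u8], pos: usize) -> Option<[u32; 2]> {
    Some([read_u32(buf, pos)?, read_u32(buf, pos.checked_add(4)?)?])
}

/// Steps 1 and 2: read the 8-byte header, then the tag data fields.
fn decode_head(buf: &[u8]) -> Option<TagHead> {
    let attributes_start = read_u32(buf, 0)?;
    let text_nodes_start = read_u32(buf, 4)?;
    let open_start = read_pair(buf, 8)?;
    let open_end = read_pair(buf, 16)?;
    let close_start = read_pair(buf, 24)?;
    let close_end = read_pair(buf, 32)?;
    let self_closing = *buf.get(40)? != 0;
    let name_len = read_u32(buf, 41)? as usize;
    let name = buf.get(45..45usize.checked_add(name_len)?)?.to_vec();
    Some(TagHead {
        attributes_start,
        text_nodes_start,
        open_start,
        open_end,
        close_start,
        close_end,
        self_closing,
        name,
    })
}

/// Steps 3 and 4: split a count-prefixed section into its
/// length-prefixed items, leaving each item's bytes undecoded.
fn read_section(buf: &[u8], start: usize) -> Option<Vec<&[u8]>> {
    let count = read_u32(buf, start)?;
    let mut pos = start.checked_add(4)?;
    let mut items = Vec::new();
    for _ in 0..count {
        let len = read_u32(buf, pos)? as usize;
        pos = pos.checked_add(4)?;
        let end = pos.checked_add(len)?;
        items.push(buf.get(pos..end)?);
        pos = end;
    }
    Some(items)
}
```

A caller would run `decode_head` first and then pass `attributes_start` and `text_nodes_start` to `read_section`. Every read is bounds-checked, so a truncated buffer yields `None` rather than a panic.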
| 110 | + |
| 111 | +## Why (serde-)wasm-bindgen Was Not Used |
| 112 | +While wasm-bindgen is a powerful tool that facilitates high-level interactions between Rust and JavaScript, it was not used in this project for the following performance-related reasons: |
| 113 | + |
| 114 | +1. **Lazy reads of individual fields from a `Uint8Array`**: The decoding strategy neither crosses the JS-Wasm boundary each time a field is read (which is expensive) nor constructs every field value up front on the JS side. Instead, the encoded data has the same fixed layout on both the Rust and JS sides and resides in linear memory. Each field sits at a known offset within the `Uint8Array` and can be read lazily via getters. For example, if you receive a Tag from the parser and only need `tag.name`, the `name` is decoded when it is read and all other fields stay encoded. CPU overhead is limited to decoding the fields that are accessed, at the time they are accessed. |
| 115 | + |
| 116 | +1. **Performance Overhead**: wasm-bindgen can introduce overhead through automatic type conversion and memory management. In performance-critical applications this overhead can reduce the overall efficiency of the system. A custom encoding format keeps it to a minimum. |
| 117 | + |
| 118 | +1. **Fine-Grained Control**: Custom encoding provides fine-grained control over the serialization and deserialization process. This allows for optimizations specific to the application's needs, such as minimizing the size of the encoded data and reducing the number of memory allocations. |
| 119 | + |
| 120 | +1. **Compactness**: The custom encoding format is designed to be compact, reducing memory usage and transmission time. This is particularly important for applications that need to transfer large amounts of data across the FFI boundary. |
| 121 | + |
| 122 | +1. **Avoiding Dependencies**: By not relying on wasm-bindgen, we avoid adding an additional dependency to the project. This can simplify the build process and reduce potential compatibility issues with other tools and libraries. |
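The lazy-read strategy from the first item can be illustrated with a short sketch. The getters described there live on the JS side; the Rust version below mirrors the same idea under stated assumptions: `TagView` and its methods are hypothetical names, the offsets follow the layout in the Encoding Format section, and the byte order is assumed to be little-endian.

```rust
/// Hypothetical borrowed view over an encoded Tag.
/// Constructing it decodes nothing; each getter reads only its own field.
struct TagView<'a> {
    buf: &'a [u8],
}

impl<'a> TagView<'a> {
    // Fixed offsets: 8-byte header plus four 8-byte positions.
    const SELF_CLOSING: usize = 40;
    const NAME_LEN: usize = 41;
    const NAME: usize = 45;

    fn new(buf: &'a [u8]) -> Self {
        TagView { buf }
    }

    /// Read a little-endian u32 at `pos`; None if out of bounds.
    fn u32_at(&self, pos: usize) -> Option<u32> {
        let b = self.buf.get(pos..pos.checked_add(4)?)?;
        Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
    }

    /// Decode only the self_closing flag.
    fn self_closing(&self) -> Option<bool> {
        self.buf.get(Self::SELF_CLOSING).map(|b| *b != 0)
    }

    /// Decode only the name, borrowing it from the buffer without copying.
    fn name(&self) -> Option<&'a [u8]> {
        let len = self.u32_at(Self::NAME_LEN)? as usize;
        self.buf.get(Self::NAME..Self::NAME.checked_add(len)?)
    }

    /// Follow the header offset to read the attribute count,
    /// skipping the tag data entirely.
    fn attribute_count(&self) -> Option<u32> {
        let start = self.u32_at(0)? as usize;
        self.u32_at(start)
    }
}
```

Calling `TagView::new(&bytes).name()` touches only the four length bytes and the name bytes; positions, attributes, and text nodes stay encoded until their own getters are called.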