Commit 59d1f07

Add Aligned Typed Arrays Proposal and Framing Header Proposal
1 parent 5536701 commit 59d1f07

2 files changed

Lines changed: 270 additions & 0 deletions

Lines changed: 208 additions & 0 deletions
@@ -0,0 +1,208 @@
# BEVE Proposal: Aligned Typed Arrays for Zero-Copy Access

**Status:** Working Draft

## Motivation

BEVE's typed arrays store contiguous numerical data (floats, integers, etc.) in a compact layout that is already close to the native in-memory representation. However, the current specification does not guarantee that the data payload of a typed array begins at a memory address that satisfies the alignment requirement of the element type. Without alignment, a decoder must copy the data into a suitably aligned buffer before it can be reinterpreted as a native span of `float`, `double`, `int32_t`, etc.

On modern hardware, unaligned access is either a performance penalty or an outright fault. By introducing optional alignment padding, a decoder that holds the entire BEVE message in a contiguous, aligned buffer can hand back a `std::span<T>` (or equivalent) that points directly into the message buffer — **zero copies, zero allocations**.

### Design Goals

1. **Zero-copy typed arrays** — typed array data can be reinterpreted in-place as `span<T>` where `T` is the element type.
2. **Deterministic padding** — the padding length is computable from the element type and the current byte offset; it does not need to be stored explicitly.
3. **Contiguous memory requirement** — the entire BEVE message from its start up to and including any aligned typed array must reside in a single contiguous buffer.
4. **Composability** — any extension that embeds a typed array (matrices, complex numbers, timestamps) gains zero-copy support automatically.
5. **Backward compatibility** — standard BEVE decoders that predate this proposal will encounter a clean failure (unknown sub-type), not silent misinterpretation.

## Byte Offset Origin

**Byte 0 is the first byte of the message buffer.** All offset calculations for alignment padding are relative to this origin. If a framing header (Extension 5) is present, it occupies bytes 0–1 and the root value begins at byte offset 2. If no framing header is present, the root value begins at byte offset 0.

Both encoder and decoder inherently track their position from the start of the message buffer, so they always agree on byte offsets regardless of whether a framing header is present.

## Buffer Alignment Requirement

For zero-copy access, the memory buffer that holds the BEVE message **must** be aligned to at least the maximum alignment of any typed array element in the message. In practice, general-purpose memory allocators on 64-bit systems typically return memory aligned to at least 16 bytes, which covers all standard types up to `int128_t` / `float128_t`.

If the buffer address is aligned to `A`, where `A` is a multiple of `alignof(T)`, and the data payload of a typed array begins at byte offset `O` where `O % alignof(T) == 0`, then the absolute address of the payload is aligned to `alignof(T)`.
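
As a rough illustration of the rule above, the following C++ sketch checks whether an address inside a buffer satisfies a given element alignment. The helper name `is_aligned` and the 16-byte buffer alignment are assumptions chosen for the example; neither is defined by BEVE.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of BEVE): true if `p` is a multiple of
// `alignment`, which is assumed to be a nonzero power of two.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// The buffer base is aligned to A = 16, so any offset O with O % 8 == 0
// yields an address suitable for 8-byte elements, while offset 3 does not.
inline bool example_payload_is_aligned() {
    alignas(16) static unsigned char buffer[32] = {};
    return is_aligned(buffer, 16) && is_aligned(buffer + 8, 8) &&
           !is_aligned(buffer + 3, 8);
}
```

A decoder could run a check of this kind once on the buffer base before attempting any zero-copy access, and fall back to copying if it fails.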

## Aligned Typed Arrays — Built Into the Typed Array Tag

Rather than consuming an extension ID, aligned typed arrays are encoded as a new sub-type within the existing typed array category 3 (boolean/string). This approach means that any BEVE extension that embeds a typed array — matrices, complex numbers, timestamps — gains zero-copy alignment support automatically, with no changes to those extensions.

### Background: Typed Array Category 3

In the current specification, typed array category 3 (bits 3–4 = `11`) uses bit 5 to distinguish between two sub-types:

```
0 -> boolean  0b00'0'11'100
1 -> string   0b00'1'11'100
```

Bits 6–7 are unused and must be zero. This proposal defines a third sub-type.

### Sub-Type 2: Aligned Numeric Array

When bits 5–7 of a typed array header encode the value `2` (bit 6 set, bits 5 and 7 clear), the typed array is an **aligned numeric array**:

```
2 -> aligned  0b010'11'100 → 0x5C
```

The next byte is a **numeric typed array header** — identical to a standard BEVE typed array header for a numeric type. This second header byte encodes the element category (floating point, signed integer, or unsigned integer) and the byte count, using the same bit layout as a normal typed array header byte. The decoder already knows how to parse this; it simply reads it from the second byte instead of the first.
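
The two header bytes can be derived mechanically from the bit layout described above. The sketch below is only an illustration: `typed_array_header` and the constant names are hypothetical, and the layout (type tag in bits 0–2, category in bits 3–4, byte-count index or sub-type in bits 5–7) is taken from the text. Note that the bit pattern `0b010'11'100` evaluates to `0x5C`, while `0x3C` is the pattern with only bit 5 set.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: compose a typed array header byte.
//   bits 0-2: type tag, 0b100 for typed arrays
//   bits 3-4: category (0 = float, 1 = signed, 2 = unsigned, 3 = bool/string)
//   bits 5-7: byte-count index for numeric categories, sub-type for category 3
constexpr std::uint8_t typed_array_header(unsigned category, unsigned upper) {
    return static_cast<std::uint8_t>((upper << 5) | (category << 3) | 0b100u);
}

constexpr std::uint8_t boolean_array_header = typed_array_header(3, 0);
constexpr std::uint8_t string_array_header  = typed_array_header(3, 1);
constexpr std::uint8_t aligned_array_header = typed_array_header(3, 2);
constexpr std::uint8_t float64_array_header = typed_array_header(0, 3);

static_assert(aligned_array_header == 0b010'11'100, "sub-type 2 sets bit 6 only");
static_assert(aligned_array_header == 0x5C, "hex value of the aligned header");
static_assert(float64_array_header == 0x64, "byte-count index 3 means 8 bytes");
```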

### Layout

```
TYPED_ARRAY_HEADER(aligned) | NUMERIC_HEADER | SIZE | PADDING | DATA
```

Where:

- `TYPED_ARRAY_HEADER(aligned)` — 1 byte (`0x5C`), a typed array header with category 3, sub-type 2, indicating an aligned numeric array.
- `NUMERIC_HEADER` — 1 byte, a standard typed array header encoding the element category (bits 3–4: 0=float, 1=signed, 2=unsigned) and byte count (bits 5–7). Bits 0–2 **must** be `0b100` (the typed array type tag); decoders **must** reject the message if they are not. This is the same byte you would write for a non-aligned typed array of the same element type.
- `SIZE` — a compressed unsigned integer giving the number of elements (same semantics as standard typed arrays).
- `PADDING` — 0 to `(alignment - 1)` bytes, inserted so that the first byte of `DATA` falls at a byte offset from the message origin that is a multiple of the element alignment. The contents of padding bytes are unspecified; decoders **must** ignore them.
- `DATA` — the raw element data, identical to a standard typed array payload.

### Alignment Calculation

Given:

- `offset_after_size` — the byte offset (from byte 0 of the message buffer) of the first byte after the `SIZE` field.
- `alignment` — the natural alignment of the element type in bytes (equal to the element size for all standard numeric types).

The number of padding bytes is:

```
padding = (alignment - (offset_after_size % alignment)) % alignment
```

This value is deterministic. The encoder inserts exactly this many bytes; the decoder computes the same value and skips them.
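
The formula translates directly into code. The following is a minimal sketch; the function name `padding_for` is hypothetical, and `alignment` is assumed to be nonzero.

```cpp
#include <cassert>
#include <cstddef>

// Number of padding bytes needed so that DATA starts at a multiple of
// `alignment`, measured from byte 0 of the message buffer.
// `alignment` must be nonzero.
constexpr std::size_t padding_for(std::size_t offset_after_size,
                                  std::size_t alignment) {
    return (alignment - (offset_after_size % alignment)) % alignment;
}

static_assert(padding_for(3, 8) == 5, "offset 3, 8-byte elements");
static_assert(padding_for(8, 8) == 0, "already aligned: no padding");
static_assert(padding_for(5, 4) == 3, "offset 5, 4-byte elements");
static_assert(padding_for(7, 1) == 0, "1-byte elements never need padding");
```

The outer `% alignment` is what maps the already-aligned case to 0 rather than to a full `alignment` bytes of padding.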

### Alignment Values by Element Type

| Element Type | Element Size | Required Alignment |
|---|---|---|
| `bfloat16_t` | 2 | 2 |
| `float16_t` | 2 | 2 |
| `float32_t` | 4 | 4 |
| `float64_t` | 8 | 8 |
| `float128_t` | 16 | 16 |
| `int8_t` / `uint8_t` | 1 | 1 (no padding needed) |
| `int16_t` / `uint16_t` | 2 | 2 |
| `int32_t` / `uint32_t` | 4 | 4 |
| `int64_t` / `uint64_t` | 8 | 8 |
| `int128_t` / `uint128_t` | 16 | 16 |

Note: 1-byte element types trivially satisfy alignment and never require padding. Implementations may use standard typed arrays for single-byte elements, as there is no alignment benefit.

### Restrictions

- The entire message from byte 0 through the end of the aligned typed array's `DATA` **must** reside in contiguous memory.
- The `NUMERIC_HEADER` **must** encode a numeric type (category 0, 1, or 2). Encoders **must not** write an aligned typed array with a boolean or string header. Decoders **must** reject such combinations. Boolean arrays are bit-packed and string arrays have variable-length elements, so alignment is not meaningful for these types.

## Decoding Procedure

1. **Begin at byte offset 0 of the message buffer.** If a framing header is present, decode it and advance the offset accordingly. Track the current byte offset throughout decoding.
2. **Decode the root VALUE normally**, tracking the current byte offset at each point.
3. **Upon encountering a typed array with category 3, sub-type 2 (aligned):**
   a. Read the `NUMERIC_HEADER` byte to determine element type and size.
   b. Read the `SIZE` compressed unsigned integer to get the element count.
   c. Record `offset_after_size` — the current byte offset.
   d. Compute `padding = (alignment - (offset_after_size % alignment)) % alignment`.
   e. Skip `padding` bytes.
   f. The next `element_count * element_size` bytes are the data payload, **already aligned**. Return a pointer/span directly into the buffer.
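
Step 3 can be sketched in C++ roughly as follows. This is an illustration under stated assumptions, not a reference decoder: `AlignedArrayView`, `read_compressed`, and `decode_aligned` are hypothetical names; the compressed unsigned integer is assumed to use the low two bits of its first byte to select a width of 1, 2, 4, or 8 bytes (little-endian) with the value in the remaining bits; and the element size is assumed to be `1 << byte_count_index`. The sketch returns the offset of `DATA` rather than a typed pointer, leaving the reinterpretation to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical result type: describes where DATA lives inside the message.
struct AlignedArrayView {
    std::size_t data_offset;    // offset of DATA from byte 0 of the message
    std::size_t element_count;
    std::size_t element_size;   // bytes per element; also used as alignment
};

// ASSUMPTION: the low 2 bits of the first byte select a total width of
// 1, 2, 4, or 8 bytes (little-endian); the value is the remaining bits.
inline std::optional<std::uint64_t> read_compressed(const std::uint8_t* buf,
                                                    std::size_t len,
                                                    std::size_t& pos) {
    if (pos >= len) return std::nullopt;
    const std::size_t width = std::size_t{1} << (buf[pos] & 0b11);
    if (len - pos < width) return std::nullopt;
    std::uint64_t raw = 0;
    for (std::size_t i = 0; i < width; ++i)
        raw |= static_cast<std::uint64_t>(buf[pos + i]) << (8 * i);
    pos += width;
    return raw >> 2;
}

// `pos` is the offset (from byte 0) of the aligned typed array header byte.
inline std::optional<AlignedArrayView> decode_aligned(const std::uint8_t* buf,
                                                      std::size_t len,
                                                      std::size_t pos) {
    if (pos > len || len - pos < 2) return std::nullopt;
    if (buf[pos] != 0b010'11'100) return std::nullopt;      // not sub-type 2
    const std::uint8_t numeric = buf[pos + 1];              // step a
    if ((numeric & 0b111) != 0b100) return std::nullopt;    // not a typed array
    if (((numeric >> 3) & 0b11) == 3) return std::nullopt;  // boolean/string
    const std::size_t element_size = std::size_t{1} << (numeric >> 5);
    pos += 2;
    const auto count = read_compressed(buf, len, pos);      // step b
    if (!count) return std::nullopt;
    const std::size_t padding =                             // steps c and d
        (element_size - (pos % element_size)) % element_size;
    if (padding > len - pos) return std::nullopt;           // padding in bounds
    pos += padding;                                         // step e
    // Overflow-safe check that count * element_size bytes remain (step f).
    if (*count > (len - pos) / element_size) return std::nullopt;
    return AlignedArrayView{pos, static_cast<std::size_t>(*count), element_size};
}
```

The bounds checks cover both the padding and the payload, and the division form avoids overflow when multiplying an untrusted element count by the element size.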

## Encoding Procedure

1. **Begin at byte offset 0 of the message buffer.** If writing a framing header, do so first and advance the offset accordingly. Track byte offsets throughout encoding.
2. **Encode the root VALUE normally**, tracking offsets.
3. **When encoding a typed array that should be aligned:**
   a. Write the `TYPED_ARRAY_HEADER(aligned)` byte (`0x5C`).
   b. Write the `NUMERIC_HEADER` byte (same as a standard numeric typed array header).
   c. Write the `SIZE` compressed unsigned integer.
   d. Compute `padding = (alignment - (current_offset % alignment)) % alignment`.
   e. Write `padding` bytes (contents are unspecified; zero is conventional).
   f. Write the raw element data.
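
The encoding steps might look like the sketch below for a `float64_t` array. It makes several simplifying assumptions that are not part of the proposal: the function name `append_aligned_f64` is hypothetical, `out` already holds the message from byte 0 (so `out.size()` is the current offset), the element count fits in a 1-byte compressed unsigned integer (fewer than 64 elements), and the host is little-endian so the raw bytes of each `double` can be copied directly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Appends an aligned float64 typed array to `out`.
// ASSUMPTIONS: `out` contains the message from byte 0; count < 64 so SIZE
// fits in one byte; the host is little-endian. Returns false otherwise.
inline bool append_aligned_f64(std::vector<std::uint8_t>& out,
                               const double* values, std::size_t count) {
    if (count >= 64) return false;                        // sketch limitation
    out.push_back(0b010'11'100);                          // step a: 0x5C
    out.push_back(0b011'00'100);                          // step b: 0x64, float64
    out.push_back(static_cast<std::uint8_t>(count << 2)); // step c: 1-byte SIZE
    const std::size_t alignment = sizeof(double);
    const std::size_t padding =                           // step d
        (alignment - (out.size() % alignment)) % alignment;
    out.insert(out.end(), padding, std::uint8_t{0});      // step e: zero padding
    const std::size_t start = out.size();
    out.resize(start + count * sizeof(double));           // step f
    if (count != 0)
        std::memcpy(out.data() + start, values, count * sizeof(double));
    return true;
}
```

When the same three-element array follows a 2-byte framing header, the headers and size occupy offsets 2 through 4, the padding shrinks to 3 bytes, and the data still begins at offset 8.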

## Worked Example

Consider encoding a message containing a single aligned `float64_t` typed array with 3 elements: `[1.0, 2.0, 3.0]`, without a framing header.

```
Offset  Bytes            Description
------  -----            -----------
0       5C               TYPED_ARRAY_HEADER: aligned typed array
                         (0b010'11'100: category=3, sub-type=2=aligned)
1       64               NUMERIC_HEADER: float64 typed array
                         (0b011'00'100: byte_count=3→8 bytes, float, typed array)
2       0C               SIZE: 3 elements (3 << 2 | 0 = 0x0C, 1-byte compressed uint)
3       xx xx xx xx xx   PADDING: 5 bytes (contents unspecified)
                         (alignment=8, offset_after_size=3, padding=(8-3%8)%8=5)
8       00 00 00 00      DATA[0]: 1.0 as float64 little-endian
        00 00 F0 3F
16      00 00 00 00      DATA[1]: 2.0 as float64 little-endian
        00 00 00 40
24      00 00 00 00      DATA[2]: 3.0 as float64 little-endian
        00 00 08 40
------
Total: 32 bytes
```

The data begins at offset 8, which is a multiple of 8 (`alignof(float64_t)`). If the buffer itself is 8-byte aligned, the decoder can return a `span<double>` pointing at buffer offset 8 with no copy.

## Composability with Existing Extensions

Because alignment is a property of the typed array itself, every extension that embeds a typed array benefits automatically.

### Matrices (Extension 2)

A matrix stores its data as a typed array. By using an aligned typed array as the inner `VALUE`, the matrix data payload is automatically aligned:

```
EXT(2) | MATRIX_HEADER | EXTENTS | ALIGNED_TYPED_ARRAY
```

No changes to the matrix extension are required.

### Complex Numbers (Extension 3)

Complex arrays store pairs of numerical values in a typed array. Using an aligned typed array as the inner data automatically aligns the complex data:

```
EXT(3) | COMPLEX_HEADER | SIZE | ALIGNED_TYPED_ARRAY_DATA
```

No changes to the complex number extension are required.

## Nested / Multiple Aligned Arrays

A message may contain multiple aligned typed arrays (for example, as values in an object). Each one computes its own padding independently based on its offset from byte 0. The contiguous-memory requirement applies to the entire message.

Because the headers, sizes, and keys between typed arrays will vary in length, each aligned typed array may have a different amount of padding. This is expected and correct.

## Impact on Message Size

An aligned typed array uses one extra byte compared to a standard typed array (the additional `NUMERIC_HEADER` byte), plus at most `alignment - 1` bytes of padding. For typical payloads containing large arrays, this overhead is negligible. For messages with many small aligned arrays, the overhead could be more significant. Implementations should consider using standard (unaligned) typed arrays for small arrays where the copy cost is trivial.

As a guideline: the copy cost of re-aligning `N` bytes is roughly proportional to `N`, while the padding overhead is bounded by a constant. For arrays larger than a few cache lines (e.g., >64 bytes of data), alignment padding is almost always worthwhile.

## Backward Compatibility

- Decoders that predate this proposal will encounter typed array category 3 with an unrecognized sub-type value of 2. This is a clean failure — the decoder knows it is dealing with a typed array but does not recognize the sub-type. This is no worse than an unknown extension ID, and arguably better since the context is preserved.

## Security Considerations

- Padding bytes are unspecified and **must** be ignored by decoders. Because zeros are valid data in a binary format, requiring zero-padding provides no security benefit and adds unnecessary verification cost in the decode path.
- Decoders **must** validate that the computed padding does not extend beyond the message buffer.

## Summary

This proposal adds zero-copy typed array support to BEVE through a new sub-type within the existing typed array tag:

**Aligned Typed Array** (typed array category 3, sub-type 2): uses a second header byte to encode the numeric element type, followed by the element count, deterministic padding, and the data payload. Because alignment lives within the typed array tag itself, every extension that embeds a typed array — matrices, complex numbers, timestamps — gains zero-copy support automatically with no modifications.

This allows decoders to return direct pointers into the message buffer as typed spans, eliminating copy and allocation overhead for large numerical arrays — a critical optimization for scientific computing, real-time data processing, and high-throughput serialization pipelines.
Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
# BEVE Proposal: Framing Header

**Status:** Working Draft

## Motivation

BEVE messages currently begin directly with the root value. While this is compact, it means there is no in-band mechanism to:

1. **Identify** a byte sequence as a BEVE message (format identification for files and network protocols).
2. **Version** the specification, allowing decoders to reject messages from unsupported future versions.

This proposal introduces a lightweight, optional framing header as a BEVE extension.

### Design Goals

1. **Format identification** — a BEVE message can be unambiguously identified by its leading bytes.
2. **Version negotiation** — decoders can detect and reject messages from unsupported specification versions.
3. **Minimal overhead** — the header is 2 bytes.
4. **Backward compatibility** — legacy decoders encounter a clean failure (unknown extension), not silent misinterpretation.

## Extension 5 — Framing Header

**Extension ID:** 5

### Layout

```
HEADER(ext=5) | VERSION (1 byte)
```

**Total: 2 bytes before the root VALUE.**

#### HEADER(ext=5)

A single byte using the standard BEVE extension encoding: type bits = `6` (`0b110`), extension ID = `5` (`0b00101`).

```
0b00101'110 → 0x2E
```

Because this is a valid BEVE extension header, there is no collision with any standard BEVE root value type. A legacy decoder encountering this byte will parse it as an extension and either handle it or report an unknown extension — it will never silently misinterpret the message.

A normal BEVE message begins with a data-carrying type (null/boolean, number, string, object, typed array, generic array). A framing extension at position zero is unambiguously a framing header, not a data value.

#### VERSION (1 byte)

A single incrementing version number for the BEVE specification. The initial value is `1` for BEVE 1.0. This allows decoders to reject messages from unsupported future versions.

### Uniqueness

A BEVE message **must** contain at most one framing header, and if present it **must** begin at the first byte of the message (byte offset 0). A decoder **must** reject a message that contains a framing header at any other position.
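
A decoder's handling of the first two bytes could be sketched as follows. The names `Framing` and `read_framing` and the `max_version` parameter are hypothetical; the sketch only inspects byte offset 0 and does not implement the rejection of a framing header found later in the message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Extension ID 5 in bits 3-7, extension type tag 0b110 in bits 0-2.
constexpr std::uint8_t framing_header_byte = (5u << 3) | 0b110u;
static_assert(framing_header_byte == 0x2E, "matches the value in the text");

// Hypothetical result type.
struct Framing {
    bool present;
    std::uint8_t version;     // 0 when no framing header is present
    std::size_t root_offset;  // byte offset at which the root VALUE starts
};

// Returns std::nullopt for an empty buffer, a truncated header, or a version
// newer than `max_version` (an assumed decoder setting, defaulting to 1).
inline std::optional<Framing> read_framing(const std::uint8_t* buf,
                                           std::size_t len,
                                           std::uint8_t max_version = 1) {
    if (len == 0) return std::nullopt;
    if (buf[0] != framing_header_byte) return Framing{false, 0, 0};
    if (len < 2) return std::nullopt;               // VERSION byte is missing
    if (buf[1] > max_version) return std::nullopt;  // unsupported future version
    return Framing{true, buf[1], 2};
}
```

The returned `root_offset` is the value a decoder would carry forward as its starting byte offset, which is what keeps alignment padding calculations consistent with and without a framing header.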

### Backward Compatibility

- Messages **without** the framing header are fully valid BEVE and decode as before.
- The framing header is always optional. A message may include a framing header even if it contains no other extensions or special features. Decoders **must** handle messages both with and without a framing header.
- Decoders that do not understand extension 5 will encounter an unknown extension ID at the root. Per standard extension handling, they should report this rather than silently misparse.
- The framing header is a valid BEVE extension byte, so it cannot be confused with any standard root value type.

## Summary

This proposal adds a 2-byte optional framing header (Extension 5) to BEVE: an extension header byte (`0x2E`) that unambiguously identifies a BEVE message, followed by a version byte. This is useful for files, network protocols, and any context where format identification or version negotiation is needed.
