---
title: "Interlude: A data-oriented model"
description: A real-world example of using data-oriented design principles in TypeScript.
date: 2025-11-16
authors:
  - name: Aapo Alasuutari
    url: https://github.com/aapoalas
---

Hello again! It hasn't been that long since I [last](./worked-for-the-internet)
blogged, and things are mostly as they were back then. A few meaningful changes
have happened though: first, I am now on a two-week (paid!) vacation, intending
to finish up the NLnet grant project before time runs out. Second, I have
received a negative "your project is not eligible" for the main grant
application that I was secretly banking on, which would've set me on a course to
develop Nova full-time for the next two years.

So, put simply, I am now very temporarily working for the Internet yet again,
after which I will return to Valmet Automation's sweet embrace. As such, it
seems fitting to talk a little about the data-oriented design principles that
underpin much of Nova, and how I've applied those principles in my day job as a
TypeScript developer.

## A trip down memory lane

This is not the
[first time](https://archive.fosdem.org/2025/schedule/event/fosdem-2025-4391-how-to-lose-weight-optimising-memory-usage-in-javascript-and-beyond/)
I've talked about using data-oriented design in TypeScript/JavaScript. In fact,
this is something that I mentioned in the linked talk and which is explicitly
explained in the [talk repository](https://github.com/aapoalas/losing-weight) as
the
["Data Model"](https://github.com/aapoalas/losing-weight/blob/main/src/4_data_model.ts).

The Data Model is described as

> A fundamental directed acyclic graph underpinning the flow of data from the
> automation system to the user interface. Found to often take ~10 MiB a pop.

and it is made up of node objects that contain four properties:

1. `kind`: this determines the semantic meaning of a node.
2. `in`: this determines the input nodes to this node by naming them in an
   `Array`. Both the order and any duplicates are significant here.
3. `out`: this determines the output nodes of this node by naming them in a
   `Set`. Neither the order nor duplicates are significant here.
4. `data`: this property's value depends on the `kind` and contains any extra
   data needed by the runtime semantics of the node.


These nodes are stored in a `Map<NodeName, DataModelNode>` and, in addition,
there exists effectively a `Map<NodeName, unknown>` data storage hash map for
storing the current runtime value of a given node. Updating the Data Model then
means running each node's runtime semantics on its input nodes' current runtime
values, and storing the resulting value as the node's new runtime value.

The four "kinds" of nodes given are `const`, which splits into two (actual
constants and references), `subscription`, and `function`. Their runtime
semantics and their associated extra data are as follows (a sketch of the
update dispatch follows the list):

1. Constant nodes, `kind: "const"`: the node is a constant, has no extra data
   associated with it, and never has any input nodes. Updating a constant simply
   means assigning the new value as the node's new runtime value.
2. Reference nodes, `kind: "const"`: the node is a reference to some other node,
   has no extra data associated with it, and always has exactly one input node.
   Updating the node means reading the only input node's current runtime value
   and assigning it as the reference node's new runtime value.
3. Subscription nodes, `kind: "subscription"`: the node is a subscription into
   the automation network. Its extra data is a collection of parameters and
   options used to affect the subscription's exact semantics, and these nodes
   are known to always have one or two input nodes: the first input node
   contains the subscription address, and the second optional one is a dynamic
   object of parameters. Updating a subscription node means unsubscribing the
   previous subscription address (if non-null), subscribing the new address (as
   given by the first parameter node's runtime value), and setting the
   subscription node's current runtime value to `null`. When the subscription
   from the automation network responds with data, that data is set as the
   subscription node's current runtime value and an update is dispatched to its
   output nodes.
4. Function nodes, `kind: "function"`: the node is a function on its inputs. Its
   extra data is the function name (to be looked up from a function storage
   Map). Updating the node means reading the current runtime values of its input
   nodes, and running an actual JavaScript function with those values as the
   arguments. The result of the function is stored as the new runtime value of
   the function node.

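
To make that dispatch concrete, here's a minimal sketch of what an update step
could look like, assuming a `nodes` map for the node table, a `values` map for
the runtime values, a `functions` map for the function storage, and a
hypothetical `resubscribe` helper; the names and the naive propagation at the
end are illustrative, not the actual production code:

```typescript
type NodeName = string;

declare const nodes: Map<
  NodeName,
  { kind: string; in: NodeName[]; out: Set<NodeName>; data: unknown }
>;
declare const values: Map<NodeName, unknown>;
declare const functions: Map<string, (...args: unknown[]) => unknown>;
// Hypothetical helper: unsubscribes the previous address (if any), subscribes
// the new one; responses later land in `values` and re-dispatch updates.
declare function resubscribe(name: NodeName, address: unknown, options: unknown): void;

function updateNode(name: NodeName): void {
  const node = nodes.get(name)!;
  switch (node.kind) {
    case "const":
      // References are `const` nodes with exactly one input: copy its value.
      // Actual constants get their new value assigned directly by the caller.
      if (node.in.length === 1) values.set(name, values.get(node.in[0]));
      break;
    case "subscription":
      // Resubscribe using the first input's value as the address, and clear
      // the runtime value until the automation network responds.
      resubscribe(name, values.get(node.in[0]), node.data);
      values.set(name, null);
      break;
    case "function": {
      // Look the function up by name and apply it to the inputs' values.
      const fn = functions.get(node.data as string)!;
      values.set(name, fn(...node.in.map((input) => values.get(input))));
      break;
    }
  }
  // Naive propagation: dispatch the update onwards to all output nodes.
  for (const output of node.out) updateNode(output);
}
```
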
The way these nodes are constructed is by, effectively, parsing a
JavaScript-based domain-specific language (DSL) that looks something like this:

```javascript
let tag = "LIC-100";
let address = combineStrings("/plant/", ref("tag"), "/isGood");
let isGood = negate(subscription(ref(address)));
```

The `tag`, `address`, and `isGood` are properties and their values are parsed as
parts of the Data Model. `tag`'s value `"LIC-100"` is parsed as just a constant,
while `address` is parsed as a function node calling a function by the name of
`combineStrings` with three parameters: the first one is a constant parameter
`"/plant/"`, the second is a reference node pointing to the property `tag`, and
the third one is again a constant parameter with value `"/isGood"`. Finally, the
`isGood` property is parsed as a function node calling the function `negate`
with the value of a subscription node that takes as its address a reference node
pointing to the property `address`.

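
For illustration, the `address` property's little node graph could end up
looking something like this; the internal names for the parameter nodes
(`address#0` and friends) are made up here, as the real naming scheme doesn't
matter:

```typescript
// A hedged sketch of the parsed nodes for the `address` property.
const nodes = new Map<
  string,
  { kind: string; in: string[]; out: Set<string>; data: unknown }
>([
  // Constant parameter "/plant/": its value lives in the runtime value storage.
  ["address#0", { kind: "const", in: [], out: new Set(["address"]), data: null }],
  // Reference parameter pointing to the `tag` property.
  ["address#1", { kind: "const", in: ["tag"], out: new Set(["address"]), data: null }],
  // Constant parameter "/isGood".
  ["address#2", { kind: "const", in: [], out: new Set(["address"]), data: null }],
  // The `address` property itself: a function node calling `combineStrings`.
  // In the full graph its `out` would name the subscription node inside `isGood`.
  ["address", {
    kind: "function",
    in: ["address#0", "address#1", "address#2"],
    out: new Set(),
    data: "combineStrings",
  }],
]);
```
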
At this point, I want to ask a question: do you think that the object-based node
structure seems to make sense? Ponder it for a moment: is this the kind of code
that you'd write? Or do you see silliness that you know you'd never commit to?

I am not quite sure myself: by now all of this code was either written or
rewritten by me at some point, although I did inherit the basic structure of it
originally. So obviously I thought this made sense, but I'm not entirely sure if
I would write it anymore. At the very least it's clear to me that there are
issues in this code, though they may not be dealbreakers depending on the
use-case.

## I am altering the deal...

The main issues in the existing implementation become quite clear when we look
at the details. The very first issue is simply the memory usage: in Chrome each
node object took up `(3 + 4) * 4` bytes (3 slots for the object header + 4
inline properties, at 4 bytes per slot), or 28 bytes. Add to that the 16 bytes
needed for each of the `in` `Array` and the `out` `Set` and we're already at 60
bytes, or nearly a full cache line of data for a single node. Add in the `out`
`Set`'s backing memory allocation, which is done even when the `Set` is empty
and takes probably more than a full cache line on its own, and we're probably
easily over two or even three cache lines of data. The total memory usage for an
empty node is probably something around 150 bytes.

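
As a back-of-the-envelope check of those numbers (assuming V8 with pointer
compression, i.e. 4-byte slots; these are estimates, not measurements):

```typescript
// Back-of-the-envelope estimates only; the exact numbers depend on the engine.
const nodeObject = (3 + 4) * 4; // 3 header slots + 4 inline property slots = 28 bytes
const inArray = 16;             // the `in` Array object itself
const outSet = 16;              // the `out` Set object itself
const baseline = nodeObject + inArray + outSet; // = 60 bytes, before backing stores
// The `out` Set's always-present backing store alone adds another cache line or
// more, which is how the ~150 byte per-node estimate comes about.
```
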
But there are structural issues with the nodes as well. First, while nodes
belonging to properties like `tag` or `address` can have references pointing to
them, there is no way for a reference to refer to e.g. the `"/plant/"` constant
parameter "inside" the `address` property's node graph: this means that we know
that all "parameter" nodes must always have exactly one output, which is the
node that they are a parameter of. This makes the outputs `Set` seem quite
ridiculous indeed, with its large backing memory allocation used to store just a
single node name string most of the time. Second, the number of inputs is often
small and statically known (0 for constants, 1 for references, 1 or 2 for
subscriptions); even for functions we know the number of inputs for a given
function node during parsing, so we have no need for a dynamically resizable
container to store the input names. This makes the `in` `Array` seem quite
ridiculous as well.

Third, constant parameter nodes (like the `"/plant/"` string) do not really
serve any purpose: we just want to know that they are constant parameter nodes,
but the node object itself has nothing of value to us. The output is never
needed, as the constant parameter can never change (meaning that we never ask
the question "what is the output of this constant parameter node"), the `in`
`Array` is known to be empty, and no extra data exists for constants. The only
thing we're interested in is the current runtime value of the node, and that is
stored in a separate `Map`.

Fourth, reference parameter nodes do not really serve any purpose either:
instead of creating a separate node whose only purpose is to have an input
pointing to e.g. `tag`, we could just as well remove that entire node and have
the reference node's output (usually a function or subscription node) refer to
that `tag` directly.

The third and fourth issues I had already taken care of ages ago; constant and
reference parameter nodes do not exist in the Data Model at all. The first and
second points I hadn't fully realised yet, but I had plans...

## ... pray I don't alter it any further

I had actually seen some other issues as well. The `kind` field was a huge waste
of memory, taking up an entire JavaScript Value (4 or 8 bytes depending on the
engine) to store what amounted to 2 bits of information (one of 4 options).
Likewise, the extra data for subscription nodes was horrendously inefficient,
storing a set of JavaScript booleans in an object with each boolean fully filled
in with its default value if not explicitly defined in the source DSL, so as to
optimise object shapes. That meant using many tens of bytes to store what
amounted to a few bits of data.

But even had I fixed all of these issues, the reality was still that our Data
Models can get really big, too big. We're talking half a million to a million
nodes per Data Model, and there is no exact limit to how many Data Models a user
can have open at the same time. (Funny story: a particular customer had noticed
a cool trick where they could sort of minimise parts of the UI and then use a
double-click feature to bring it quickly back into view. This meant that they
had tens of large Data Models running simultaneously, as opposed to the expected
count of low single digits. Users are clever!)

At those numbers, just the object headers for a single Data Model's nodes add up
to nearly 6 MiB. My bet for solving this issue was thus not to try to shrink the
JavaScript node objects at all, but to remove them entirely! And this is where
we get to the data-oriented design part of the blog post.

## Lining it all up

The answer to all of this was obviously to take matters into my own hands using
ArrayBuffers and TypedArrays. The `kind` field could easily fit into a
`Uint8Array`, while the others seemed to be begging for a bit of a rethink.

I'm going to skip to the end here, and just tell you what I did: the final
result was that a single Data Model node is an index into three TypedArrays: the
`kindColumn`, the `outputColumn`, and the `payloadColumn`. These three form what
could be called the "node table". Additionally, an `extraDataColumn` exists on
the side that has a length dependent on the contents of the node table. In this
transformation, the number of node `kind`s shot up from 3 (effectively 4) to 7,
and they are now of course number values stored in a `Uint8Array` instead of
strings like before. The `kind`s are as follows (a sketch of them as numeric
constants follows the list):

1. Constant node: same as before.
1. Reference node: same as before, except now with a different `kind` value.
1. Nullary function node: a function taking no parameters.
1. Unary function node: a function taking one parameter.
1. N-ary function node: a function taking two or more parameters.
1. Subscription node: a subscription node with no non-boolean options (`minTime`
   / `maxTime`) or dynamic parameters, i.e. it only has one input node.
1. Parametrised subscription node: a subscription node with some non-boolean
   options or dynamic parameters. This has one or two input nodes.

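
As a sketch, the kinds could be defined as plain numeric constants; the exact
values and names here are illustrative, the only real requirement being that
they fit in the `Uint8Array` `kindColumn`:

```typescript
// Illustrative numeric kind values; any assignment that fits in a Uint8Array works.
const enum NodeKind {
  Constant = 0,
  Reference = 1,
  NullaryFunction = 2,
  UnaryFunction = 3,
  NaryFunction = 4,
  Subscription = 5,
  ParametrisedSubscription = 6,
}
```
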
Each node has an `out` value (stored in the `outputColumn`) which is a relative
offset forwards in the node table pointing to the node's output node. If the
relative offset is 0, then this node is a property node. In that case, the node
has extra data (like the incoming references to this property) available in a
separate "property table" which I'm going to gloss over today.

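
In code, resolving a node's output from that relative offset is a one-liner; the
helper name here is illustrative:

```typescript
type Column = Uint8Array | Uint16Array | Uint32Array;

// Returns the index of the node's output node, or -1 for a property node
// (relative offset 0), whose details live in the separate property table.
function getOutputIndex(outputColumn: Column, i: number): number {
  const offset = outputColumn[i];
  return offset === 0 ? -1 : i + offset;
}
```
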
Finally, the `payload` value of each node (stored in the `payloadColumn`)
depends on the `kind` of the node, but a common theme is that in most cases the
payload is an index into some storage `Array`. They go like this (a decoding
sketch for the trickiest case follows the list):

1. Constant node: the payload is an index into a global array of constant
   values.
1. Reference node: the payload is an index into a global array of property
   names.
1. Nullary and unary function nodes: the payload is an index into a global array
   of function names.
1. N-ary function node: the payload is an index into the local
   `extraDataColumn`. The pointed-to index contains an index into the global
   array of function names, the index after that is the number of inputs this
   function node has, and subsequent indexes after that contain relative offsets
   backwards in the node table pointing to each input node.
1. Subscription node: the payload is a bitset of the boolean options of the
   subscription.
1. Parametrised subscription node: the payload is an index into the local
   `extraDataColumn`. The pointed-to index contains the bitset of boolean
   options and bits indicating which of the `minTime`, `maxTime`, and two input
   parameter offsets are stored in subsequent indexes of the extra data.

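
As an example of reading these encodings back out, here's a sketch of decoding
an N-ary function node's inputs from the extra data column; the helper name and
the `Column` alias are mine for illustration, not the production API:

```typescript
type Column = Uint8Array | Uint16Array | Uint32Array;

// Decodes the input node indexes of an N-ary function node at index `i`.
function getNaryFunctionInputs(
  payloadColumn: Column,
  extraDataColumn: Column,
  i: number,
): number[] {
  const extra = payloadColumn[i];
  // extraDataColumn[extra] holds the function name index (not needed here),
  // the next slot holds the input count, and the slots after that hold
  // relative offsets backwards in the node table to each input node.
  const inputCount = extraDataColumn[extra + 1];
  const inputs: number[] = [];
  for (let j = 0; j < inputCount; j++) {
    inputs.push(i - extraDataColumn[extra + 2 + j]);
  }
  return inputs;
}
```
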
If you've heard [how Zig builds its compiler](https://vimeo.com/649009599), this
might sound very familiar because it's very much the "encoding strategy" as
named by Andrew Kelley. The `kind` is used to store not just the "kind" of node
we're dealing with but also some information about its data contents, which then
means that we can skip storing that information, simplifying the required
storage format.

Now, the `kindColumn` is always a `Uint8Array`, so each `kind` field costs 1
byte of memory, but the `outputColumn` and `payloadColumn` I haven't given a
concrete type for yet: this is because they do not have a guaranteed type. I'm
taking advantage of the fact that these have fairly similar contents between one
node and the next, and am thus eagerly allocating them using the smallest
possible unsigned integer TypedArray that fits the current data: generally this
means that `outputColumn` is a `Uint8Array`, and `payloadColumn` is either a
`Uint16Array` or a `Uint32Array`. As a result, a single "base node" is 6 bytes
in size. Compared to the 60 bytes we started off with, we have cut the memory
usage of a node by 10x, or more if we count the output `Set`'s backing memory
allocation.

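
The column sizing itself can be as simple as picking the narrowest unsigned
integer TypedArray that holds the column's largest value; a sketch, assuming the
maximum is known at allocation time:

```typescript
// Picks the smallest unsigned integer TypedArray that fits `maxValue`.
function allocateColumn(
  length: number,
  maxValue: number,
): Uint8Array | Uint16Array | Uint32Array {
  if (maxValue <= 0xff) return new Uint8Array(length);
  if (maxValue <= 0xffff) return new Uint16Array(length);
  return new Uint32Array(length);
}
```
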
The "node table" has thus changed from this:

```typescript
interface DataModelNode {
  kind: string;
  in: NodeName[];
  out: Set<NodeName>;
  data: unknown;
}
type NodeTable = Map<NodeName, DataModelNode>;
```

into this:

```typescript
interface NodeTable {
  kindColumn: Uint8Array;
  outputColumn: Uint8Array | Uint16Array | Uint32Array; // usually Uint8Array or Uint16Array
  payloadColumn: Uint8Array | Uint16Array | Uint32Array; // usually Uint16Array or Uint32Array
  extraDataColumn: Uint8Array | Uint16Array | Uint32Array; // usually Uint16Array or Uint32Array
}
```

The number of objects is cut from `1 + N * 3`, where `N` is the number of nodes,
to just 6 (counting the single `ArrayBuffer` shared between all the columns), no
matter the number of nodes (and we actually make `extraDataColumn` `null` if it
is empty, and we drop the node table entirely if it is empty, so the number of
objects can go down to 5 or even 0). All in all, the memory usage seen in real
usage went from more than 10 MiB to a bit over 1 MiB.

## Get your ducks in a row

Okay, that sounds wonderful: should everything be written like this from now on?
Well, yes and no. TypeScript isn't exactly the easiest language in which to do
semi-manual memory management like this
([maybe we can make it a little better, though?](https://github.com/microsoft/TypeScript/issues/62752)),
so the code complexity downside on its own might make this whole thing untenable
in the small. But the final nail in the coffin is that TypedArrays and
`ArrayBuffer`s are massive objects in at least the V8 engine. If you have only a
few objects, then the cost of a TypedArray will overwhelm the cost of those
objects. Only once you get into multiple tens of objects does the math change.

That code complexity, though, doesn't really change no matter how numerous your
objects are. This kind of code is and looks foreign: the best thing you can
probably do is create helper classes that encapsulate the indexing behind APIs
like `getNodeKind` and `getFunctionName`. Soon enough you'll find yourself
arguing between safety and performance: should `getNodeKind` explicitly throw if
the passed-in index is out of bounds? Should `getFunctionName` check that the
passed-in index really points to a function kind, or should it simply interpret
the node payload as a function name index and read into the global function name
array? In Rust that would be accessing a `union` field without a check that the
field is necessarily valid, which would make the calling function `unsafe`: do
you start naming some functions as `unsafeGetFunctionName`, or is that a bridge
too far?

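
For a taste of what those helpers and that trade-off look like, here's a sketch
of such an accessor class; it assumes the `NodeTable` interface from above and a
numeric `NodeKind` enum like the one sketched earlier, and the checked/unchecked
split mirrors the discussion rather than any real production code:

```typescript
class NodeTableView {
  constructor(
    private readonly table: NodeTable,
    private readonly functionNames: readonly string[],
  ) {}

  // Safe variant: bounds-checked access to the kind column.
  getNodeKind(i: number): NodeKind {
    if (i < 0 || i >= this.table.kindColumn.length) {
      throw new RangeError(`node index ${i} is out of bounds`);
    }
    return this.table.kindColumn[i] as NodeKind;
  }

  // Safe variant: verifies the node really is a function node before
  // interpreting its payload as a function name index.
  getFunctionName(i: number): string {
    const kind = this.getNodeKind(i);
    if (kind !== NodeKind.NullaryFunction && kind !== NodeKind.UnaryFunction) {
      throw new TypeError(`node ${i} is not a nullary or unary function node`);
    }
    return this.functionNames[this.table.payloadColumn[i]];
  }

  // Unchecked variant: like reading a union field without checking the tag.
  unsafeGetFunctionName(i: number): string {
    return this.functionNames[this.table.payloadColumn[i]];
  }
}
```
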
I've glossed over all of those complexities here, and for a good reason I think:
nobody wants to read 2000 lines of dense, unfamiliar TypeScript code. Just rest
assured that the code exists, it works, it has been tested, it is heading into
production, and it even achieves fairly good compile-time type safety to boot.
It just isn't trivial. When I return to work in two weeks' time, I'll be
returning to more of this same work; the Data Model is split into two parts: a
static version that is created once and used as a template when instantiating
dynamic Data Models, and the dynamic side. I've only done the static version so
far (which also gives me some extra benefits and ease of implementation that
I've taken advantage of here), and next up will be the real deal: dealing with
the actual, dynamic runtime Data Models.

But before that, it's back to the Nova JavaScript engine and Rust for me. Thanks
for reading, and I'll see you on the other side.
