|
| 1 | +--- |
| 2 | +title: "Interlude: A data-oriented model" |
| 3 | +description: A real-world example of using data-oriented design principles in TypeScript. |
| 4 | +date: 2025-11-16 |
| 5 | +authors: |
| 6 | + - name: Aapo Alasuutari |
| 7 | + url: https://github.com/aapoalas |
| 8 | +--- |
| 9 | + |
| 10 | +Hello again! It hasn't been that long since I [last](./worked-for-the-internet) |
| 11 | +blogged, and things are mostly as they were back then. A few meaningful changes |
| 12 | +have happened though: first, I am now on a two-week (paid!) vacation, intending |
| 13 | +to finish up the NLnet grant project before time runs out. Second, I have |
| 14 | +received a negative "your project is not eligible" response to the main grant |
| 15 | +application that I was secretly banking on, which would've set me on a course to |
| 16 | +develop Nova full-time for the next two years. |
| 17 | + |
| 18 | +So, put simply, I am now very temporarily working for the Internet yet again, |
| 19 | +after which I will return to Valmet Automation's sweet embrace. As such, it |
| 20 | +seems fitting to talk a little about the data-oriented design principles that |
| 21 | +underpin much of Nova, and how I've applied those principles in my day job as a |
| 22 | +TypeScript developer. |
| 23 | + |
| 24 | +## A trip down memory lane |
| 25 | + |
| 26 | +This is not the |
| 27 | +[first time](https://archive.fosdem.org/2025/schedule/event/fosdem-2025-4391-how-to-lose-weight-optimising-memory-usage-in-javascript-and-beyond/) |
| 28 | +I've talked about using data-oriented design in TypeScript/JavaScript. In fact, this |
| 29 | +is something that I mentioned in the linked talk and which is explicitly |
| 30 | +explained in the [talk repository](https://github.com/aapoalas/losing-weight) as |
| 31 | +the |
| 32 | +["Data Model"](https://github.com/aapoalas/losing-weight/blob/main/src/4_data_model.ts). |
| 33 | + |
| 34 | +There, the Data Model is described as |
| 35 | + |
| 36 | +> A fundamental directed acyclic graph underpinning the flow of data from the |
| 37 | +> automation system to the user interface. Found to often take ~10 MiB a pop. |
| 38 | + |
| 39 | +and it is made up of node objects that contain four properties: |
| 40 | + |
| 41 | +1. `kind`: this determines the semantic meaning of a node. |
| 42 | +2. `in`: this determines the input nodes to this node by naming them in an |
| 43 | + `Array`. Both the order and any duplicates are significant here. |
| 44 | +3. `out`: this determines the output nodes of this node by naming them in a |
| 45 | + `Set`. Neither the order nor duplicates are significant here. |
| 46 | +4. `data`: this property's value depends on the `kind` and contains any extra |
| 47 | + data needed by the runtime semantics of the node. |
| 48 | + |
| 49 | +These nodes are stored in a `Map<NodeName, DataModelNode>` and in addition, |
| 50 | +there exists effectively a `Map<NodeName, unknown>` data storage hash map for |
| 51 | +storing the current runtime value of a given node. Updating the Data Model then |
| 52 | +means running each node's runtime semantics on its input nodes' current runtime |
| 53 | +values, and storing the resulting value as this node's new runtime value. |
| 54 | + |
| 55 | +The three `kind` values given are `const` (which splits into two: actual |
| 56 | +constants and references), `subscription`, and `function`, so there are |
| 57 | +effectively four kinds of nodes. Their semantics and extra data are as follows: |
| 58 | + |
| 59 | +1. Constant nodes, `kind: "const"`: the node is a constant, has no extra data |
| 60 | + associated with it, and never has any input nodes. Updating a constant simply |
| 61 | + means assigning the new value as the node's new runtime value. |
| 62 | +2. Reference nodes, `kind: "const"`: the node is a reference to some other node, |
| 63 | + has no extra data associated with it, and always has exactly one input node. |
| 64 | + Updating the node means reading the only input node's current runtime value |
| 65 | + and assigning it as the reference node's new runtime value. |
| 66 | +3. Subscription nodes, `kind: "subscription"`: the node is a subscription into |
| 67 | + the automation network. Its extra data is a collection of parameters and |
| 68 | + options used to affect the subscription's exact semantics, and these nodes |
| 69 | + are known to always have one or two input nodes: the first input node |
| 70 | + contains the subscription address, and the second optional one is a dynamic |
| 71 | + object of parameters. Updating a subscription node means unsubscribing the |
| 72 | + previous subscription address (if non-null), subscribing the new address (as |
| 73 | + given by the first input node's runtime value), and setting the |
| 74 | + subscription node's current runtime value to `null`. When the subscription |
| 75 | + from the automation network responds with data, that data is set as the |
| 76 | + subscription node's current runtime value and an update is dispatched to its |
| 77 | + output nodes. |
| 78 | +4. Function nodes, `kind: "function"`: the node is a function on its inputs. Its |
| 79 | + extra data is the function name (to be looked up from a function storage |
| 80 | + Map). Updating the node means reading the current runtime values of its input |
| 81 | + nodes, and running an actual JavaScript function with those values as the |
| 82 | + arguments. The result of the function is stored as the new runtime value of |
| 83 | + the function node. |
| 84 | + |
| 85 | +The way these nodes are constructed is by, effectively, parsing a |
| 86 | +JavaScript-based domain-specific language (DSL) that looks something like this: |
| 87 | + |
| 88 | +```javascript |
| 89 | +let tag = "LIC-100"; |
| 90 | +let address = combineStrings("/plant/", ref("tag"), "/isGood"); |
| 91 | +let isGood = negate(subscription(ref("address"))); |
| 92 | +``` |
| 93 | + |
| 94 | +The `tag`, `address`, and `isGood` are properties and their values are parsed as |
| 95 | +parts of the Data Model. `tag`'s value `"LIC-100"` is parsed as just a constant, |
| 96 | +while `address` is parsed as a function node calling a function by the name of |
| 97 | +`combineStrings` with three parameters: the first one is a constant parameter |
| 98 | +`"/plant/"`, the second is a reference node pointing to the property `tag`, and |
| 99 | +the third one is again a constant parameter with value `"/isGood"`. Finally, the |
| 100 | +`isGood` property is parsed as a function node calling the function `negate` |
| 101 | +with the value of a subscription node that takes as its address a reference node |
| 102 | +pointing to the property `address`. |
| 103 | + |
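To make the object-based model concrete, here is a minimal sketch of how the `address` property from the DSL example could be represented and updated. It is an illustration under assumptions, not the production code: the parameter node names (`address#0` and so on) and the helpers `addNode`, `setConstant`, and `update` are hypothetical, subscriptions are left out, and propagation is a naive recursion.

```typescript
// Sketch of the object-based model. Node names such as "address#0" and the
// helpers addNode/setConstant/update are hypothetical, not the real code.
type NodeName = string;

interface DataModelNode {
  kind: "const" | "subscription" | "function";
  in: NodeName[]; // ordered inputs; duplicates are significant
  out: Set<NodeName>; // unordered outputs
  data: unknown; // for function nodes: the function name
}

const nodes = new Map<NodeName, DataModelNode>();
const values = new Map<NodeName, unknown>();
const functions = new Map<string, (...args: unknown[]) => unknown>([
  ["combineStrings", (...parts: unknown[]) => parts.join("")],
]);

function addNode(
  id: NodeName,
  kind: DataModelNode["kind"],
  inputs: NodeName[],
  data: unknown = null,
): void {
  nodes.set(id, { kind, in: inputs, out: new Set<NodeName>(), data });
  for (const input of inputs) nodes.get(input)?.out.add(id);
}

// Recomputes one node, then naively propagates to its outputs.
function update(id: NodeName): void {
  const node = nodes.get(id);
  if (node === undefined) throw new Error(`unknown node: ${id}`);
  const [first] = node.in;
  if (node.kind === "const" && first !== undefined) {
    values.set(id, values.get(first)); // reference node: copy the input's value
  } else if (node.kind === "function") {
    const fn = functions.get(String(node.data));
    if (fn === undefined) throw new Error(`unknown function: ${String(node.data)}`);
    values.set(id, fn(...node.in.map((input) => values.get(input))));
  }
  for (const output of node.out) update(output);
}

// Constants have no inputs: assign the value, then notify the outputs.
function setConstant(id: NodeName, value: unknown): void {
  values.set(id, value);
  update(id);
}

// let tag = "LIC-100";
// let address = combineStrings("/plant/", ref("tag"), "/isGood");
addNode("tag", "const", []);
addNode("address#0", "const", []); // "/plant/"
addNode("address#1", "const", ["tag"]); // ref("tag")
addNode("address#2", "const", []); // "/isGood"
addNode("address", "function", ["address#0", "address#1", "address#2"], "combineStrings");

setConstant("address#0", "/plant/");
setConstant("address#2", "/isGood");
setConstant("tag", "LIC-100");
```

Even this tiny property needs five node objects, five `Set`s, and five `Array`s, three of which belong to parameter nodes that nothing else can ever refer to.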
| 104 | +At this point, I want to ask a question: do you think that the object-based node |
| 105 | +structure seems to make sense? Ponder for a moment: is this the kind |
| 106 | +of code that you'd write? Or do you see silliness that you know you'd never |
| 107 | +commit to? |
| 108 | + |
| 109 | +I am not quite sure myself: by now all of this code was either written or |
| 110 | +rewritten by me at some point, although I did inherit the basic structure of it |
| 111 | +originally. So obviously I thought this made sense, but I'm not entirely sure if |
| 112 | +I would write it anymore. At the very least it's clear to me that there are |
| 113 | +issues in this code, though they may not be dealbreakers depending on the |
| 114 | +use-case. |
| 115 | + |
| 116 | +## I am altering the deal... |
| 117 | + |
| 118 | +The main issues in the existing implementation become quite clear when we look |
| 119 | +at it in detail. The very first issue is simply the memory usage: in Chrome |
| 120 | +each node object took up `(3 + 4) * 4` (3 for the object header + 4 inline |
| 121 | +properties) or 28 bytes. Add to that the 16 bytes needed for each of the `in` |
| 122 | +`Array` and the `out` `Set` and we're already at 60 bytes, or nearly a full |
| 123 | +cache line of data for a single node. Add in the `out` `Set`'s backing memory |
| 124 | +allocation, which is done even when the `Set` is empty and takes probably more |
| 125 | +than a full cache line on its own, and we're probably easily over two or even |
| 126 | +three cache lines of data. The total memory usage for an empty node is probably |
| 127 | +something around 150 bytes. |
| 128 | + |
| 129 | +But there are structural issues with the nodes as well. First, while nodes |
| 130 | +belonging to properties like `tag` or `address` can have references pointing to |
| 131 | +them, there is no way for a reference to refer to eg. the `"/plant/"` constant |
| 132 | +parameter "inside" the `address` property's node graph: this means that we know |
| 133 | +that all "parameter" nodes must always have exactly one output, which is the |
| 134 | +node that they are a parameter of. This makes the outputs `Set` seem quite |
| 135 | +ridiculous indeed with its large backing memory allocation used to store just a |
| 136 | +single node name string most of the time. Second, the number of inputs is often |
| 137 | +small and statically known (0 for constants, 1 for references, 1 or 2 for subscriptions); |
| 138 | +even for functions we know the number of inputs for a given function node during |
| 139 | +parsing so we have no need for a dynamically resizable container to store the |
| 140 | +input names. This makes the `in` `Array` seem quite ridiculous as well. |
| 141 | + |
| 142 | +Third, constant parameter nodes (like the `"/plant/"` string) do not really |
| 143 | +serve any purpose: we just want to know that they are constant parameter nodes |
| 144 | +but the node object itself has nothing of value to us: the output is never |
| 145 | +needed as the constant parameter can never change (meaning that we never ask the |
| 146 | +question "what is the output of this constant parameter node"), the inputs Array |
| 147 | +is known to be empty, and no extra data exists for constants. The only thing |
| 148 | +we're interested in is the current runtime value of the node, and that is stored |
| 149 | +in a separate `Map`. |
| 150 | + |
| 151 | +Fourth, reference parameter nodes do not really serve any purpose: instead of |
| 152 | +creating a separate node whose only purpose is to have an input pointing to eg. |
| 153 | +`tag`, we could just as well remove that entire node and have the reference |
| 154 | +node's output (usually a function or subscription node) refer to that `tag` |
| 155 | +directly. |
| 156 | + |
| 157 | +The third and fourth issues I had already taken care of ages ago; constant and |
| 158 | +reference parameter nodes do not exist in the Data Model at all. The first and |
| 159 | +second points I hadn't fully realised yet, but I had plans... |
| 160 | + |
| 161 | +## ... pray I don't alter it any further |
| 162 | + |
| 163 | +I had actually seen some other issues as well. The `kind` field was a huge waste |
| 164 | +of memory, taking up an entire JavaScript Value (4 or 8 bytes depending on the |
| 165 | +engine) to store what amounted to 2 bits of information (one of 4 options). |
| 166 | +Likewise, the extra data for subscription nodes was horrendously inefficient, |
| 167 | +storing a set of JavaScript booleans in an object with each boolean fully filled |
| 168 | +in with its default value if not explicitly defined in the source DSL, so as to |
| 169 | +optimise object shapes. That meant using many tens of bytes to store what |
| 170 | +amounted to a few bits of data. |
| 171 | + |
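To illustrate the gap, a handful of booleans can be packed into the bits of a single integer. The sketch below assumes three made-up option names (`Historical`, `Buffered`, `Quality`), since the real subscription options are not listed here; `packFlags` and `hasFlag` are likewise hypothetical helpers.

```typescript
// Hypothetical boolean subscription options; the real names are not given.
const SubscriptionFlag = {
  Historical: 1 << 0,
  Buffered: 1 << 1,
  Quality: 1 << 2,
} as const;

type FlagName = keyof typeof SubscriptionFlag;

// Packs an options object into one integer; missing options default to false.
function packFlags(options: Partial<Record<FlagName, boolean>>): number {
  let bits = 0;
  for (const key of Object.keys(SubscriptionFlag) as FlagName[]) {
    if (options[key] === true) bits |= SubscriptionFlag[key];
  }
  return bits;
}

// Tests a single option without allocating anything.
function hasFlag(bits: number, flag: FlagName): boolean {
  return (bits & SubscriptionFlag[flag]) !== 0;
}

const packed = packFlags({ Buffered: true, Quality: true });
```

Three options fit in three bits, so the packed value fits comfortably in one byte, compared to an object with one property slot per boolean.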
| 172 | +But even had I fixed all of these issues, the reality was still that our Data |
| 173 | +Models can get really big, too big. We're talking half a million to a million |
| 174 | +nodes per Data Model, and there is no exact limit to how many Data Models a user |
| 175 | +can have open at the same time. (Funny story, a particular customer had noticed |
| 176 | +a cool trick where they could sort of minimise parts of the UI and then use a |
| 177 | +double-click feature to bring it quickly back into view. This meant that they |
| 178 | +had tens of large Data Models running simultaneously, as opposed to the expected |
| 179 | +count of low single digits. Users are clever!) |
| 180 | + |
| 181 | +At those numbers, just the object headers for a single Data Model's nodes add up |
| 182 | +to nearly 6 MiB. My bet for solving this issue was thus not to try to shrink the |
| 183 | +JavaScript node objects at all, but to remove them entirely! And this is where |
| 184 | +we get to the data-oriented design part of the blog post. |
| 185 | + |
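The per-node figures quoted earlier and the header total can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it takes the 4-byte field size and the 16-byte container objects from the Chrome numbers above, and assumes the low end of the node count (half a million).

```typescript
const FIELD_BYTES = 4; // size of one field, per the Chrome figures above
const headerBytes = 3 * FIELD_BYTES; // object header: 12 bytes
const nodeObjectBytes = headerBytes + 4 * FIELD_BYTES; // + kind, in, out, data
const containerBytes = 16; // each of the `in` Array and the `out` Set objects
const perNodeBytes = nodeObjectBytes + 2 * containerBytes;

const nodeCount = 500_000; // assumed: low end of "half a million to a million"
const headerTotalMiB = (nodeCount * headerBytes) / (1024 * 1024);

console.log(nodeObjectBytes, perNodeBytes, headerTotalMiB.toFixed(2));
```

With these inputs the node object comes to 28 bytes, the node plus its two containers to 60 bytes, and the headers alone to roughly 5.7 MiB per Data Model.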
| 186 | +## Lining it all up |
| 187 | + |
| 188 | +The answer to all of this was obviously to take matters into my own hands using |
| 189 | +ArrayBuffers and TypedArrays. The `kind` field could easily fit into a |
| 190 | +`Uint8Array`, while the others seemed to be begging for a bit of a rethink. |
| 191 | + |
| 192 | +I'm going to skip to the end here, and just tell you what I did: the final |
| 193 | +result was that a single Data Model node is an index in three TypedArrays: the |
| 194 | +`kindColumn`, the `outputColumn`, and the `payloadColumn`. These three form what |
| 195 | +could be called the "node table". Additionally, an `extraDataColumn` exists on |
| 196 | +the side that has a length dependent on the contents of the node table. In this |
| 197 | +transformation, the number of node `kind`s shot up from 3 (effectively 4) to 7, |
| 198 | +and they are now of course number values stored in a `Uint8Array` instead of |
| 199 | +strings like before. The `kind`s are: |
| 200 | + |
| 201 | +1. Constant node: same as before. |
| 202 | +1. Reference node: same as before, except now with a different `kind` value. |
| 203 | +1. Nullary function node: a function taking no parameters. |
| 204 | +1. Unary function node: a function taking one parameter. |
| 205 | +1. N-ary function node: a function taking two or more parameters. |
| 206 | +1. Subscription node: a subscription node with no non-boolean options (`minTime` |
| 207 | + / `maxTime`) or dynamic parameters, ie. only has one input node. |
| 208 | +1. Parametrised subscription node: a subscription node with some non-boolean |
| 209 | + options or dynamic parameters. This has one or two input nodes. |
| 210 | + |
| 211 | +Each node has an `out` value (stored in the `outputColumn`) which is a relative |
| 212 | +offset forwards in the node table pointing to the node's output node. If the |
| 213 | +relative offset is 0, then this node is a property node. In these cases, the |
| 214 | +node has extra data (like the incoming references to this property) available in |
| 215 | +a separate "property table" which I'm going to gloss over today. |
| 216 | + |
| 217 | +Finally, the `payload` value of each node (stored in the `payloadColumn`) |
| 218 | +depends on the `kind` of the node, but a common theme is that in most cases the |
| 219 | +payload is an index into some storage `Array`. They go like this: |
| 220 | + |
| 221 | +1. Constant node: the payload is an index into a global array of constant |
| 222 | + values. |
| 223 | +1. Reference node: the payload is an index into a global array of property |
| 224 | + names. |
| 225 | +1. Nullary and unary function nodes: the payload is an index into a global array |
| 226 | + of function names. |
| 227 | +1. N-ary function node: the payload is an index into the local |
| 228 | + `extraDataColumn`. The pointed-to index contains an index into the global |
| 229 | + array of function names, the index after that is the number of inputs this |
| 230 | + function node has, and subsequent indexes after that contain relative offsets |
| 231 | + backwards in the node table pointing to each input node. |
| 232 | +1. Subscription node: the payload is a bitset of the boolean options of the |
| 233 | + subscription. |
| 234 | +1. Parametrised subscription node: the payload is an index into the local |
| 235 | + `extraDataColumn`. The pointed-to index contains the bitset of boolean |
| 236 | + options and bits indicating which of the `minTime`, `maxTime`, and two input |
| 237 | + parameter offsets are stored in subsequent indexes of the extra data. |
| 238 | + |
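As a rough illustration of the encoding described above, here is how `address = combineStrings("/plant/", ref("tag"), "/isGood")` could be laid out as four rows of a node table. This is a sketch under assumptions: the numeric `Kind` values, the fixed column types, and the helper names (`read`, `getOutputNode`, `getNaryFunction`, and so on) are all hypothetical.

```typescript
// Kind values are illustrative; the actual numbers are not given in the post.
const Kind = {
  Constant: 0,
  Reference: 1,
  NullaryFunction: 2,
  UnaryFunction: 3,
  NaryFunction: 4,
  Subscription: 5,
  ParametrisedSubscription: 6,
} as const;

interface NodeTable {
  kindColumn: Uint8Array;
  outputColumn: Uint8Array;
  payloadColumn: Uint16Array;
  extraDataColumn: Uint16Array;
}

// Global storage arrays that payloads index into.
const constants: unknown[] = ["/plant/", "/isGood"];
const propertyNames: string[] = ["tag"];
const functionNames: string[] = ["combineStrings"];

// Rows: 0 = "/plant/", 1 = ref("tag"), 2 = "/isGood", 3 = combineStrings(...)
const table: NodeTable = {
  kindColumn: Uint8Array.of(Kind.Constant, Kind.Reference, Kind.Constant, Kind.NaryFunction),
  outputColumn: Uint8Array.of(3, 2, 1, 0), // forward offsets; 0 marks a property node
  payloadColumn: Uint16Array.of(0, 0, 1, 0), // meaning depends on the kind
  // [function name index, input count, backward offset of each input...]
  extraDataColumn: Uint16Array.of(0, 3, 3, 2, 1),
};

// Bounds-checked read from any column.
function read(column: ArrayLike<number>, index: number): number {
  const value = column[index];
  if (value === undefined) throw new RangeError(`index ${index} is out of bounds`);
  return value;
}

// Returns the output node's index, or null for a property node.
function getOutputNode(t: NodeTable, node: number): number | null {
  const offset = read(t.outputColumn, node);
  return offset === 0 ? null : node + offset;
}

function getConstantValue(t: NodeTable, node: number): unknown {
  if (read(t.kindColumn, node) !== Kind.Constant) throw new TypeError("not a constant node");
  return constants[read(t.payloadColumn, node)];
}

function getReferenceTarget(t: NodeTable, node: number): string | undefined {
  if (read(t.kindColumn, node) !== Kind.Reference) throw new TypeError("not a reference node");
  return propertyNames[read(t.payloadColumn, node)];
}

// Decodes the extra data of an N-ary function node into absolute input indexes.
function getNaryFunction(t: NodeTable, node: number): { fn: string; inputs: number[] } {
  if (read(t.kindColumn, node) !== Kind.NaryFunction) throw new TypeError("not an N-ary function node");
  const base = read(t.payloadColumn, node);
  const fn = functionNames[read(t.extraDataColumn, base)];
  if (fn === undefined) throw new RangeError("unknown function name index");
  const count = read(t.extraDataColumn, base + 1);
  const inputs: number[] = [];
  for (let i = 0; i < count; i++) {
    inputs.push(node - read(t.extraDataColumn, base + 2 + i));
  }
  return { fn, inputs };
}
```

Because both kinds of offsets are relative, the four rows can sit anywhere in a larger table without being rewritten, and small offsets keep the columns narrow.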
| 239 | +If you've heard [how Zig builds its compiler](https://vimeo.com/649009599), this |
| 240 | +might sound very familiar because it's very much the "encoding strategy" as |
| 241 | +named by Andrew Kelley. The `kind` is used to store not just the "kind" of node |
| 242 | +we're dealing with but also some information about its data contents, which then |
| 243 | +means that we can skip storing that information, simplifying the required |
| 244 | +storage format. |
| 245 | + |
| 246 | +Now, the `kindColumn` is always a `Uint8Array` so each `kind` field costs 1 byte |
| 247 | +of memory, but the `outputColumn` and `payloadColumn` I haven't given a concrete |
| 248 | +type for yet: this is because they do not have a guaranteed type. I'm taking |
| 249 | +advantage of the fact that these have fairly similar contents between one node |
| 250 | +and the next, and am thus eagerly allocating them using the smallest possible |
| 251 | +unsigned integer TypedArray that fits the current data: generally this means |
| 252 | +that `outputColumn` is a `Uint8Array`, and `payloadColumn` is either a |
| 253 | +`Uint16Array` or a `Uint32Array`. As a result, a single "base node" is generally 4 or 6 bytes |
| 254 | +in size. Compared to the 60 bytes we started off with, even 6 bytes cuts the memory usage |
| 255 | +of a node by 10x, or more if we count the output `Set`'s backing memory |
| 256 | +allocation. |
| 257 | + |
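Picking the narrowest column type can be sketched as a single pass over the values. The helper name `allocateColumn` is hypothetical, and the sketch assumes all values are known up front as a plain array, which may not match how the real code builds its columns.

```typescript
type UintColumn = Uint8Array | Uint16Array | Uint32Array;

// Copies the values into the smallest unsigned TypedArray that can hold them.
function allocateColumn(values: readonly number[]): UintColumn {
  let max = 0;
  for (const value of values) {
    if (!Number.isInteger(value) || value < 0 || value > 0xffffffff) {
      throw new RangeError(`value ${value} does not fit an unsigned 32-bit column`);
    }
    if (value > max) max = value;
  }
  if (max <= 0xff) return Uint8Array.from(values);
  if (max <= 0xffff) return Uint16Array.from(values);
  return Uint32Array.from(values);
}

const offsets = allocateColumn([3, 2, 1, 0]); // small relative offsets
const payloads = allocateColumn([0, 70_000, 12]); // one large index widens the column
```

Relative offsets stay small regardless of table size, which is why a one-byte column is usually enough for them, while a single large index is enough to widen the payload column for every row.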
| 258 | +The "node table" has thus changed from this: |
| 259 | + |
| 260 | +```typescript |
| 261 | +interface DataModelNode { |
| 262 | + kind: string; |
| 263 | + in: NodeName[]; |
| 264 | + out: Set<NodeName>; |
| 265 | + data: unknown; |
| 266 | +} |
| 267 | +type NodeTable = Map<NodeName, DataModelNode>; |
| 268 | +``` |
| 269 | + |
| 270 | +into this: |
| 271 | + |
| 272 | +```typescript |
| 273 | +interface NodeTable { |
| 274 | + kindColumn: Uint8Array; |
| 275 | + outputColumn: Uint8Array | Uint16Array | Uint32Array; // usually Uint8Array or Uint16Array |
| 276 | + payloadColumn: Uint8Array | Uint16Array | Uint32Array; // usually Uint16Array or Uint32Array |
| 277 | + extraDataColumn: Uint8Array | Uint16Array | Uint32Array; // usually Uint16Array or Uint32Array |
| 278 | +} |
| 279 | +``` |
| 280 | + |
| 281 | +The number of objects is cut from `1 + N * 3` where `N` is the number of nodes, |
| 282 | +to just 6 (counting a single `ArrayBuffer` shared between all the |
| 283 | +columns), no matter the number of nodes (and we actually make `extraDataColumn` |
| 284 | +`null` if it is empty, and we drop the node table entirely if it is empty, so |
| 285 | +the number of objects can go to 5 or 0). All in all, the memory usage seen in |
| 286 | +real usage went from more than 10 MiB to a bit over 1 MiB. |
| 287 | + |
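Sharing one `ArrayBuffer` between columns takes a little care, because a `Uint16Array` or `Uint32Array` view throws a `RangeError` if its byte offset is not a multiple of its element size. The sketch below pads each column's start offset to handle that; the helper names `makeView` and `allocateColumns` are hypothetical, and whether the real code pads or reorders columns is not stated here.

```typescript
type UintColumn = Uint8Array | Uint16Array | Uint32Array;
type ElementSize = 1 | 2 | 4;

// Creates a view of the right width over part of a shared buffer.
function makeView(
  buffer: ArrayBuffer,
  size: ElementSize,
  byteOffset: number,
  length: number,
): UintColumn {
  switch (size) {
    case 1:
      return new Uint8Array(buffer, byteOffset, length);
    case 2:
      return new Uint16Array(buffer, byteOffset, length);
    case 4:
      return new Uint32Array(buffer, byteOffset, length);
  }
}

// Lays the columns out back to back in a single ArrayBuffer, rounding each
// start offset up to a multiple of the column's element size.
function allocateColumns(
  specs: readonly { size: ElementSize; length: number }[],
): UintColumn[] {
  let byteLength = 0;
  const layout = specs.map(({ size, length }) => {
    const byteOffset = Math.ceil(byteLength / size) * size;
    byteLength = byteOffset + length * size;
    return { size, length, byteOffset };
  });
  const buffer = new ArrayBuffer(byteLength);
  return layout.map(({ size, length, byteOffset }) =>
    makeView(buffer, size, byteOffset, length),
  );
}

// Five nodes: kind (1 byte), out (1 byte), payload (4 bytes), plus 3 extra data slots.
const columns = allocateColumns([
  { size: 1, length: 5 },
  { size: 1, length: 5 },
  { size: 4, length: 5 },
  { size: 2, length: 3 },
]);
```

In this example the payload column starts at byte 12 instead of 10, so two bytes of padding are spent to keep the 4-byte view aligned.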
| 288 | +## Get your ducks in a row |
| 289 | + |
| 290 | +Okay, that sounds wonderful: should everything be written like this from now on? |
| 291 | +Well, yes and no. TypeScript isn't exactly the easiest language to use |
| 292 | +semi-manual memory management like this in |
| 293 | +([maybe we can make it a little better, though?](https://github.com/microsoft/TypeScript/issues/62752)), |
| 294 | +so the code complexity downside on its own might make this whole thing untenable |
| 295 | +in the small. But the final nail in the coffin is that TypedArrays and |
| 296 | +`ArrayBuffer`s are massive objects in at least the V8 engine. If you have only a |
| 297 | +few objects, then the cost of a TypedArray will overwhelm the cost of a few |
| 298 | +objects. Only once you get into multiple tens of objects does the math change. |
| 299 | + |
| 300 | +That code complexity though: no matter how numerous your objects, it doesn't |
| 301 | +really change the code complexity. This kind of code is and looks foreign: the |
| 302 | +best thing you can probably do is create helper classes that encapsulate the |
| 303 | +indexing behind APIs like `getNodeKind` and `getFunctionName`. Soon enough |
| 304 | +you'll find yourself arguing between safety and performance: should |
| 305 | +`getNodeKind` explicitly throw if the passed-in index is out of bounds? Should |
| 306 | +`getFunctionName` check that the passed-in index really points to a function |
| 307 | +kind, or should it simply interpret the node payload as a function name index |
| 308 | +and read into the global function name array? In Rust that would be accessing a |
| 309 | +`union` field without a check that the field is necessarily valid, which would |
| 310 | +make the calling function `unsafe`: do you start naming some functions as |
| 311 | +`unsafeGetFunctionName`, or is that a bridge too far? |
| 312 | + |
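The trade-off can be sketched with a small wrapper class. Only the method names `getNodeKind`, `getFunctionName`, and `unsafeGetFunctionName` come from the discussion above; the class name `NodeTableView`, the `FUNCTION_KIND` value, and the constructor shape are assumptions made for illustration.

```typescript
const FUNCTION_KIND = 3; // hypothetical numeric kind value for function nodes

class NodeTableView {
  private readonly kindColumn: Uint8Array;
  private readonly payloadColumn: Uint16Array;
  private readonly functionNames: readonly string[];

  constructor(
    kindColumn: Uint8Array,
    payloadColumn: Uint16Array,
    functionNames: readonly string[],
  ) {
    this.kindColumn = kindColumn;
    this.payloadColumn = payloadColumn;
    this.functionNames = functionNames;
  }

  // Checked: throws instead of returning `undefined` for a bad index.
  getNodeKind(node: number): number {
    const kind = this.kindColumn[node];
    if (kind === undefined) throw new RangeError(`node index ${node} is out of bounds`);
    return kind;
  }

  // Checked: verifies the kind before interpreting the payload.
  getFunctionName(node: number): string {
    if (this.getNodeKind(node) !== FUNCTION_KIND) {
      throw new TypeError(`node ${node} is not a function node`);
    }
    return this.unsafeGetFunctionName(node);
  }

  // Unchecked: blindly treats the payload as a function name index. For a node
  // of the wrong kind this silently returns a wrong name or `undefined`.
  unsafeGetFunctionName(node: number): string {
    return this.functionNames[this.payloadColumn[node] as number] as string;
  }
}

// Row 0 is a non-function node with payload 1, row 1 is a function node with payload 0.
const view = new NodeTableView(
  Uint8Array.of(0, FUNCTION_KIND),
  Uint16Array.of(1, 0),
  ["negate", "combineStrings"],
);
```

Here `view.getFunctionName(0)` throws, while `view.unsafeGetFunctionName(0)` quietly returns `"combineStrings"` for a node that is not a function at all, which is exactly the kind of mistake the `unsafe` prefix is meant to flag.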
| 313 | +I've glossed over all of those complexities here, and for good reason, I think: |
| 314 | +nobody wants to read 2000 lines of dense, unfamiliar TypeScript code. Just rest |
| 315 | +assured that the code exists, it works, has been tested, is heading into |
| 316 | +production, and even achieves fairly good compile-time type safety to boot. It |
| 317 | +just isn't trivial. When I return to work in two weeks' time, I'll be returning |
| 318 | +to more of this same work; the Data Model is split into two parts, a static |
| 319 | +version of it that is created once and used as a template when instantiating |
| 320 | +dynamic Data Models, and the dynamic side. I've only done the static version so |
| 321 | +far (which also gives me some extra benefits and ease of implementation that |
| 322 | +I've taken advantage of here), and next up will be the real deal: dealing with |
| 323 | +the actual, dynamic runtime Data Models. |
| 324 | + |
| 325 | +But before that, it's back to the Nova JavaScript engine and Rust for me. Thanks for |
| 326 | +reading, I'll see you on the other side. |