Skip to content

RFC: uint8array state encoding #234

@wmertens

Description

@wmertens

Champion

@wmertens

What's the motivation for this proposal?

We want to be as efficient as possible in encoding our state, and currently we're putting it in a recursive jagged array (array with arrays), which is terrible for performance/memory.

Note, this is an optimization and can be done at any point in the future without any breakage.

Proposed Solution / Feature

Instead, we ideally encode a byte stream. However, encoding strings is annoying, and html supports utf-8 which is quite space efficient, so we'll encode those separately and index them.

We still need to encode types, references and variable length arrays.
We need to encode a lot of integers of various sizes, so we'll encode those efficiently.

We also encounter a lot of arrays all of the same type, so we'll special-case that.

The approach is again a typeid + data, except now it's all in a flat array, and the data is variable length. If we put the constants in the typeid, about 60 typeIds. We can expand the types some more with various types of integers, refs, and array lengths and use the leftovers for more constants, like numbers that are common.

Because strings are references to existing memory, and the encoded state is a copy, we stream the state as base64 starting with [" and we complete with ",...${the JSON-encoded and script-escaped strings}]. This means we have an array with only strings, which is stored efficiently.

To encode arrays, we need to know where in the byte stream it starts and stops. However, we are streaming so we don't know what the length will be, so instead we'll write all typeIds behind their data, and parse backwards when deserializing.

Furthermore, it is common to have the same type of items in arrays. So we will have a special type of array for that. To allow easy switching, we merge all arrays that follow each other, except when separated by a marker.
If you have an array of constants, MixArray is best. But if it is an array of Int8 numbers, even if some of them are also constants, it should be a SingleArray.
However, this is quite complex, so perhaps just start with only MixArray.

We will encode the root state array itself as an array so we know how big an array to make when looking for indexes.

E.g. these could be the meanings of the typeId byte:

  • Undefined
  • Null
  • True
  • False
  • EmptyString
  • EMPTY_ARRAY
  • EMPTY_OBJ
  • NEEDS_COMPUTATION
  • STORE_ALL_PROPS
  • Slot
  • Fragment
  • NaN
  • PositiveInfinity
  • NegativeInfinity
  • MaxSafeInt
  • AlmostMaxSafeInt (used for close fragment)
  • MinSafeInt
  • +Int8 + 1 byte. We could in theory substract what we have as constants but that's extra work
  • -Int8 + 1
  • +Int16 + 2
  • -Int16 + 2
  • +Int32 + 4
  • -Int32 + 4
  • Number + 8 - non-integers, stored little-endian
  • StringIdx8
  • StringIdx16
  • StringIdx24 we don't support more than 16.7 million strings
  • RefIdx8
  • RefIdx16
  • RefIdx24 we don't support more than 16.7 MB state excluding strings (or should we?)
  • SingleArray8: + ...data (encoded without typeId), typeId, offset 1 byte (so the previous item will be at 1(item typeid) + 1(offset int) + 1 (SingleArray8 marker) + offset bytes backwards).
  • SingleArray16: + ...data (encoded without typeId), typeId, offset 2 byte (so the previous item will be at 1(item typeid) + 2(1-65536 bytes count) + 1 (SingleArray16 marker) + offset bytes backwards).
  • MixArray8: + ...data (encoded with typeId) + offset 1 byte
  • MixArray16: + ...data (encoded with typeId) + offset 2 byte
  • ArrayEnd: + count of items (encoded). Marks the end of the previous arrays. To know total offset, add up the previous Array+MixArray offsets. An empty array that is not the constant EMPTY_ARRAY has count 0.
  • VNode
  • RefVNode
  • Date: + 4 bytes epoch time
  • URL: + stringIndex (encoded int)
  • Regex: + stringIndex (encoded int), byte with the 8 "dgymsuvy" flags
  • BigInt: + stringIndex (encoded int) (just keep it as bigint in the strings array and stream as string)
  • URLSearchParams
  • Error
  • Object
  • Promise
  • Set
  • Map: + mixarray of key+value pairs
  • Uint8Array
  • QRL
  • Task
  • Resource
  • Component
  • Signal
  • WrappedSignal
  • ComputedSignal
  • SerializerSignal
  • Store
  • StoreArray
  • FormData
  • JSXNode
  • PropsProxy
  • EffectData
  • next 64: number 0-64
  • next 64: refIdx 0-64
  • rest: stringIdx 0-64 (a few more actually)

For example, the roots [32, 2.573, "hi", Slot, [Slot]] would encode as [base64(Num32Id <8 bytes> NumberId String0Id SlotId SlotId 1 MixArray8Id Int1Id ArrayEnd 17 MixArray8Id Num5Id ArrayEnd) (30 bytes base64), "hi"]

To decode, JSON.parse the result, take the first item, base64-decode, reconstruct the uint8array, splice off the first item so you are left with the strings array, and then go backwards to reconstruct the roots:

  • ArrayEnd, Num5Id, MixArray8Id, 17: create an array of 5 and stop when you get to 17 bytes earlier
  • index 4:
    • ArrayEnd, Num1Id, MixArray8Id, 1: create an array of 5 and stop when you get to 17 bytes earlier
      • index 0:
        • SlotId: assign the const Slot
  • index 3:
    • SlotId: assign the const Slot
  • index 2:
    • String0Id: assing the string at index 0
  • index 1:
    • NumberId, 8 bytes: reconstruct the float64
  • index 0:
    • Num32Id: assign 32

PRs/ Links / References

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    Status

    In Progress (STAGE 2)

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions