Memory footprint analysis #108

@char0n

ApiDOM Memory Footprint Analysis

Test Subject

  • File: GitHub REST API OpenAPI description (api.github.com.2022-11-28.deref.json)
  • Format: OpenAPI 3.1.0, fully dereferenced
  • Size: 73.4 MB (JSON), 54.2 MB (YAML equivalent)
  • Node.js: v24.10.0, V8

Pipeline Overview

JSON path (strict mode)

JSON string → JSON.parse (POJO) → baseRefract (Generic ApiDOM) → refractOpenApi3_1 (Semantic ApiDOM)

YAML path (tree-sitter)

YAML string → tree-sitter CST → CSTTransformer → YAML AST → YAMLASTTransformer → Generic ApiDOM → refractOpenApi3_1 (Semantic ApiDOM)

CSTTransformer and YAMLASTTransformer run serially — each stage is consumed before the next produces output. CST and YAML AST do not accumulate in memory simultaneously.

Measurement Results

JSON Pipeline (73.4 MB file)

| Stage | Incremental Cost | Cumulative Heap | Multiplier vs String |
|---|---|---|---|
| Raw string | +147 MB | 147 MB | 2.0x |
| JSON.parse (POJO) | +52 MB | 199 MB | 2.7x |
| baseRefract (generic ApiDOM) | +408 MB | 607 MB | 8.3x |
| refractOpenApi3_1 (semantic ApiDOM) | +2,064 MB | 2,671 MB | 36.4x |

Note: the semantic refraction delta (+2,064 MB) includes an internal baseRefract call (~408 MB for a second generic tree) plus the semantic tree construction.

A realistic single call to refractOpenApi3_1, without the separate explicit baseRefract call that the table includes, costs ~2,278 MB in total, ~31x the string.
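Per-stage figures like the ones above can be collected with a small harness along the following lines. This is a minimal sketch, not the script used for the numbers in this issue: `measureStages`, `heapUsedMb`, and the stand-in stages are hypothetical, and it assumes Node.js is started with `--expose-gc` so that `globalThis.gc` exists (without it the deltas also include uncollected garbage). The baseline is taken after the source string is already loaded, so the raw string cost is not part of the output.

```javascript
// Hypothetical harness: records V8 heap growth after each pipeline stage.
// Run with `node --expose-gc` for stable numbers; without it gc() is skipped.
const MB = 1024 * 1024;

function heapUsedMb() {
  if (typeof globalThis.gc === 'function') globalThis.gc(); // collect garbage first
  return process.memoryUsage().heapUsed / MB;
}

// stages: array of [name, fn]; each fn receives the previous stage's output.
function measureStages(input, stages, sourceMb) {
  const retained = [input]; // keep every intermediate result alive
  const baseline = heapUsedMb();
  let previous = baseline;
  const rows = [];
  for (const [name, fn] of stages) {
    retained.push(fn(retained[retained.length - 1]));
    const now = heapUsedMb();
    rows.push({
      stage: name,
      incrementalMb: now - previous,
      cumulativeMb: now - baseline,
      multiplier: (now - baseline) / sourceMb,
    });
    previous = now;
  }
  return rows;
}

// Stand-in document and stages; swap in baseRefract / refractOpenApi3_1 for real runs.
const json = JSON.stringify({ paths: Array.from({ length: 1000 }, (_, i) => ({ id: i })) });
const rows = measureStages(
  json,
  [
    ['JSON.parse (POJO)', (source) => JSON.parse(source)],
    ['deep copy (stand-in for refraction)', (pojo) => JSON.parse(JSON.stringify(pojo))],
  ],
  json.length / MB, // character count as a rough proxy for file size
);
console.table(rows);
```

On a document this small the deltas are mostly noise; the harness only becomes informative with inputs in the megabyte range.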

YAML Pipeline (0.73 MB file, 30 paths subset)

| Stage | No Source Maps | With Source Maps |
|---|---|---|
| CST (tree-sitter) | ~0 MB | ~0 MB |
| Generic ApiDOM | +6.5 MB | +6.2 MB |
| Semantic ApiDOM | +21.0 MB | +20.4 MB |
| Total | 27.2 MB (37x) | 26.6 MB (36x) |

tree-sitter-yaml limitation: max 32,768 lines per file (see tree-sitter-yaml#35).

Key Observations

  • CST cost is negligible — tree-sitter allocates in native/WASM memory, not on the V8 heap.
  • Source maps add almost nothing — the 6 number fields (startLine through endOffset) per element are cheap.
  • Semantic refraction dominates — ~20-21 MB for the same data regardless of JSON or YAML origin.
  • Multiplier vs string differs by format — YAML shows ~37x, JSON shows ~59x for the same data, because YAML strings are larger (more whitespace) for identical content.

Element Counts (73.4 MB file)

| Metric | Generic Tree | Semantic Tree |
|---|---|---|
| Total elements | 3,003,457 | 3,003,457 |
| MemberElements | 888,408 | 888,408 |
| Meta materialized | 0 | 1,491,815 |
| Attributes materialized | 0 | 0 |

The element counts are identical. The difference is entirely in meta materialization.

Root Cause: Meta Materialization

How meta materialization works

Each Element has a lazily-initialized _meta property. When code accesses .classes, .meta.set(), or similar, it triggers creation of a full ObjectElement tree:

_meta = ObjectElement                  (~80 bytes)
  └── MemberElement                    (~80 bytes)
       └── KeyValuePair               (~32 bytes)
            ├── StringElement (key)    (~80 bytes)
            └── ArrayElement (value)   (~80 bytes)
               └── StringElement       (~80 bytes)

~6 Element objects (~430 bytes) per materialization.
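The allocation pattern can be reproduced with a toy model. The sketch below does not use the real ApiDOM classes: `ToyElement`, `addClass`, and the `allocated` counter are hypothetical stand-ins that only mirror the shape of the tree shown above, so that the six objects per first class name can be counted.

```javascript
// Toy model of lazy meta materialization; NOT the real ApiDOM classes.
let allocated = 0; // counts every object this model creates

class ToyElement {
  constructor(content) {
    allocated += 1;
    this.content = content;
    this._meta = undefined; // nothing is allocated until first access
  }

  get meta() {
    if (this._meta === undefined) this._meta = new ToyElement([]); // ObjectElement
    return this._meta;
  }

  get classes() {
    let entry = this.meta.content.find((m) => m.content.key.content === 'classes');
    if (entry === undefined) {
      allocated += 1; // the KeyValuePair is a plain object, counted separately
      const pair = { key: new ToyElement('classes'), value: new ToyElement([]) };
      entry = new ToyElement(pair); // MemberElement
      this.meta.content.push(entry);
    }
    return entry.content.value; // ArrayElement holding the class names
  }

  addClass(name) {
    this.classes.content.push(new ToyElement(name)); // StringElement
  }
}

const el = new ToyElement('some member');
const before = allocated;
el.addClass('fixed-field');
const firstClassCost = allocated - before;
console.log(firstClassCost); // 6 extra objects for a single class name
```

A second class name on the same element costs only one more StringElement in this model; the expensive part is the first touch.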

Top meta materialization sources

| Source | Elements Affected | Savings if Removed |
|---|---|---|
| FixedFieldsVisitor: newMemberElement.classes.push('fixed-field') | ~591K | -568 MB |
| PatternedFieldsVisitor: newMemberElement.classes.push('patterned-field') | (included above) | (included above) |
| JSONSchemaVisitor.handleDialectIdentifier: this.element.meta.set('inheritedDialectIdentifier', ...) | ~251K | -272 MB |
| JSONSchemaVisitor.handleSchemaIdentifier: this.element.meta.set('ancestorsSchemaIdentifiers', ...) | (included above) | (included above) |
| copyMetaAndAttributes propagation | N/A | -149 MB |
| Other visitors (specification-extension, content, parameters, reference-element, path-template, etc.) | ~649K remaining | ~200 MB |

Meta materialization by element type (semantic tree)

| Element Type | Count |
|---|---|
| schema | 251,530 |
| member | 233,163 |
| string | 224,913 |
| array | 159,347 |
| object | 22,135 |
| response | 3,171 |
| mediaType | 2,992 |
| other | 3,317 |
| Total | 900,537 (after disabling fixed/patterned field classes) |

Memory Breakdown (73.4 MB file, original unpatched)

| Component | MB | % of Total |
|---|---|---|
| Raw string | 147 | 5% |
| POJO (JSON.parse) | 52 | 2% |
| Generic tree (from explicit baseRefract call) | 408 | 15% |
| Generic tree (internal to refractOpenApi3_1) | ~408 | 15% |
| Semantic tree (deep copy of elements) | ~408 | 15% |
| Meta materialization (~9M extra Element objects) | ~989 | 37% |
| Visitor instances + traversal overhead | ~274 | 10% |
| Total | ~2,686 | 100% |

Meta materialization accounts for 37% of total memory.

Why cloneDeep Is Not the Problem

Making cloneDeep a no-op (identity function) saved only ~1 MB. This is because:

  • With cloneDeep active: originals are created, clones are created, originals are GC'd after refraction. Final state: clones.
  • With cloneDeep as no-op: originals are shared into the semantic tree. No clones created, but originals can't be GC'd (still referenced). Final state: originals.

Same number of live objects in both cases. cloneDeep determines which objects survive, not how many.
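The argument can be checked on a toy tree. In this sketch `buildGeneric`, `refract`, and `run` are hypothetical helpers, and `cloneDeep` is a shallow stand-in rather than the ApiDOM function; the point is only that the final tree holds the same number of nodes either way, and that the copy function decides whether those nodes are the originals.

```javascript
// Toy illustration: cloning changes WHICH objects survive, not HOW MANY.
function buildGeneric(n) {
  return Array.from({ length: n }, (_, i) => ({ value: i }));
}

const cloneDeep = (node) => ({ ...node }); // stand-in copy of a flat node
const identity = (node) => node;           // cloneDeep patched to a no-op

// The semantic tree keeps whatever `copy` returns for each generic node.
function refract(generic, copy) {
  return generic.map(copy);
}

function run(copy) {
  const generic = buildGeneric(1000);
  const semantic = refract(generic, copy);
  return {
    live: new Set(semantic).size,             // nodes reachable from the result
    sameObjects: semantic[0] === generic[0],  // are the originals being reused?
  };
}

const withClone = run(cloneDeep);
const withNoOp = run(identity);
console.log(withClone); // { live: 1000, sameObjects: false }
console.log(withNoOp);  // { live: 1000, sameObjects: true }
```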

Why Generic ApiDOM Is ~8x the POJO

Every JSON object property "key": value creates 4 heap objects:

MemberElement       (~80 bytes)
  KeyValuePair      (~32 bytes)
    StringElement   (~80 bytes)  ← key
    StringElement   (~80 bytes)  ← value

vs. a POJO property: one hidden class slot (~8 bytes) + string pointer.

This 8x overhead is reasonable for a fully-typed element tree where every node is individually addressable and can carry metadata and source positions.
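To make the per-property cost concrete, the following sketch counts how many heap objects the model above implies for a given POJO. It is a counting exercise only and imports nothing from ApiDOM; `countGenericObjects` is a hypothetical helper, and it assumes one element per primitive, one per container, and MemberElement + KeyValuePair + key StringElement per property.

```javascript
// Counts the objects a generic element tree would need for a POJO,
// following the 4-objects-per-property model described above.
function countGenericObjects(value) {
  if (Array.isArray(value)) {
    // one ArrayElement plus whatever its items need
    return 1 + value.reduce((sum, item) => sum + countGenericObjects(item), 0);
  }
  if (value !== null && typeof value === 'object') {
    // one ObjectElement, then per property:
    // MemberElement + KeyValuePair + key StringElement (3) + the value's own objects
    return 1 + Object.values(value).reduce(
      (sum, child) => sum + 3 + countGenericObjects(child),
      0,
    );
  }
  return 1; // primitive element (string, number, boolean, null)
}

const single = countGenericObjects({ key: 'value' });
const nested = countGenericObjects({ a: 'x', b: { c: 1 } });
console.log(single); // 5: the ObjectElement plus 4 objects for its one property
console.log(nested); // 13
```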

Optimization Opportunities

1. Cheap classes storage (highest impact, no mutation required)

Savings: ~989 MB (37% of total)

Replace the full ObjectElement-based meta materialization for classes with a lightweight storage mechanism (bitfield, Set, or simple array) directly on the Element instance:

// instead of (creates ~6 Element objects):
newMemberElement.classes.push('fixed-field');

// use a lightweight property (|= sets the bit without clearing other class bits):
newMemberElement._classes |= FIXED_FIELD_BIT;
// or
newMemberElement._classList = ['fixed-field'];

The same principle applies to all meta usage — meta currently uses the full Element tree to store what could be a simple key-value lookup.
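One way the bitfield variant could look is sketched below. Nothing here is existing ApiDOM API: `LightElement`, `CLASS_BITS`, `addClass`, `hasClass`, and `classList` are hypothetical names, and the approach assumes the set of class names is closed and known up front (an array or Set would be needed for arbitrary names).

```javascript
// Hypothetical bitfield-backed class storage; not part of the current ApiDOM API.
const CLASS_BITS = {
  'fixed-field': 1 << 0,
  'patterned-field': 1 << 1,
  'specification-extension': 1 << 2,
};

class LightElement {
  constructor() {
    this._classBits = 0; // one small integer instead of a materialized meta tree
  }

  addClass(name) {
    const bit = CLASS_BITS[name];
    if (bit === undefined) throw new RangeError(`unknown class: ${name}`);
    this._classBits |= bit; // set the bit, keep any others that are already set
  }

  hasClass(name) {
    const bit = CLASS_BITS[name];
    return bit !== undefined && (this._classBits & bit) !== 0;
  }

  // Names are only materialized when a caller asks for them.
  get classList() {
    return Object.keys(CLASS_BITS).filter((name) => this.hasClass(name));
  }
}

const lightMember = new LightElement();
lightMember.addClass('fixed-field');
console.log(lightMember.hasClass('fixed-field'));     // true
console.log(lightMember.hasClass('patterned-field')); // false
console.log(lightMember.classList);                   // [ 'fixed-field' ]
```

A compatibility getter could still build the existing `classes` ArrayElement on demand, so only code that actually reads it pays the materialization cost.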

2. Lightweight schema metadata (no mutation required)

Savings: ~272 MB

inheritedDialectIdentifier and ancestorsSchemaIdentifiers are stored via meta.set() on every Schema element, creating ~12 Element objects per schema. These could use a plain Map or direct properties while keeping the same self-contained design (schemas work in isolation without walking the parent chain).
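A rough sketch of the side-table idea follows. It swaps the plain Map mentioned above for a WeakMap so that entries are released together with their schema element; that substitution, the helper names `setSchemaMeta` and `getSchemaMeta`, and the example identifier values are all assumptions made for illustration.

```javascript
// Hypothetical side table for schema metadata, keyed by element instance.
// A WeakMap is used so an entry does not keep its schema element alive.
const schemaMeta = new WeakMap();

function setSchemaMeta(schema, key, value) {
  let record = schemaMeta.get(schema);
  if (record === undefined) {
    record = {}; // one plain object per schema, no Element wrappers
    schemaMeta.set(schema, record);
  }
  record[key] = value;
}

function getSchemaMeta(schema, key) {
  const record = schemaMeta.get(schema);
  return record === undefined ? undefined : record[key];
}

// Each schema still carries its own copy of the inherited values,
// so it can be read in isolation without walking up to its parent.
const parentSchema = { element: 'schema' };
const childSchema = { element: 'schema' };
setSchemaMeta(parentSchema, 'inheritedDialectIdentifier', 'https://example.com/dialect');
setSchemaMeta(parentSchema, 'ancestorsSchemaIdentifiers', ['https://example.com/root']);
setSchemaMeta(
  childSchema,
  'inheritedDialectIdentifier',
  getSchemaMeta(parentSchema, 'inheritedDialectIdentifier'),
);
setSchemaMeta(childSchema, 'ancestorsSchemaIdentifiers', [
  ...getSchemaMeta(parentSchema, 'ancestorsSchemaIdentifiers'),
  'https://example.com/root/child',
]);
console.log(getSchemaMeta(childSchema, 'ancestorsSchemaIdentifiers').length); // 2
```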

3. Progressive erasure with mutable opt-in (requires controlled mutation)

Savings: ~408 MB peak memory

As the visitor processes each generic node and creates the semantic equivalent, null out the generic node so GC can reclaim it:

Before refraction:   [generic: 100%] [semantic: 0%]    = 1x
Mid refraction:      [generic: 50%]  [semantic: 50%]   = 1x
After refraction:    [generic: 0%]   [semantic: 100%]  = 1x

This eliminates the 2x peak from holding both trees simultaneously. Combined with skipping cloneDeep (move semantics instead of copy), the generic tree's elements are transferred to the semantic tree rather than duplicated.

Design: refract*() functions stay immutable by default. Parser adapters opt into mutable/consuming mode since they own the generic tree and know nobody else holds a reference:

// public API: immutable (safe for direct callers)
refractOpenApi3_1(genericTree);

// parser adapter internals: mutable (memory efficient)
refractOpenApi3_1(result, { consume: true });

This extends the same serial consume-and-discard pattern already used by CSTTransformer and YAMLASTTransformer one step further to the generic → semantic boundary.
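The consuming mode can be illustrated on a toy tree. Only the `consume` option name comes from the proposal above; `refractToy`, `makeTree`, and the `{ type, children }` node shape are hypothetical and far simpler than real ApiDOM elements.

```javascript
// Toy refractor showing the `consume` idea on a { type, children } tree.
function refractToy(generic, { consume = false } = {}) {
  const semantic = { type: `semantic:${generic.type}`, children: [] };
  const children = generic.children ?? [];
  for (let i = 0; i < children.length; i += 1) {
    const child = children[i];
    if (consume) children[i] = null; // drop the generic reference as soon as it is taken
    semantic.children.push(refractToy(child, { consume }));
  }
  if (consume) generic.children = null; // a consumed node keeps nothing alive
  return semantic;
}

const makeTree = () => ({
  type: 'object',
  children: [{ type: 'string' }, { type: 'number' }],
});

// Default: the caller's generic tree is left untouched.
const kept = makeTree();
const semanticFromKept = refractToy(kept);
console.log(kept.children.length); // 2

// Consuming: the generic tree is emptied while the semantic tree is built.
const owned = makeTree();
const semanticFromOwned = refractToy(owned, { consume: true });
console.log(owned.children); // null
```

Both calls produce structurally identical semantic trees; the difference is only in what the generic tree still references afterwards, which is what lets the garbage collector reclaim it early.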

Combined impact estimate (73.4 MB file)

| Configuration | Total Heap | Multiplier |
|---|---|---|
| Original (no optimizations) | ~2,278 MB | ~31x |
| + Cheap classes storage | ~1,289 MB | ~17.6x |
| + Lightweight schema metadata | ~1,017 MB | ~13.9x |
| + Progressive erasure (consume mode) | ~609 MB | ~8.3x |
| Theoretical floor (semantic tree only) | ~440 MB | ~6x |
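The cumulative rows follow from subtracting each estimated saving in turn from the ~2,278 MB baseline and dividing by the 73.4 MB source size. The short script below redoes that arithmetic with the figures quoted in this issue; the variable names are illustrative only.

```javascript
// Recomputes the cumulative estimate from the savings quoted in this issue.
const SOURCE_MB = 73.4;
const BASELINE_MB = 2278; // realistic single call to refractOpenApi3_1

const savings = [
  ['Cheap classes storage', 989],
  ['Lightweight schema metadata', 272],
  ['Progressive erasure (consume mode)', 408],
];

let heapMb = BASELINE_MB;
const estimate = [
  { configuration: 'Original', heapMb, multiplier: (heapMb / SOURCE_MB).toFixed(1) },
];
for (const [configuration, savedMb] of savings) {
  heapMb -= savedMb; // each optimization is applied on top of the previous ones
  estimate.push({ configuration, heapMb, multiplier: (heapMb / SOURCE_MB).toFixed(1) });
}

console.table(estimate);
// heapMb column: 2278, 1289, 1017, 609
// multiplier column: 31.0, 17.6, 13.9, 8.3
```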
