ApiDOM Memory Footprint Analysis
Test Subject
- File: GitHub REST API OpenAPI description (api.github.com.2022-11-28.deref.json)
- Format: OpenAPI 3.1.0, fully dereferenced
- Size: 73.4 MB (JSON), 54.2 MB (YAML equivalent)
- Node.js: v24.10.0, V8
Pipeline Overview
JSON path (strict mode)
JSON string → JSON.parse (POJO) → baseRefract (Generic ApiDOM) → refractOpenApi3_1 (Semantic ApiDOM)
YAML path (tree-sitter)
YAML string → tree-sitter CST → CSTTransformer → YAML AST → YAMLASTTransformer → Generic ApiDOM → refractOpenApi3_1 (Semantic ApiDOM)
CSTTransformer and YAMLASTTransformer run serially — each stage is consumed before the next produces output. CST and YAML AST do not accumulate in memory simultaneously.
Measurement Results
JSON Pipeline (73.4 MB file)
| Stage | Incremental Cost | Cumulative Heap | Multiplier vs String |
|---|---|---|---|
| Raw string | +147 MB | 147 MB | 2.0x |
| JSON.parse (POJO) | +52 MB | 199 MB | 2.7x |
| baseRefract (generic ApiDOM) | +408 MB | 607 MB | 8.3x |
| refractOpenApi3_1 (semantic ApiDOM) | +2,064 MB | 2,671 MB | 36.4x |
Note: the semantic refraction delta (+2,064 MB) includes an internal baseRefract call (~408 MB for a second generic tree) plus the semantic tree construction.
A realistic single call to refractOpenApi3_1 (that is, without the separate explicit baseRefract call measured above) costs ~2,278 MB in total, ~31x the string.
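Per-stage figures like the ones above can be gathered with a small harness along the following lines. This is a minimal sketch, not the script used for these measurements: `heapUsedMB` and `measureStages` are hypothetical helpers, the only stage shown is `JSON.parse` on a synthetic document, and the cumulative column is relative to the heap after the input string already exists. A forced GC is only available when Node.js is started with `--expose-gc`.

```javascript
// Hypothetical harness: records V8 heap growth after each pipeline stage.
// Run with `node --expose-gc` so that gc() exists; without it the numbers
// also include garbage that has not been collected yet.
function heapUsedMB() {
  if (typeof globalThis.gc === 'function') globalThis.gc();
  return process.memoryUsage().heapUsed / (1024 * 1024);
}

function measureStages(input, stages) {
  const rows = [];
  const retained = [input]; // keep every stage output alive, as the tables assume
  const baseline = heapUsedMB();
  let previous = baseline;
  let value = input;
  for (const [name, fn] of stages) {
    value = fn(value);
    retained.push(value);
    const current = heapUsedMB();
    rows.push({
      stage: name,
      incrementalMB: current - previous,
      cumulativeMB: current - baseline,
    });
    previous = current;
  }
  return { rows, retained };
}

// Synthetic stand-in for the OpenAPI document.
const sample = JSON.stringify({
  paths: Object.fromEntries(
    Array.from({ length: 1000 }, (_, i) => [`/p${i}`, { get: { summary: `op ${i}` } }]),
  ),
});

const { rows, retained } = measureStages(sample, [
  ['JSON.parse (POJO)', (text) => JSON.parse(text)],
  // In the real pipeline the next stages would be baseRefract and refractOpenApi3_1.
]);
console.table(rows);
```

Keeping every intermediate result in `retained` matters: if an earlier stage's output becomes unreachable, the GC reclaims it and the cumulative column no longer reflects what a caller holding all stages would pay.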
YAML Pipeline (0.73 MB file, 30 paths subset)
| Stage | No Source Maps | With Source Maps |
|---|---|---|
| CST (tree-sitter) | ~0 MB | ~0 MB |
| Generic ApiDOM | +6.5 MB | +6.2 MB |
| Semantic ApiDOM | +21.0 MB | +20.4 MB |
| Total | 27.2 MB (37x) | 26.6 MB (36x) |
tree-sitter-yaml limitation: max 32,768 lines per file (see tree-sitter-yaml#35).
Key Observations
- CST cost is negligible: tree-sitter allocates in native/WASM memory, not on the V8 heap.
- Source maps add almost nothing: the 6 number fields (startLine through endOffset) per element are cheap.
- Semantic refraction dominates: ~20-21 MB for the same data regardless of JSON or YAML origin.
- Multiplier vs string differs by format: YAML shows ~37x, JSON shows ~59x for the same data, because YAML strings are larger (more whitespace) for identical content.
Element Counts (73.4 MB file)
| Metric | Generic Tree | Semantic Tree |
|---|---|---|
| Total elements | 3,003,457 | 3,003,457 |
| MemberElements | 888,408 | 888,408 |
| Meta materialized | 0 | 1,491,815 |
| Attributes materialized | 0 | 0 |
The element counts are identical. The difference is entirely in meta materialization.
Root Cause: Meta Materialization
How meta materialization works
Each Element has a lazily-initialized _meta property. When code accesses .classes, .meta.set(), or similar, it triggers creation of a full ObjectElement tree:
_meta = ObjectElement (~80 bytes)
└── MemberElement (~80 bytes)
└── KeyValuePair (~32 bytes)
├── StringElement (key) (~80 bytes)
└── ArrayElement (value) (~80 bytes)
└── StringElement (~80 bytes)
~6 Element objects (~430 bytes) per materialization.
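The lazy-initialization cost described above can be illustrated with a toy model. The classes below (`ToyElement`, `ToyKeyValuePair`) are hypothetical stand-ins written for this sketch and are not ApiDOM's real implementation; they only mirror the shape of the tree shown above so the allocations can be counted.

```javascript
// Toy model, not ApiDOM's real classes: counts how many objects are
// allocated the first time an element's classes are touched.
let allocated = 0;

class ToyKeyValuePair {
  constructor(key, value) {
    allocated += 1;
    this.key = key;
    this.value = value;
  }
}

class ToyElement {
  constructor(content) {
    allocated += 1;
    this.content = content;
    this._meta = undefined; // nothing is allocated until first access
  }

  get meta() {
    if (this._meta === undefined) {
      this._meta = new ToyElement([]); // stands in for the ObjectElement
    }
    return this._meta;
  }

  get classes() {
    let member = this.meta.content.find((m) => m.content.key.content === 'classes');
    if (member === undefined) {
      const key = new ToyElement('classes'); // StringElement (key)
      const value = new ToyElement([]); // ArrayElement (value)
      member = new ToyElement(new ToyKeyValuePair(key, value)); // MemberElement
      this.meta.content.push(member);
    }
    return member.content.value;
  }
}

const node = new ToyElement('get');
const before = allocated;
node.classes.content.push(new ToyElement('fixed-field')); // StringElement
const materialized = allocated - before;
console.log(`objects allocated by one classes.push: ${materialized}`);
```

In this model a single push of one class name allocates six objects, and later accesses to `classes` on the same node allocate nothing further, which is why the cost scales with the number of elements touched rather than the number of accesses.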
Top meta materialization sources
| Source | Elements Affected | Savings if Removed |
|---|---|---|
| FixedFieldsVisitor: newMemberElement.classes.push('fixed-field') | ~591K | -568 MB |
| PatternedFieldsVisitor: newMemberElement.classes.push('patterned-field') | (included above) | (included above) |
| JSONSchemaVisitor.handleDialectIdentifier: this.element.meta.set('inheritedDialectIdentifier', ...) | ~251K | -272 MB |
| JSONSchemaVisitor.handleSchemaIdentifier: this.element.meta.set('ancestorsSchemaIdentifiers', ...) | (included above) | (included above) |
| copyMetaAndAttributes propagation | N/A | -149 MB |
| Other visitors (specification-extension, content, parameters, reference-element, path-template, etc.) | ~649K remaining | ~200 MB |
Meta materialization by element type (semantic tree)
| Element Type | Count |
|---|---|
| schema | 251,530 |
| member | 233,163 |
| string | 224,913 |
| array | 159,347 |
| object | 22,135 |
| response | 3,171 |
| mediaType | 2,992 |
| other | 3,317 |
| Total | 900,537 (after disabling fixed/patterned field classes) |
Memory Breakdown (73.4 MB file, original unpatched)
| Component | MB | % of Total |
|---|---|---|
| Raw string | 147 | 5% |
| POJO (JSON.parse) | 52 | 2% |
| Generic tree (from explicit baseRefract call) | 408 | 15% |
| Generic tree (internal to refractOpenApi3_1) | ~408 | 15% |
| Semantic tree (deep copy of elements) | ~408 | 15% |
| Meta materialization (~9M extra Element objects) | ~989 | 37% |
| Visitor instances + traversal overhead | ~274 | 10% |
| Total | ~2,686 | 100% |
Meta materialization accounts for 37% of total memory.
Why cloneDeep Is Not the Problem
Making cloneDeep a no-op (identity function) saved only ~1 MB. This is because:
- With cloneDeep active: originals are created, clones are created, originals are GC'd after refraction. Final state: clones.
- With cloneDeep as no-op: originals are shared into the semantic tree. No clones created, but originals can't be GC'd (still referenced). Final state: originals.
Same number of live objects in both cases. cloneDeep determines which objects survive, not how many.
Why Generic ApiDOM Is ~8x the POJO
Every JSON object property "key": value creates 4 heap objects:
MemberElement (~80 bytes)
KeyValuePair (~32 bytes)
StringElement (~80 bytes) ← key
StringElement (~80 bytes) ← value
vs. a POJO property: one hidden class slot (~8 bytes) + string pointer.
This 8x overhead is reasonable for a fully-typed element tree where every node is individually addressable and can carry metadata and source positions.
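As a rough cross-check, the approximate per-object sizes listed above can be multiplied by the MemberElement count from the element-count table. This is a back-of-the-envelope sketch only: it assumes every member has a string value (many values are really objects or arrays), uses the approximate byte sizes quoted above, and treats 1 MB as 10^6 bytes.

```javascript
// Back-of-the-envelope estimate from figures quoted in this analysis.
// Assumption: every property value is a StringElement, which is not true
// for nested objects and arrays, so treat the result as a rough estimate.
const approxBytes = {
  MemberElement: 80,
  KeyValuePair: 32,
  keyStringElement: 80,
  valueStringElement: 80,
};

const bytesPerProperty = Object.values(approxBytes).reduce((sum, n) => sum + n, 0);
const memberElements = 888408; // MemberElements in the 73.4 MB file
const totalBytes = bytesPerProperty * memberElements;

console.log(`~${bytesPerProperty} bytes per property`);
console.log(`~${(totalBytes / 1e6).toFixed(1)} MB for all object properties`);
```

Under these assumptions object properties alone would account for a large share of the ~408 MB generic tree, with the remainder going to array items, container elements, and string contents.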
Optimization Opportunities
1. Cheap classes storage (highest impact, no mutation required)
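Savings: up to ~989 MB (37% of total) if the same approach is applied to all meta usage; ~568 MB from the fixed-field and patterned-field classes alone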
Replace the full ObjectElement-based meta materialization for classes with a lightweight storage mechanism (bitfield, Set, or simple array) directly on the Element instance:
// instead of (creates ~6 Element objects):
newMemberElement.classes.push('fixed-field');
// use a lightweight property (OR the bit in so other class bits are preserved):
newMemberElement._classes |= FIXED_FIELD_BIT;
// or
newMemberElement._classList = ['fixed-field'];
The same principle applies to all meta usage: meta currently uses the full Element tree to store what could be a simple key-value lookup.
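The bitfield variant could look roughly like the sketch below. `LightElement`, `CLASS_BITS`, `addClass`, `hasClass`, and `classList` are hypothetical names chosen for illustration and are not part of ApiDOM; the set of class names is also an assumption.

```javascript
// Hypothetical sketch: class names stored as bits in one small integer per
// element, so tagging an element allocates no additional heap objects.
const CLASS_BITS = Object.freeze({
  'fixed-field': 1 << 0,
  'patterned-field': 1 << 1,
  'specification-extension': 1 << 2,
});

class LightElement {
  constructor(content) {
    this.content = content;
    this._classBits = 0;
  }

  addClass(name) {
    const bit = CLASS_BITS[name];
    if (bit === undefined) throw new RangeError(`unknown class: ${name}`);
    this._classBits |= bit; // OR keeps any bits that are already set
  }

  hasClass(name) {
    const bit = CLASS_BITS[name];
    return bit !== undefined && (this._classBits & bit) !== 0;
  }

  // Names are materialized only when a caller actually asks for them.
  get classList() {
    return Object.keys(CLASS_BITS).filter((name) => (this._classBits & CLASS_BITS[name]) !== 0);
  }
}

const member = new LightElement('summary');
member.addClass('fixed-field');
console.log(member.hasClass('fixed-field'), member.classList);
```

A bitfield only covers a fixed, known set of class names (at most 31 with JavaScript's 32-bit bitwise operators), so arbitrary user-supplied classes would still need a fallback such as a plain array.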
2. Lightweight schema metadata (no mutation required)
Savings: ~272 MB
inheritedDialectIdentifier and ancestorsSchemaIdentifiers are stored via meta.set() on every Schema element, creating ~12 Element objects per schema. These could use a plain Map or direct properties while keeping the same self-contained design (schemas work in isolation without walking the parent chain).
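One possible shape for this, using direct properties, is sketched below. `LightSchema` and `annotate` are hypothetical and do not reflect ApiDOM's visitor code; in particular, the rule used here for building the ancestor list is an assumption made for illustration.

```javascript
// Hypothetical sketch: schema metadata held in plain fields rather than in
// an Element tree created through meta.set().
class LightSchema {
  constructor() {
    this.inheritedDialectIdentifier = undefined; // plain string
    this.ancestorsSchemaIdentifiers = []; // plain array of strings, treated as immutable
  }
}

// Copies the values down from the parent so each schema stays self-contained
// and never needs to walk the parent chain later.
function annotate(schema, parent, ownId, defaultDialect) {
  schema.inheritedDialectIdentifier = parent
    ? parent.inheritedDialectIdentifier
    : defaultDialect;
  const inherited = parent ? parent.ancestorsSchemaIdentifiers : [];
  // Schemas without their own identifier share the parent's array.
  schema.ancestorsSchemaIdentifiers = ownId === undefined ? inherited : [...inherited, ownId];
  return schema;
}

const dialect = 'https://spec.openapis.org/oas/3.1/dialect/base';
const root = annotate(new LightSchema(), undefined, 'https://example.com/root', dialect);
const child = annotate(new LightSchema(), root, undefined, dialect);
console.log(child.inheritedDialectIdentifier, child.ancestorsSchemaIdentifiers);
```

Because nested schemas without their own identifier reuse the parent's array, the per-schema cost in this sketch drops to two property slots, provided no code mutates the shared array in place.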
3. Progressive erasure with mutable opt-in (requires controlled mutation)
Savings: ~408 MB peak memory
As the visitor processes each generic node and creates the semantic equivalent, null out the generic node so GC can reclaim it:
Before refraction: [generic: 100%] [semantic: 0%] = 1x
Mid refraction: [generic: 50%] [semantic: 50%] = 1x
After refraction: [generic: 0%] [semantic: 100%] = 1x
This eliminates the 2x peak from holding both trees simultaneously. Combined with skipping cloneDeep (move semantics instead of copy), the generic tree's elements are transferred to the semantic tree rather than duplicated.
Design: refract*() functions stay immutable by default. Parser adapters opt into mutable/consuming mode since they own the generic tree and know nobody else holds a reference:
// public API: immutable (safe for direct callers)
refractOpenApi3_1(genericTree);
// parser adapter internals: mutable (memory efficient)
refractOpenApi3_1(result, { consume: true });
This extends the same serial consume-and-discard pattern already used by CSTTransformer and YAMLASTTransformer one step further to the generic → semantic boundary.
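The consume-and-discard idea can be shown on a toy tree. The sketch below is an illustration only: `makeNode` and `toSemantic` are hypothetical, the node shape is invented for this example, and the real refractors would also need move semantics in place of cloneDeep, which this model does not cover.

```javascript
// Toy model of the proposed consume mode, not ApiDOM's implementation.
function makeNode(kind, value, children = []) {
  return { kind, value, children };
}

function toSemantic(generic, { consume = false } = {}) {
  const semantic = { kind: `semantic:${generic.kind}`, value: generic.value, children: [] };
  const { children } = generic;
  for (let i = 0; i < children.length; i += 1) {
    semantic.children.push(toSemantic(children[i], { consume }));
    // Drop the reference as soon as the child has been converted so the GC
    // can reclaim the generic subtree while refraction is still running.
    if (consume) children[i] = null;
  }
  return semantic;
}

const makeTree = () =>
  makeNode('object', null, [makeNode('member', 'openapi'), makeNode('member', 'paths')]);

const kept = makeTree();
const semanticA = toSemantic(kept); // default: input left intact

const consumed = makeTree();
const semanticB = toSemantic(consumed, { consume: true }); // input hollowed out

console.log(kept.children.every((c) => c !== null));
console.log(consumed.children.every((c) => c === null));
```

Defaulting `consume` to `false` keeps direct callers safe; only code that owns the generic tree exclusively, such as a parser adapter, would pass `{ consume: true }`.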
Combined impact estimate (73.4 MB file)
| Configuration | Total Heap | Multiplier |
|---|---|---|
| Original (no optimizations) | ~2,278 MB | ~31x |
| + Cheap classes storage | ~1,289 MB | ~17.6x |
| + Lightweight schema metadata | ~1,017 MB | ~13.9x |
| + Progressive erasure (consume mode) | ~609 MB | ~8.3x |
| Theoretical floor (semantic tree only) | ~440 MB | ~6x |
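The rows above follow from subtracting each estimated saving from the original total in turn. The short script below reproduces that arithmetic from the figures quoted in this analysis; the variable names are illustrative.

```javascript
// Reproduces the combined-impact rows from the per-optimization savings.
const fileSizeMB = 73.4;
const originalHeapMB = 2278;
const savingsMB = [
  ['+ Cheap classes storage', 989],
  ['+ Lightweight schema metadata', 272],
  ['+ Progressive erasure (consume mode)', 408],
];

const toRow = (configuration, heapMB) => ({
  configuration,
  heapMB,
  multiplier: Number((heapMB / fileSizeMB).toFixed(1)),
});

let heapMB = originalHeapMB;
const impact = [toRow('Original (no optimizations)', heapMB)];
for (const [configuration, saved] of savingsMB) {
  heapMB -= saved; // savings are applied cumulatively, in table order
  impact.push(toRow(configuration, heapMB));
}
console.table(impact);
```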