@samuelstroschein samuelstroschein commented Sep 5, 2025

Closes opral/lix-sdk#366.

Reduces the changes per mutation from 30 to 3, a 90% decrease. Besides the storage reduction, benchmarks improved across the board:

Commit ca. 1.3x faster

     name                                                      hz      min      max     mean      p75      p99     p995     p999      rme  samples
   · commit transaction with 100 rows                     13.6236  71.6583  76.1044  73.4020  73.9867  76.1044  76.1044  76.1044   ±1.28%       10  [1.33x] ⇑
     commit transaction with 100 rows                     10.2707  94.0025   102.49  97.3643  99.7401   102.49   102.49   102.49   ±2.43%       10  (baseline)
   · commit 10 transactions x 10 changes (sequential)      9.7247  99.4000   110.49   102.83   103.38   110.49   110.49   110.49   ±2.16%       10  [1.33x] ⇑
     commit 10 transactions x 10 changes (sequential)      7.2879   125.19   191.97   137.21   133.45   191.97   191.97   191.97  ±10.12%       10  (baseline)

Version diffs ca. 3.5x faster

 ✓ src/version/select-version-diff.bench.ts 5721ms
     name                                        hz     min     max    mean     p75     p99    p995    p999     rme  samples
   · selectVersionDiff (exclude unchanged)   7.3192  134.54  139.88  136.63  138.21  139.88  139.88  139.88  ±1.04%       10  [3.50x] ⇑
     selectVersionDiff (exclude unchanged)   2.0934  468.55  495.41  477.68  478.46  495.41  495.41  495.41  ±1.14%       10  (baseline)
   · selectVersionDiff (full document diff)  7.2150  135.61  143.04  138.60  138.95  143.04  143.04  143.04  ±1.07%       10  [3.51x] ⇑
     selectVersionDiff (full document diff)  2.0576  477.76  500.08  486.01  488.02  500.08  500.08  500.08  ±0.96%       10  (baseline)

Write Amplification → Incremental Optimization Plan

Goal: Reduce change rows per user mutation while staying compatible with today’s flows (create-checkpoint, history-by-commit) and without introducing a new “commit package” until it clearly adds value.

Scope: 1 domain mutation on a single version, 1 author, typical commit (no merge).

Notation (per‑commit complexity)

  • D: number of domain changes (e.g., the user triggers a key-value update).
  • P: number of parent commits (size of parent_commit_ids). Usually 1; >1 for merges.
  • M: number of distinct authors associated with the commit.

Baseline (Status Quo)

Observed new change rows for a single domain mutation:

  • Total: 30
  • By schema (approximate from logs):
    • lix_key_value: 1
    • lix_change_author: 1
    • lix_change_set_element: 20
    • lix_change_set: 2
    • lix_commit: 2
    • lix_commit_edge: 2
    • lix_version: 2

Notes:

  • The 20 lix_change_set_element (CSE) rows account for most of the amplification (hot-path index writes).
  • The remaining rows are “meta” (commit, edges, version, change set), partly duplicated across scopes (e.g., version + global).

Step 1 — Derive CSEs (no hot‑path CSE writes)

  • Change: Stop inserting lix_change_set_element on commit. Derive domain‑only CSEs in a materializer/view from commit membership (initially commit.change_ids; later, the commit package’s domain_change_ids).
  • Rationale: CSEs are an index over change_set ↔ change and can be synthesized cheaply; writing them adds storage and index cost that grows linearly with the number of changes.
  • Compatibility:
    • state-history: It joins change_set_element_all. Provide a compatibility view that unions physical rows (for old commits) with derived rows (for new commits), or switch to the derived view behind the existing name.
    • create-checkpoint: Unchanged. It references commit.change_set_id and does not require physical CSE rows.
  • Estimated delta: −20 rows (30 → 10).
  • Materializer complexity: per commit O(D) to enumerate domain changes (unchanged), but removes O(D) storage writes and index maintenance. state-history can read derived CSEs in O(D) via JSON extraction instead of table joins.
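
As a rough illustration of Step 1, the sketch below derives CSE rows from commit membership in memory and unions them with legacy physical rows. It is not the lix-sdk implementation (which would more likely be a SQL view); the type names CommitSnapshot and ChangeSetElement, the function names, and the exact field layout are assumptions made for this example.

```typescript
// Hypothetical shapes for illustration; the real lix-sdk types may differ.
type CommitSnapshot = {
  id: string;
  change_set_id: string;
  change_ids: string[]; // commit membership (domain changes)
};

type ChangeSetElement = { change_set_id: string; change_id: string };

// Derive one CSE per member change instead of reading physical rows: O(D) per commit.
function deriveChangeSetElements(commits: CommitSnapshot[]): ChangeSetElement[] {
  const derived: ChangeSetElement[] = [];
  for (const commit of commits) {
    for (const change_id of commit.change_ids) {
      derived.push({ change_set_id: commit.change_set_id, change_id });
    }
  }
  return derived;
}

// Compatibility "view": physical rows (legacy commits) unioned with derived rows
// (new commits), de-duplicated on (change_set_id, change_id).
function changeSetElementAll(
  physical: ChangeSetElement[],
  commits: CommitSnapshot[],
): ChangeSetElement[] {
  const seen = new Set<string>();
  const out: ChangeSetElement[] = [];
  for (const row of [...physical, ...deriveChangeSetElements(commits)]) {
    const key = JSON.stringify([row.change_set_id, row.change_id]);
    if (!seen.has(key)) {
      seen.add(key);
      out.push(row);
    }
  }
  return out;
}
```

De-duplicating on the pair keeps the union safe if a commit ends up with both physical and derived rows during migration.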

Step 2 — Replace lix_commit_edge rows with derived edges

  • Change: Stop inserting lix_commit_edge rows. Keep parent_commit_ids inside the lix_commit snapshot and expose a view that explodes parents into an edge shape for queries.
  • Rationale: Parent edges are derivable from commit snapshots; no need for extra rows per parent.
  • Compatibility:
    • state-history: Today it joins commit_edge_all. Provide a compatibility view commit_edge_all that explodes commit_all.parent_commit_ids to (parent_id, child_id) so existing queries continue to work.
    • create-checkpoint: Still emits a parent relationship; it can write a no‑op (or rely on the derived view).
  • Estimated delta: −2 rows (10 → 8).
  • Materializer complexity: ancestry traversal stays O(P) parents per commit (unchanged), but eliminates O(P) storage writes and reduces join cost to JSON array scan (lower constants).
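
A minimal sketch of the edge derivation in Step 2, assuming commits are available as plain objects. The type names CommitRow and CommitEdge and the function name are hypothetical; the (parent_id, child_id) shape follows the compatibility view described above.

```typescript
// Hypothetical shapes for illustration; the real lix-sdk types may differ.
type CommitRow = { id: string; parent_commit_ids: string[] };
type CommitEdge = { parent_id: string; child_id: string };

// Explode each commit's parent_commit_ids into edge rows: O(P) per commit,
// with no edge rows written to storage.
function deriveCommitEdges(commits: CommitRow[]): CommitEdge[] {
  const edges: CommitEdge[] = [];
  for (const commit of commits) {
    for (const parent_id of commit.parent_commit_ids) {
      edges.push({ parent_id, child_id: commit.id });
    }
  }
  return edges;
}
```

A root commit with no parents contributes no edges, and a merge commit contributes one edge per parent.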

Step 3 — Drop Dual Commit (de‑duplicate commit, version, and change_set)

  • Change: Stop emitting the “global” duplicate for graph metadata. For each mutation, persist exactly one set of graph rows tied to the mutated version:
    • lix_commit: only the version’s commit (no global duplicate)
    • lix_version: only the mutated version’s tip move (no global duplicate)
    • lix_change_set: only one row per commit (no second/global duplicate)
  • Rationale: The dual‑commit model duplicates meta across scopes. The graph topology is global, but does not require duplicate change rows; views/materializer can project what’s needed.
  • Compatibility:
    • create-checkpoint: Unchanged. It updates version.commit_id and version.working_commit_id and labels the checkpoint.
    • state-history: Unchanged. Edges are derived from parent_commit_ids and CSEs are derived/materialized; both remain global in views/cache.
  • Estimated delta: −3 rows (8 → 5) from Step 2’s baseline.
  • Notes:
    • Make edge materialization unconditional: always materialize lix_commit_edge in the global scope from parent_commit_ids so cache/queries do not depend on a “global” commit change row.
  • Materializer complexity: no change in O(D + P); reduces constants by removing duplicate scans.

Result After Steps 1–3 (1 domain mutation)

  • lix_key_value: 1
  • lix_change_author: 1
  • lix_change_set: 1
  • lix_commit: 1 (public, minimal; includes change_set_id)
  • lix_version: 1
  • Derived (not inserted):
    • lix_change_set_element (global) from commit membership
    • lix_commit_edge from parent_commit_ids
  • Total inserted rows: ~5 (down from 30)
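
The row arithmetic across Steps 1 to 3 can be checked with a small tally. The counts are the approximate per-schema numbers listed in the baseline; the helper name total and the RowCounts type are only for this example.

```typescript
type RowCounts = Record<string, number>;

// Sum the inserted change rows across all schemas.
function total(rows: RowCounts): number {
  let n = 0;
  for (const schema of Object.keys(rows)) n += rows[schema] ?? 0;
  return n;
}

// Baseline counts for one domain mutation, as listed above (approximate, from logs).
const baseline: RowCounts = {
  lix_key_value: 1,
  lix_change_author: 1,
  lix_change_set_element: 20,
  lix_change_set: 2,
  lix_commit: 2,
  lix_commit_edge: 2,
  lix_version: 2,
};

// Step 1: CSEs are derived, not inserted.
const afterStep1: RowCounts = { ...baseline, lix_change_set_element: 0 };
// Step 2: commit edges are derived from parent_commit_ids.
const afterStep2: RowCounts = { ...afterStep1, lix_commit_edge: 0 };
// Step 3: no global duplicates for change_set, commit, and version.
const afterStep3: RowCounts = { ...afterStep2, lix_change_set: 1, lix_commit: 1, lix_version: 1 };

// Totals per stage: 30, 10, 8, 5.
const totals = [baseline, afterStep1, afterStep2, afterStep3].map(total);
console.log(totals.join(" → "));
```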

Step 4 (Optional) — Author normalization for multi‑change commits

  • Change: Keep commit‑level authors and expand to per‑change via a view joined through CSEs. Only implement if you want to shrink rows when a commit touches many domain changes with the same authors.
  • Rationale: For N changes and M authors, physical rows go from N×M → M.
  • Compatibility:
    • Per‑change authorship can remain materialized if desired; otherwise provide a compatibility view change_author_all.
  • Estimated delta: 0 for single‑change commits; potentially large savings for multi‑change commits.
  • Materializer complexity: per‑commit author aggregation can be computed in O(D + M) (domain changes + authors) via CSE join, instead of reading O(D×M) physical rows.
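
To make the N×M → M claim concrete, here is a sketch of expanding commit-level authors to per-change rows through CSEs. The types CseRow, CommitAuthors, and ChangeAuthorRow and the account_id field name are assumptions for this example, not the lix-sdk schema.

```typescript
// Hypothetical shapes for illustration; the real lix-sdk types may differ.
type CseRow = { change_set_id: string; change_id: string };
type CommitAuthors = { change_set_id: string; account_ids: string[] }; // M stored author entries per commit
type ChangeAuthorRow = { change_id: string; account_id: string };

// Compatibility "view": join commit-level authors to changes via CSEs.
// The view still yields N×M rows, but only M author entries are stored.
function expandChangeAuthors(commits: CommitAuthors[], cses: CseRow[]): ChangeAuthorRow[] {
  const authorsBySet = new Map<string, string[]>();
  for (const commit of commits) authorsBySet.set(commit.change_set_id, commit.account_ids);

  const rows: ChangeAuthorRow[] = [];
  for (const cse of cses) {
    for (const account_id of authorsBySet.get(cse.change_set_id) ?? []) {
      rows.push({ change_id: cse.change_id, account_id });
    }
  }
  return rows;
}
```

For a commit with 3 changes and 2 authors, the view produces 6 rows while storage holds only the 2 commit-level author entries.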

Step 5 (Later) — Introduce meta_change_ids

  • Change: Add meta_change_ids to lix_commit, so that the commit carries commit_id, parent_commit_ids, change_set_id, and a split membership: domain_change_ids vs. meta_change_ids.
  • Benefits:
    • Clear separation of “how we materialize” vs “what we expose”.
    • Enables further internal flexibility (e.g., deterministic working heads, background backfills) without API churn.
  • Estimated delta: No immediate row reduction versus Step 1 (CSEs already derived), but simplifies long‑term evolution.
  • Materializer complexity: keeps per‑commit apply at O(D + P); decoupling enables smaller, more targeted scans (lower constants) and easier checkpointing.
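
One possible shape for the split membership in Step 5 is sketched below. The field names commit_id, parent_commit_ids, change_set_id, domain_change_ids, and meta_change_ids come from the plan; the ChangeRef type, the schema_key field, and the isMetaSchema predicate are assumptions made for this example.

```typescript
// Hypothetical shapes for illustration; the real lix-sdk types may differ.
type ChangeRef = { id: string; schema_key: string };

type CommitMembership = {
  commit_id: string;
  parent_commit_ids: string[];
  change_set_id: string;
  domain_change_ids: string[]; // what is exposed as commit membership
  meta_change_ids: string[]; // graph/control changes used by the materializer
};

// Partition a commit's changes into domain vs. meta membership in one O(D) pass.
function buildCommitMembership(
  header: { commit_id: string; parent_commit_ids: string[]; change_set_id: string },
  changes: ChangeRef[],
  isMetaSchema: (schemaKey: string) => boolean, // assumed caller-supplied classification
): CommitMembership {
  const domain_change_ids: string[] = [];
  const meta_change_ids: string[] = [];
  for (const change of changes) {
    if (isMetaSchema(change.schema_key)) meta_change_ids.push(change.id);
    else domain_change_ids.push(change.id);
  }
  return { ...header, domain_change_ids, meta_change_ids };
}
```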

Potential follow‑up: Unify version pointers under control (tip)

  • Today, commit_id (version tip) lives in the control ledger (lix_version_tip), while working_commit_id lives in the descriptor (lix_version_descriptor).
  • This split forces write‑paths (e.g., commit.ts) to read descriptor just to obtain working_commit_id while also reading tip for commit_id.
  • Proposal: Move working_commit_id into the control plane (extend lix_version_tip or add a sibling control entity to carry the working pointer). Keep descriptor purely domain (id, name, inherits, hidden).
  • Benefits: Single source for pointers, simpler commit logic (no descriptor fetch for pointer logic), and a clearer domain/control separation consistent with commit‑anchored tips.

Compatibility Summary

  • create-checkpoint: Remains valid throughout. It needs a real commit with a change_set_id, and will continue to link the previous head as a checkpoint and create a new empty working commit.
  • state-history: Continue to “query by commit” by ensuring two compatibility views exist when steps land:
    • commit_edge_all (derived from commit_all.parent_commit_ids).
    • change_set_element_all (derived for new commits; union with physical for legacy).

Rollout Guidance

  • Ship Steps 1 and 2 first. Add compatibility views and tests; ensure commit edges are materialized globally from parent_commit_ids.
  • Ship Step 3 “Drop Dual Commit”: remove the global duplicates for commit/version/change_set and keep edge/CSE derivation intact.
  • Consider Step 5 when you want to slim public commit snapshots and decouple lineage/membership fully.

Cleanup TODOs

  • Merge filter cleanup: In packages/lix-sdk/src/version/merge-version.ts, we temporarily filter control/meta schemas out of winners/deletions to keep commit membership deterministic under cache-miss. Do not blanket-filter lix_*; some are valid domain (e.g., lix_key_value, lix_file_descriptor). Remove this filter after Step 5 introduces meta_change_ids and formally splits domain vs meta membership.
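
The difference between an explicit control/meta filter and a blanket lix_* filter can be sketched as follows. This is not the code in merge-version.ts; the set of schema keys is assumed from the schemas this plan calls "meta", and the MergeCandidate type and function name are hypothetical.

```typescript
// Assumed control/meta schema keys, based on the schemas this plan treats as meta.
const CONTROL_SCHEMA_KEYS = new Set<string>([
  "lix_commit",
  "lix_commit_edge",
  "lix_version",
  "lix_change_set",
  "lix_change_set_element",
]);

// Hypothetical shape of a merge winner/deletion candidate.
type MergeCandidate = { entity_id: string; schema_key: string };

// Filter by an explicit list rather than by the "lix_" prefix, so that valid
// domain schemas such as lix_key_value and lix_file_descriptor are kept.
function filterDomainCandidates(candidates: MergeCandidate[]): MergeCandidate[] {
  return candidates.filter((c) => !CONTROL_SCHEMA_KEYS.has(c.schema_key));
}
```

A prefix check such as schema_key.startsWith("lix_") would drop lix_key_value and lix_file_descriptor as well, which is the over-filtering this TODO warns against.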

@changeset-bot

changeset-bot bot commented Sep 5, 2025

⚠️ No Changeset found

Latest commit: 0c3e83c

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


@samuelstroschein samuelstroschein temporarily deployed to lixdk-460-incremental - lix-docs PR #3688 September 5, 2025 23:15 — with Render Destroyed
@nx-cloud

nx-cloud bot commented Sep 5, 2025

🤖 Nx Cloud AI Fix Eligible

An automatically generated fix could have helped fix failing tasks for this run, but Self-healing CI is disabled for this workspace. Visit workspace settings to enable it and get automatic fixes in future runs.

To disable these notifications, a workspace admin can disable them in workspace settings.


View your CI Pipeline Execution ↗ for commit 0c3e83c

Command Status Duration Result
nx run-many --target=test --parallel ❌ Failed 9m 4s View ↗
nx run-many --target=lint --parallel ✅ Succeeded 50s View ↗

☁️ Nx Cloud last updated this comment at 2025-09-05 23:26:53 UTC

@samuelstroschein samuelstroschein merged commit 2ecf4f5 into main Sep 6, 2025
2 of 4 checks passed
@samuelstroschein samuelstroschein deleted the lixdk-460-incremental branch September 6, 2025 00:02
@github-actions github-actions bot locked and limited conversation to collaborators Sep 6, 2025
Development

Successfully merging this pull request may close these issues.

introduce commit package entity to reduce storag by up to 80%