Skip to content

Inline small contents in nodes (cleaned up)#2400

Open
balat wants to merge 29 commits intomirage:eiofrom
balat:cuihtlauac-inline-small-objects-v2
Open

Inline small contents in nodes (cleaned up)#2400
balat wants to merge 29 commits intomirage:eiofrom
balat:cuihtlauac-inline-small-objects-v2

Conversation

@balat
Copy link
Copy Markdown
Contributor

@balat balat commented Apr 3, 2026

Summary

This PR continues and cleans up the inline contents feature for irmin-pack, building on:

I’ve tried to tidy up the commits as much as possible, avoiding back-and-forth changes wherever possible to make them easier to review

Small content values (configurable threshold, default 48 bytes serialized) are stored directly within parent node entries rather than as separate pack file entries.

Reviewing notes

The commits have been tidied up from #2390 but one deliberate back-and-forth remains in the history:

  1. The first commit (Implement inline contents) adds a contents_key list to the Node variant: `Node of node_key * contents_key list. This was part of the original design by @clecat where the inlining decision was made at tree construction time — the list was meant to track which contents had been inlined within each node, so that the GC, integrity checks, and export could handle them.

  2. A later commit (Move inlining decision to export/serialization layer) changed the architecture so inlining is decided at export time instead, via should_inline_contents. This made the list redundant — it was never populated (always []) but was still threaded through the entire codebase.

  3. The commit Remove unused contents_key list from Node variant removes it, simplifying 42 files and eliminating all TODO inline comments that marked the empty-list sites.

We could not squash these together without also squashing all intermediate commits (benchmarks, threshold changes, etc.), which would have made the history less readable. We chose to keep the intermediate steps visible.

Changes relative to #2390

Cleanup

  • Remove -warn-error -A from root dune config
  • Remove debug executable from bin/ (lavyek dependency, from another branch)
  • Remove commented-out assert false debug breakpoints
  • Restore Irmin_tezos.Conf in bench and tezos_explorer (PR had replaced it with local copies using wrong contents_length_header)

Bug fixes

  • Restore forbid_empty_dir_persistence guard (was commented out)
  • Fail explicitly on Contents_inlined in irmin-git (was silently dropping data)
  • Deserialize inlined contents in libirmin contents_of_key (was returning null)

Architecture improvements

  • Remove contents_key list from Node variant — see reviewing notes above
  • Replace global Atomic.t with per-repo config — add Backend.Repo.inline_contents_max_bytes so each repo carries its own inlining config. This fixes the issue where opening two repos with different settings would overwrite each other's config.
  • Make threshold configurable — add inline_contents_max_bytes to Irmin_pack.Conf (default 48)
  • Validate configuration — reject negative inline_contents_max_bytes values

Tests

  • Re-enable From_v2 upgrade test (was disabled)
  • Restore original V3 test archives and hash constants (hashes are identical to pre-inlining when inlining is disabled)
  • Add test for inlining with non-default metadata (exercises inlined-x-i/inlined-x-d variants)
  • Add validation and edge case tests for inline_contents_max_bytes

History cleanup

  • Remove thread-safety commits (belong in multicore-safety branch)
  • Remove Atomic.t commit (superseded by per-repo config)
  • Squash first 6 implementation commits into one (too many back-and-forth changes to review separately)
  • Squash two Formatting commits into one

Test plan

  • All 258 irmin-pack tests pass
  • From_v2 upgrade test re-enabled and passing
  • Inline contents tests (7 tests): without/with inlining, equivalence, structure, metadata, validation, edge cases
  • libirmin C test for Contents_inlined (requires ctypes, not available in current env)

@balat balat force-pushed the cuihtlauac-inline-small-objects-v2 branch 3 times, most recently from 30b8b6f to b736af0 Compare April 3, 2026 12:40
@balat balat marked this pull request as ready for review April 3, 2026 12:42
@balat balat force-pushed the cuihtlauac-inline-small-objects-v2 branch from f1f18be to 7d0e701 Compare April 3, 2026 13:43
clecat and others added 24 commits April 15, 2026 12:10
Add support for inlining small content values directly within parent node entries (Inode_v3 format with magic bytes S/T), rather than storing them as separate pack file entries.

Co-Authored-By: gwenaelle <charles-edouard.lecat@epitech.eu>

Co-Authored-By: Cuihtlauac ALVARADO <cuihtlauac@tarides.com>

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Benchmark compares write and read performance with and without
inline_contents enabled, for different content sizes (small, medium, large).

Usage: dune exec bench/irmin-pack/bench_inlined_contents.exe -- -n 2000 -c 5 -s small

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This simplifies the tree type by removing Contents_inlined from the
internal representation. The inlining decision is now made at export
time based on serialized content size.

Changes:
- Remove Contents_inlined from tree.t, tree.elt, and related types
- Add should_inline_contents helper in export function
- Inline contents <= 16 bytes at serialization time
- Simplify pattern matches throughout the codebase
- Remove debug print statements

The backend kinded_hash/kinded_key types still support Contents_inlined
for the on-disk format. Import functions convert inlined contents back
to regular Contents values.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use Contents.to_value instead of cached_value to ensure value is available
- Add 2 bytes overhead to size check (1 byte variant tag + 1 byte varint length)
- This fixes the inlining structure test which expects 3 inlined contents

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
New benchmark suite in test/irmin-pack/bench_inline/ to measure
inline contents performance across different size distributions:

- Uniform (small, large, around-threshold)
- Bimodal (mostly-small, mostly-large)
- Log-normal (small, medium)
- Zipfian (small, medium, steep)

Metrics collected:
- Store size on disk
- Write time
- Read latency percentiles (p50, p99, p999)
- Inlined vs non-inlined content counts

Usage:
  dune exec test/irmin-pack/bench_inline/main.exe -- run --all
  dune exec test/irmin-pack/bench_inline/main.exe -- sweep -t 16

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Benchmarks across multiple size distributions (zipfian, log-normal,
bimodal) showed that 48 bytes provides better read latency improvement
than 16 bytes across most workloads.

Changes:
- Update default inline_contents_max_bytes from 16 to 48
- Make threshold configurable via set_inline_contents_max_bytes
- Add get_inline_contents_max_bytes to query current threshold
- Add 'optimize' command to benchmark for threshold tuning
- Update test expectations for new threshold

Benchmark results showed:
- zipfian-steep: p99 latency 139µs -> 84µs at threshold=48
- zipfian-small: p99 latency 166µs -> 108µs at threshold=48
- log-normal-small: p99 latency 229µs -> 111µs at threshold=48
- mostly-small: p99 latency 162µs -> 87µs at threshold=48

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add README.md with usage instructions for all benchmark commands
- Add gnuplot scripts for generating comparison charts
- Add sample data file and generated PNG plots

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add bench_inline_storage.gp gnuplot script for storage visualization
- Add generate_data.sh to parse CSV output into gnuplot data files
- Add 'make bench-inline' target to run benchmarks and generate plots
- Update README with storage impact documentation and quick start

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add flush before close in Repo.close for read-write stores.
Previously, direct node operations that didn't go through batch()
could leave unflushed data, causing close to fail with Pending_flush.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The kinded_key type changed to support inline contents:
- `Contents of key * metadata` (was `Contents of key`)
- `Contents_inlined of string * metadata` (new)
- `Node of key * key list` (was `Node of key`)

This commit fixes all call sites to use the new tuple format:
- save_tree now returns correct kinded_key with metadata
- Store.key returns full Tree.kinded_key
- Pattern matches updated in examples and tests
- stat.t output updated for Inode_v3

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This flag was added by the inline contents PR but disables all
warnings-as-errors globally, masking legitimate warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This debug executable with hardcoded /tmp path and lavyek dependency
was introduced by mistake from another branch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Debug breakpoints left over from development.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This safety check was commented out during inline contents development.
Restore it to prevent persisting empty nodes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The v function was silently dropping Contents_inlined entries via
filter_map, while add was already raising failwith. Align both to
fail explicitly since inlined contents are not supported by the
git backend.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of returning null for Contents_inlined, deserialize the
inlined bytes back to a content value using of_bin_string.

Note: a libirmin test for this case is still missing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The `Node of node_key * contents_key list` type was a vestige of an
earlier design where inlining was decided at tree construction time.
After the inlining decision was moved to the export/serialization
layer (commit 76a3de3), this list was never populated (always [])
but was threaded through the entire codebase.

Remove it along with the related `to_inline` parameter from of_seq,
of_list, and other functions. This eliminates all "TODO inline"
comments that were marking the unpopulated list sites.

The `Node` variant is now simply `Node of node_key` in kinded_key,
kinded_hash, Snapshot.kinded_hash, and Concrete.kinded_key types.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add inline_contents_max_bytes to Backend.S.Repo signature. Each
backend returns its inlining threshold (0 = disabled). Tree.export
reads it from the repo instead of global Atomic.t variables.

This fixes the issue where opening two repos with different inline
settings would cause the second to overwrite the first's config.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
With inlining disabled (default), the Hash_preimage produces the same
pre-hash as before the inline contents feature. The new variants
(Contents_inlined_hash, Contents_inlined_value) are appended at the
end of the variant type, so existing tags are unchanged.

Restore original V3 test archives and hash constants from the eio
branch. Re-enable the From_v2 upgrade test case.

Also fix remaining of_list [] calls missed by previous refactoring.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Update design doc: correct threshold (48 bytes, not 16), document
  per-repo config via Backend.Repo.inline_contents_max_bytes, remove
  references to old tree-level inlining design, update Node.value type
- Update version-history: correct threshold description

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add inline_contents_max_bytes configuration key to Irmin_pack.Conf
(default 48). This replaces the previously hardcoded threshold.

Usage:
  Irmin_pack.config ~inline_contents:true ~inline_contents_max_bytes:32 root

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The inline contents PR replaced Irmin_tezos.Conf with local copies
that had contents_length_header = None instead of Some `Varint.
This was unnecessary since irmin-tezos is already a dependency.

Also remove commented-out dead code from show.ml.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test the inlined-x-i / inlined-x-d serialization variants which
encode inlined contents with explicit (non-default) metadata. These
code paths were previously untested.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
balat and others added 5 commits April 15, 2026 12:10
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Validate that inline_contents_max_bytes is non-negative to prevent
invalid configurations that could lead to unexpected behavior.

The validation is performed during configuration initialization and
raises an Invalid_argument exception with a clear error message if
the value is negative.

Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
Clarify that inline_contents_max_bytes is configurable and document
the validation process added for this configuration parameter.

Adds:
- Example configuration code
- Configuration validation section
- Updated testing section to include validation tests
- Clarification that the threshold can be customized per repository

Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
Verify that inline_contents_max_bytes=0 with inline_contents=true
correctly disables inlining.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add inline_contents_max_bytes to client Repo (returns 0)
- Simplify Node.merge type in client (key instead of key * key list)
- Simplify node_with_inlined type alias in server (key instead of
  key * key list)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@balat balat force-pushed the cuihtlauac-inline-small-objects-v2 branch from 7d0e701 to 6a552e8 Compare April 22, 2026 16:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants