Inline small contents in nodes (cleaned up)#2400
Open
balat wants to merge 29 commits intomirage:eiofrom
Open
Conversation
30b8b6f to
b736af0
Compare
f1f18be to
7d0e701
Compare
8 tasks
Add support for inlining small content values directly within parent node entries (Inode_v3 format with magic bytes S/T), rather than storing them as separate pack file entries. Co-Authored-By: gwenaelle <charles-edouard.lecat@epitech.eu> Co-Authored-By: Cuihtlauac ALVARADO <cuihtlauac@tarides.com> Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Benchmark compares write and read performance with and without inline_contents enabled, for different content sizes (small, medium, large). Usage: dune exec bench/irmin-pack/bench_inlined_contents.exe -- -n 2000 -c 5 -s small Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This simplifies the tree type by removing Contents_inlined from the internal representation. The inlining decision is now made at export time based on serialized content size. Changes: - Remove Contents_inlined from tree.t, tree.elt, and related types - Add should_inline_contents helper in export function - Inline contents <= 16 bytes at serialization time - Simplify pattern matches throughout the codebase - Remove debug print statements The backend kinded_hash/kinded_key types still support Contents_inlined for the on-disk format. Import functions convert inlined contents back to regular Contents values. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use Contents.to_value instead of cached_value to ensure value is available - Add 2 bytes overhead to size check (1 byte variant tag + 1 byte varint length) - This fixes the inlining structure test which expects 3 inlined contents Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
New benchmark suite in test/irmin-pack/bench_inline/ to measure inline contents performance across different size distributions: - Uniform (small, large, around-threshold) - Bimodal (mostly-small, mostly-large) - Log-normal (small, medium) - Zipfian (small, medium, steep) Metrics collected: - Store size on disk - Write time - Read latency percentiles (p50, p99, p999) - Inlined vs non-inlined content counts Usage: dune exec test/irmin-pack/bench_inline/main.exe -- run --all dune exec test/irmin-pack/bench_inline/main.exe -- sweep -t 16 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Benchmarks across multiple size distributions (zipfian, log-normal, bimodal) showed that 48 bytes provides better read latency improvement than 16 bytes across most workloads. Changes: - Update default inline_contents_max_bytes from 16 to 48 - Make threshold configurable via set_inline_contents_max_bytes - Add get_inline_contents_max_bytes to query current threshold - Add 'optimize' command to benchmark for threshold tuning - Update test expectations for new threshold Benchmark results showed: - zipfian-steep: p99 latency 139µs -> 84µs at threshold=48 - zipfian-small: p99 latency 166µs -> 108µs at threshold=48 - log-normal-small: p99 latency 229µs -> 111µs at threshold=48 - mostly-small: p99 latency 162µs -> 87µs at threshold=48 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add README.md with usage instructions for all benchmark commands - Add gnuplot scripts for generating comparison charts - Add sample data file and generated PNG plots Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add bench_inline_storage.gp gnuplot script for storage visualization - Add generate_data.sh to parse CSV output into gnuplot data files - Add 'make bench-inline' target to run benchmarks and generate plots - Update README with storage impact documentation and quick start Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add flush before close in Repo.close for read-write stores. Previously, direct node operations that didn't go through batch() could leave unflushed data, causing close to fail with Pending_flush. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The kinded_key type changed to support inline contents: - `Contents of key * metadata` (was `Contents of key`) - `Contents_inlined of string * metadata` (new) - `Node of key * key list` (was `Node of key`) This commit fixes all call sites to use the new tuple format: - save_tree now returns correct kinded_key with metadata - Store.key returns full Tree.kinded_key - Pattern matches updated in examples and tests - stat.t output updated for Inode_v3 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This flag was added by the inline contents PR but disables all warnings-as-errors globally, masking legitimate warnings. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This debug executable with hardcoded /tmp path and lavyek dependency was introduced by mistake from another branch. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Debug breakpoints left over from development. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This safety check was commented out during inline contents development. Restore it to prevent persisting empty nodes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The v function was silently dropping Contents_inlined entries via filter_map, while add was already raising failwith. Align both to fail explicitly since inlined contents are not supported by the git backend. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of returning null for Contents_inlined, deserialize the inlined bytes back to a content value using of_bin_string. Note: a libirmin test for this case is still missing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The `Node of node_key * contents_key list` type was a vestige of an earlier design where inlining was decided at tree construction time. After the inlining decision was moved to the export/serialization layer (commit 76a3de3), this list was never populated (always []) but was threaded through the entire codebase. Remove it along with the related `to_inline` parameter from of_seq, of_list, and other functions. This eliminates all "TODO inline" comments that were marking the unpopulated list sites. The `Node` variant is now simply `Node of node_key` in kinded_key, kinded_hash, Snapshot.kinded_hash, and Concrete.kinded_key types. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add inline_contents_max_bytes to Backend.S.Repo signature. Each backend returns its inlining threshold (0 = disabled). Tree.export reads it from the repo instead of global Atomic.t variables. This fixes the issue where opening two repos with different inline settings would cause the second to overwrite the first's config. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
With inlining disabled (default), the Hash_preimage produces the same pre-hash as before the inline contents feature. The new variants (Contents_inlined_hash, Contents_inlined_value) are appended at the end of the variant type, so existing tags are unchanged. Restore original V3 test archives and hash constants from the eio branch. Re-enable the From_v2 upgrade test case. Also fix remaining of_list [] calls missed by previous refactoring. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Update design doc: correct threshold (48 bytes, not 16), document per-repo config via Backend.Repo.inline_contents_max_bytes, remove references to old tree-level inlining design, update Node.value type - Update version-history: correct threshold description Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add inline_contents_max_bytes configuration key to Irmin_pack.Conf (default 48). This replaces the previously hardcoded threshold. Usage: Irmin_pack.config ~inline_contents:true ~inline_contents_max_bytes:32 root Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The inline contents PR replaced Irmin_tezos.Conf with local copies that had contents_length_header = None instead of Some `Varint. This was unnecessary since irmin-tezos is already a dependency. Also remove commented-out dead code from show.ml. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test the inlined-x-i / inlined-x-d serialization variants which encode inlined contents with explicit (non-default) metadata. These code paths were previously untested. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Validate that inline_contents_max_bytes is non-negative to prevent invalid configurations that could lead to unexpected behavior. The validation is performed during configuration initialization and raises an Invalid_argument exception with a clear error message if the value is negative. Generated by Mistral Vibe. Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
Clarify that inline_contents_max_bytes is configurable and document the validation process added for this configuration parameter. Adds: - Example configuration code - Configuration validation section - Updated testing section to include validation tests - Clarification that the threshold can be customized per repository Generated by Mistral Vibe. Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
Verify that inline_contents_max_bytes=0 with inline_contents=true correctly disables inlining. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add inline_contents_max_bytes to client Repo (returns 0) - Simplify Node.merge type in client (key instead of key * key list) - Simplify node_with_inlined type alias in server (key instead of key * key list) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
7d0e701 to
6a552e8
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR continues and cleans up the inline contents feature for
irmin-pack, building on:I’ve tried to tidy up the commits as much as possible, avoiding back-and-forth changes wherever possible to make them easier to review
Small content values (configurable threshold, default 48 bytes serialized) are stored directly within parent node entries rather than as separate pack file entries.
Reviewing notes
The commits have been tidied up from #2390 but one deliberate back-and-forth remains in the history:
The first commit (
Implement inline contents) adds acontents_key listto theNodevariant:`Node of node_key * contents_key list. This was part of the original design by @clecat where the inlining decision was made at tree construction time — the list was meant to track which contents had been inlined within each node, so that the GC, integrity checks, and export could handle them.A later commit (
Move inlining decision to export/serialization layer) changed the architecture so inlining is decided at export time instead, viashould_inline_contents. This made the list redundant — it was never populated (always[]) but was still threaded through the entire codebase.The commit
Remove unused contents_key list from Node variantremoves it, simplifying 42 files and eliminating allTODO inlinecomments that marked the empty-list sites.We could not squash these together without also squashing all intermediate commits (benchmarks, threshold changes, etc.), which would have made the history less readable. We chose to keep the intermediate steps visible.
Changes relative to #2390
Cleanup
-warn-error -Afrom root dune configbin/(lavyek dependency, from another branch)assert falsedebug breakpointsIrmin_tezos.Confin bench and tezos_explorer (PR had replaced it with local copies using wrongcontents_length_header)Bug fixes
forbid_empty_dir_persistenceguard (was commented out)Contents_inlinedin irmin-git (was silently dropping data)contents_of_key(was returning null)Architecture improvements
contents_key listfromNodevariant — see reviewing notes aboveAtomic.twith per-repo config — addBackend.Repo.inline_contents_max_bytesso each repo carries its own inlining config. This fixes the issue where opening two repos with different settings would overwrite each other's config.inline_contents_max_bytestoIrmin_pack.Conf(default 48)inline_contents_max_bytesvaluesTests
From_v2upgrade test (was disabled)inlined-x-i/inlined-x-dvariants)inline_contents_max_bytesHistory cleanup
multicore-safetybranch)Atomic.tcommit (superseded by per-repo config)Formattingcommits into oneTest plan
From_v2upgrade test re-enabled and passingContents_inlined(requires ctypes, not available in current env)