@aaronc
Member

@aaronc aaronc commented Dec 4, 2025

Description

This PR specifies the IAVLX KV data file format for storing the WAL as well as other key-value data (branch node keys and compacted changeset KV data), and implements the KVDataReader, KVDataWriter and WALReader types. It also adds the convenience FileWriter and Mmap wrapper types.

One design question for reviewers is whether we should proactively limit key and value size. I would suggest a key limit of 2^16-1 bytes (64 KB) and a value limit of 2^24-1 bytes (16 MB).

Currently, this KV data file uses 32-bit offsets, which limits its size to 4 GB before we have to roll over. When initially writing changesets, we should probably roll over at around 1 or 2 GB and then compact up to 4 GB. If, however, we ran out of space while writing a version, the node would crash non-deterministically. This is unlikely to happen if we roll over at 1 or 2 GB unless someone introduces some really large, unexpected KV data.

Setting a limit on key and value size would be consensus breaking (though unlikely to ever get triggered in practice), but it would make such pathological scenarios cause nodes to fail more deterministically, based on validation rather than on running out of disk space. We could also explore larger offsets of 40 to 64 bits, but the larger the kv.dat file is, the more extra disk space we need when doing compaction. Really large key/value data should probably be considered pathological anyway. Any thoughts on all of this?

"fmt"
"math"
"os"
"unsafe"

Check notice

Code scanning / CodeQL

Sensitive package import Note

Certain system packages contain functions which may be a possible source of non-determinism
@codecov

codecov bot commented Dec 5, 2025

Codecov Report

❌ Patch coverage is 85.25074% with 50 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.37%. Comparing base (fd82917) to head (0b14264).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| iavl/internal/kvdata_reader.go | 82.97% | 24 Missing ⚠️ |
| iavl/internal/kvdata_writer.go | 87.07% | 19 Missing ⚠️ |
| iavl/internal/file_writer.go | 71.42% | 4 Missing ⚠️ |
| iavl/internal/mmap.go | 93.33% | 2 Missing ⚠️ |
| iavl/internal/changeset_info.go | 85.71% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #25645      +/-   ##
==========================================
+ Coverage   70.26%   70.37%   +0.11%     
==========================================
  Files         835      842       +7     
  Lines       54361    54888     +527     
==========================================
+ Hits        38196    38628     +432     
- Misses      16165    16260      +95     
| Files with missing lines | Coverage Δ |
|---|---|
| iavl/internal/leaf_layout.go | 66.66% <ø> (ø) |
| iavl/internal/mem_node.go | 95.16% <ø> (+0.71%) ⬆️ |
| iavl/internal/changeset_info.go | 84.21% <85.71%> (-4.03%) ⬇️ |
| iavl/internal/mmap.go | 93.33% <93.33%> (ø) |
| iavl/internal/file_writer.go | 71.42% <71.42%> (ø) |
| iavl/internal/kvdata_writer.go | 87.07% <87.07%> (ø) |
| iavl/internal/kvdata_reader.go | 82.97% <82.97%> (ø) |

... and 4 files with indirect coverage changes

	if unsafe.Sizeof(ChangesetInfo{}) != sizeChangesetInfo {
		panic(fmt.Sprintf("invalid ChangesetInfo size: got %d, want %d", unsafe.Sizeof(ChangesetInfo{}), sizeChangesetInfo))
	}
}
Member Author

This was missing in the previous PR


// ValueOffset is the offset of the value data for this node in the key value data file.
// The same size considerations apply here as for KeyOffset.
ValueOffset uint32
Member Author

In order to efficiently cache keys, we need to allow key and value bytes to be non-contiguous in the data file. Adding a separate value offset allows us to put key and value data wherever we want to. Hopefully, the additional 4 bytes per leaf node is offset by more key caching in the kv data file.
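
As an illustration of why separate offsets help, the sketch below stores one key blob and lets two leaves point at it while each has its own value blob. The `leaf` struct is a simplified, hypothetical stand-in for the real leaf layout, and the blob encoding (an unsigned varint length from `encoding/binary` followed by raw bytes, with no entry type byte) is an assumption.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// leaf is a hypothetical, simplified stand-in for the real leaf layout.
type leaf struct {
	KeyOffset   uint32
	ValueOffset uint32
}

// appendBlob appends a uvarint length followed by the raw bytes and
// returns the grown buffer plus the offset where the blob starts.
func appendBlob(buf, blob []byte) ([]byte, uint32) {
	off := uint32(len(buf))
	buf = binary.AppendUvarint(buf, uint64(len(blob)))
	return append(buf, blob...), off
}

// readBlob decodes the length-prefixed blob that starts at off.
func readBlob(buf []byte, off uint32) ([]byte, error) {
	if uint64(off) > uint64(len(buf)) {
		return nil, fmt.Errorf("offset %d out of range", off)
	}
	n, sz := binary.Uvarint(buf[off:])
	if sz <= 0 {
		return nil, fmt.Errorf("bad length prefix at offset %d", off)
	}
	start := uint64(off) + uint64(sz)
	if n > uint64(len(buf))-start {
		return nil, fmt.Errorf("blob at offset %d is truncated", off)
	}
	return buf[start : start+n], nil
}

func main() {
	var buf []byte
	var keyOff, v1Off, v2Off uint32
	buf, keyOff = appendBlob(buf, []byte("balance/alice"))
	buf, v1Off = appendBlob(buf, []byte("100"))
	buf, v2Off = appendBlob(buf, []byte("250"))

	// Both leaves reuse the same key bytes; only the values differ.
	leaves := []leaf{
		{KeyOffset: keyOff, ValueOffset: v1Off},
		{KeyOffset: keyOff, ValueOffset: v2Off},
	}
	for _, l := range leaves {
		k, _ := readBlob(buf, l.KeyOffset)
		v, _ := readBlob(buf, l.ValueOffset)
		fmt.Printf("%s = %s\n", k, v)
	}
}
```

If key and value had to be contiguous, the second leaf would need its own copy of the key bytes.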

"io"
"os"
)
import "github.com/edsrzf/mmap-go"
Member Author

For now I am using this off-the-shelf mmap wrapper which has the highest number of known importers on pkg.go.dev: https://pkg.go.dev/github.com/edsrzf/mmap-go?tab=importedby

In the future, it may be worth considering creating our own mmap wrapper. On Linux, it may be possible to apply an optimization where we can resize the mmap without unmapping memory: https://stackoverflow.com/questions/74243583/memory-map-file-with-growing-size

@aaronc aaronc marked this pull request as ready for review December 5, 2025 19:11
@github-actions
Contributor

github-actions bot commented Dec 5, 2025

@aaronc your pull request is missing a changelog!

@aljo242
Contributor

aljo242 commented Dec 5, 2025

@aaronc a few more linter complaints

if err != nil {
	return fmt.Errorf("failed to read cached key offset at %d: %w", wr.offset, err)
}
wr.offset += 4
Contributor

why 4?

Member Author

Because we've just read a 4-byte uint32
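
One way to make the 4 self-explanatory is to tie the increment to the size of the value just decoded. This is only a sketch: `cursor` and `readUint32` are hypothetical names, not the reader type in this PR, and little-endian byte order is an assumption.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// cursor is a hypothetical reader over an in-memory byte slice.
type cursor struct {
	data   []byte
	offset int
}

// readUint32 decodes a fixed-width uint32 at the current offset and
// advances by exactly the number of bytes it consumed.
// Assumption: values are stored little-endian.
func (c *cursor) readUint32() (uint32, error) {
	const size = 4 // a uint32 occupies 4 bytes
	if c.offset < 0 || len(c.data)-c.offset < size {
		return 0, fmt.Errorf("failed to read uint32 at %d: short buffer", c.offset)
	}
	v := binary.LittleEndian.Uint32(c.data[c.offset:])
	c.offset += size
	return v, nil
}

func main() {
	c := &cursor{data: []byte{0x2a, 0x00, 0x00, 0x00}}
	v, err := c.readUint32()
	fmt.Println(v, c.offset, err)
}
```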

Comment on lines +29 to +38
// KVEntryKeyBlob indicates a standalone key data entry.
// This should be followed by varint length + raw bytes.
// Used for compacted (non-WAL) leaf or branch keys not already cached.
KVEntryKeyBlob KVEntryType = 0x4

// KVEntryValueBlob indicates a standalone value data entry.
// This should be followed by varint length + raw bytes.
// Used for compacted (non-WAL) leaf values.
// The main difference between KVEntryKeyBlob and KVEntryValueBlob is that key
// entries may be cached for faster access, while value entries are not cached.
Contributor

can you explain more about the caching here? this wasn't in the previous version was it?

Member Author

@aaronc aaronc Dec 19, 2025

Caching wasn't in the previous version, no. Basically, in looking at data files, the kv data files are much larger than any of the other files, and my theory is that we're storing lots of duplicate key data. There are likely lots of storage locations in the database which are written to repeatedly with the same key. Also, branch nodes always share keys with some leaf nodes. So introducing this caching is a very simple form of data compression that will hopefully lead to some reduction in storage. There are other forms of compression we could consider, but this seems pretty straightforward and likely to have some payoff, and all it costs us is a little extra memory to maintain the cache while we're writing a file. My suggestion would be to try this and compare data file sizes between the current version and this one.
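
Roughly, the idea can be sketched as a writer that remembers where each key was written and hands back the existing offset on a repeat. `keyCachingWriter` and `writeKey` are hypothetical names, not the KVDataWriter API. The entry type value 0x4 comes from the diff above; using an unsigned varint for the length and returning the offset of the type byte are assumptions.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// kvEntryKeyBlob mirrors KVEntryKeyBlob (0x4) from the diff.
const kvEntryKeyBlob byte = 0x4

// keyCachingWriter is a hypothetical in-memory writer that deduplicates keys.
type keyCachingWriter struct {
	buf   bytes.Buffer
	cache map[string]uint32 // key bytes -> offset of the existing entry
}

func newKeyCachingWriter() *keyCachingWriter {
	return &keyCachingWriter{cache: make(map[string]uint32)}
}

// writeKey returns the offset of the key's blob entry. The key bytes are
// written only the first time a key is seen; repeats cost no file space.
// Assumption: the returned offset points at the entry type byte.
func (w *keyCachingWriter) writeKey(key []byte) uint32 {
	if off, ok := w.cache[string(key)]; ok {
		return off
	}
	off := uint32(w.buf.Len())
	w.buf.WriteByte(kvEntryKeyBlob)
	var lenBuf [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(lenBuf[:], uint64(len(key)))
	w.buf.Write(lenBuf[:n])
	w.buf.Write(key)
	w.cache[string(key)] = off
	return off
}

func main() {
	w := newKeyCachingWriter()
	first := w.writeKey([]byte("balance/alice"))
	sizeAfterFirst := w.buf.Len()
	second := w.writeKey([]byte("balance/alice"))

	fmt.Println("same offset:", first == second)
	fmt.Println("file grew on repeat:", w.buf.Len() != sizeAfterFirst)
}
```

The memory cost is one map entry per distinct key for the lifetime of the file being written, which is the trade-off described above.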

@technicallyty
Contributor

re: discussion for reviewers

i think i'd prefer safety over allowing larger kv. we just need to make sure this design is known in a document somewhere

@aaronc
Member Author

aaronc commented Jan 5, 2026

> re: discussion for reviewers
>
> i think i'd prefer safety over allowing larger kv. we just need to make sure this design is known in a document somewhere

So would you prefer that I update this PR to error when key or value size exceed the proposed limits (2^16-1 or 64kb for keys and 2^24-1 or 16mb for values)? I can make sure that this is in go doc comments, but there probably should be some document higher up that states this too - not sure where that should be.

@technicallyty
Contributor

> > re: discussion for reviewers
> > i think i'd prefer safety over allowing larger kv. we just need to make sure this design is known in a document somewhere
>
> So would you prefer that I update this PR to error when key or value size exceed the proposed limits (2^16-1 or 64kb for keys and 2^24-1 or 16mb for values)? I can make sure that this is in go doc comments, but there probably should be some document higher up that states this too - not sure where that should be.

Yes, i think that is the right direction.

> not sure where that should be.

Whenever we add the readme back in, we could probably have a section on limits or key/values in general
