Commit a50376f

feat: add iavlx
1 parent d081460 commit a50376f

36 files changed (+5251, -0 lines)

iavl/README.md

Lines changed: 147 additions & 0 deletions
@@ -0,0 +1,147 @@
# iavl

## Code Organization

### Node Types, Memory & Disk Layouts

Much of this code was influenced by memiavl, and some of it was copied directly from it.
The `NodeID` design is mainly from iavl/v2.
The `NodePointer` design introduces the possibility of doing node eviction similar to iavl/v2,
but with non-blocking thread safety using `atomic.Pointer` so that eviction can happen in the background without
blocking reads or writes.

* `node.go`: the `Node` interface, which all three node types implement (`MemNode`, `BranchPersisted`, `LeafPersisted`)
* `mem_node.go`: the in-memory node structure; new nodes always use the `MemNode` type
* `node_pointer.go`: all child references are wrapped in `NodePointer`, which can point to an in-memory node, an
  on-disk node, or both (if the node has been written and not evicted)
* `node_id.go`: defines `NodeID` (version + index + leaf) and `NodeRef` (either a `NodeID` or a node offset in the
  changeset file)
* `branch_layout.go`: defines the on-disk layout for branch nodes
* `leaf_layout.go`: defines the on-disk layout for leaf nodes
* `branch_persisted.go`: a wrapper around `BranchLayout` which implements the `Node` interface and also tracks a store
  reference
* `leaf_persisted.go`: a wrapper around `LeafLayout` which implements the `Node` interface and also tracks a store
  reference
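
As a rough illustration of the `NodePointer` idea described above, the following sketch shows how `atomic.Pointer` lets a background evictor drop the in-memory copy of a node without locking out readers. Everything here (`memNode`, `nodePointer`, `written`, `diskKey`, `resolve`, `evict`) is hypothetical and much simpler than the real types; it assumes Go 1.19+ for the typed atomics.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// memNode is a stand-in for an in-memory node (hypothetical, not the real MemNode).
type memNode struct{ key string }

// nodePointer is a simplified, hypothetical NodePointer: it may reference an
// in-memory node, an on-disk record, or both.
type nodePointer struct {
	mem     atomic.Pointer[memNode] // nil once the in-memory copy is evicted
	written atomic.Bool             // set once the node has been written to disk
	diskKey string                  // stand-in for the on-disk record
}

// resolve prefers the in-memory node and falls back to the "disk" copy.
func (p *nodePointer) resolve() (*memNode, error) {
	if n := p.mem.Load(); n != nil {
		return n, nil
	}
	if !p.written.Load() {
		return nil, errors.New("node is neither in memory nor on disk")
	}
	// A real implementation would read the node from a changeset here.
	return &memNode{key: p.diskKey}, nil
}

// evict drops the in-memory copy, but only if the node is already on disk.
// Readers that loaded the pointer earlier keep a valid node.
func (p *nodePointer) evict() bool {
	if !p.written.Load() {
		return false
	}
	p.mem.Store(nil)
	return true
}

func main() {
	p := &nodePointer{diskKey: "a"}
	p.mem.Store(&memNode{key: "a"})

	fmt.Println("evicted before write:", p.evict())
	p.written.Store(true)
	fmt.Println("evicted after write:", p.evict())

	n, err := p.resolve()
	if err != nil {
		fmt.Println("resolve failed:", err)
		return
	}
	fmt.Println("resolved from disk:", n.key)
}
```

Because readers only ever perform an atomic load, eviction never blocks them; the cost is that a reader may occasionally re-read a node from disk right after it was evicted.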

### Tree Management & Updating

For managing tree state, we define two core types: `Tree` and `CommitTree`.
We read from and apply updates to `Tree`s directly, but these updates only affect the persistent state of the tree if
they are applied and committed to a `CommitTree`.

* `tree.go`: a `Tree` struct which implements the Cosmos SDK `KVStore` interface and implements the key methods (get,
  set, delete, commit, etc.). `Tree`s can be mutated, and changes can either be committed or discarded. This is
  essentially an in-memory reference to a tree at a specific version that could be used read-only or mutated ad hoc
  without affecting the underlying persistent tree (for instance in `CheckTx`).
* `commit_tree.go`: defines the `CommitTree` structure which manages the persistent tree state. Using `CommitTree` you
  can create a new mutable `Tree` instance using `Branch` and decide to `Apply` its changes to the persistent tree or
  discard them. Calling `Commit` flushes changes to the underlying `TreeStore`, which does all of the on-disk state
  management and cleanup. In `CommitTree` we also have an (optional) asynchronous WAL writing process and maintain a
  background eviction process.
* `update.go`: types for batching changes which can later be committed or discarded
* `node_update.go`: the code for setting and deleting nodes and doing tree rebalancing, adapted from memiavl and
  iavl/v1
* `node_hash.go`: code for computing node hashes, adapted from memiavl and iavl/v1
* `iterator.go`: implements the Cosmos SDK `Iterator` interface, adapted from memiavl and iavl/v1
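
The branch/apply/discard flow described above can be sketched with plain maps standing in for the actual trees. This is only a toy model under stated assumptions: the lowercase names (`commitTree`, `tree`, `branch`, `apply`, `commit`) mirror the README's `Branch`, `Apply`, and `Commit` but are not the real API, and nothing here touches disk.

```go
package main

import "fmt"

// commitTree is a toy stand-in for CommitTree: it owns the persistent state.
type commitTree struct {
	version uint64
	state   map[string]string
}

// tree is a toy stand-in for Tree: a mutable view layered over a base state.
type tree struct {
	base    map[string]string
	pending map[string]*string // a nil value marks a pending delete
}

func newCommitTree() *commitTree {
	return &commitTree{state: map[string]string{}}
}

// branch creates a mutable view; writes to it do not touch the base state.
func (c *commitTree) branch() *tree {
	return &tree{base: c.state, pending: map[string]*string{}}
}

func (t *tree) set(k, v string) { t.pending[k] = &v }
func (t *tree) del(k string)    { t.pending[k] = nil }

// get consults pending changes first, then the base state.
func (t *tree) get(k string) (string, bool) {
	if v, ok := t.pending[k]; ok {
		if v == nil {
			return "", false
		}
		return *v, true
	}
	v, ok := t.base[k]
	return v, ok
}

// apply folds a branch's pending changes into the persistent state.
func (c *commitTree) apply(t *tree) {
	for k, v := range t.pending {
		if v == nil {
			delete(c.state, k)
		} else {
			c.state[k] = *v
		}
	}
}

// commit bumps the version; the real Commit would flush to the TreeStore.
func (c *commitTree) commit() uint64 {
	c.version++
	return c.version
}

func main() {
	ct := newCommitTree()

	applied := ct.branch()
	applied.set("a", "1")
	ct.apply(applied)
	fmt.Println("committed version:", ct.commit())

	discarded := ct.branch() // e.g. a CheckTx-style scratch tree
	discarded.del("a")
	_, inBranch := discarded.get("a")
	_, inBase := ct.state["a"]
	fmt.Println("in branch:", inBranch, "in base:", inBase)
}
```

Dropping the `discarded` value without calling `apply` is all it takes to discard its changes in this model.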

### Disk State Management

#### Central Coordination

These files are the central core of managing on-disk state across multiple changesets which may be in the process of
being written or compacted. **This is the most complex part of the codebase.**

* `tree_store.go`: code for dispatching read operations to the correct changeset, writing commits to new changesets,
  and coordinating background compaction and cleanup of old changesets
* `cleanup.go`: the actual background cleanup and compaction thread

#### Changeset Reading, Writing and Compaction

* `changeset_files.go`: `ChangesetFiles` represents the five files which make up a changeset:
    * `kv.log`: all of the key/value pairs in the changeset, and optionally the write-ahead log for replay (this is
      configurable)
    * `leaves.dat`: an array of `LeafLayout` structs
    * `branches.dat`: an array of `BranchLayout` structs
    * `versions.dat`: an array of `VersionInfo` structs, one for each version in the changeset
    * `info.dat`: a single `ChangesetInfo` struct which tracks metadata about the changeset, including the range of
      versions it contains and the number of orphaned nodes
* `changeset.go`: the `Changeset` struct wraps mmaps of the five changeset files and provides
  methods for reading nodes from disk and marking them as orphaned. It includes some complex code for safely disposing
  of `Changeset` instances because we need to either 1) reopen the mmap to change its size, or 2) close the
  `Changeset` because it has been compacted and will be deleted. This is managed using pinning, a reference count, and
  atomic booleans to track eviction (the desire to dispose and delete) and disposal (the actual disposal).
* `changeset_writer.go`: code for iteratively writing changesets to disk, node by node, in post-order traversal order.
  Node references can either be by `NodeID` or by offset (offsets have been disabled due to some unresolved bugs)
* `compactor.go`: code for rewriting one or more changesets into a new compacted changeset, skipping
  orphaned nodes and updating offsets as needed (this offset rewrite code is currently buggy and disabled)
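
The README only says that `Changeset` disposal is managed with pinning, a reference count, and atomic booleans for eviction and disposal. One plausible arrangement of those pieces is sketched below; it is an assumption for illustration, not the code in `changeset.go`, and all names (`changeset`, `tryPin`, `unpin`, `evict`, `dispose`) are hypothetical. It assumes Go 1.19+ for `atomic.Int32` and `atomic.Bool`.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// changeset is a hypothetical stand-in that tracks only lifecycle state.
type changeset struct {
	refs     atomic.Int32 // number of readers currently pinning the changeset
	evicted  atomic.Bool  // someone wants this changeset disposed
	disposed atomic.Bool  // resources have actually been released
}

// tryPin registers a reader. It fails once the changeset has been evicted.
func (c *changeset) tryPin() bool {
	if c.evicted.Load() {
		return false
	}
	c.refs.Add(1)
	// Re-check: eviction may have raced with the increment above.
	if c.evicted.Load() {
		c.unpin()
		return false
	}
	return true
}

// unpin releases a reader; the last reader out disposes an evicted changeset.
func (c *changeset) unpin() {
	if c.refs.Add(-1) == 0 && c.evicted.Load() {
		c.dispose()
	}
}

// evict records the desire to dispose and disposes now if nobody is reading.
func (c *changeset) evict() {
	c.evicted.Store(true)
	if c.refs.Load() == 0 {
		c.dispose()
	}
}

// dispose releases resources exactly once, even if called concurrently.
func (c *changeset) dispose() {
	if c.disposed.CompareAndSwap(false, true) {
		// A real implementation would unmap and close (or delete) files here.
	}
}

func main() {
	c := &changeset{}
	fmt.Println("pinned:", c.tryPin())
	c.evict()
	fmt.Println("disposed while pinned:", c.disposed.Load())
	c.unpin()
	fmt.Println("disposed after unpin:", c.disposed.Load())
	fmt.Println("pin after evict:", c.tryPin())
}
```

Separating "evicted" from "disposed" lets the compactor mark a changeset as unwanted immediately while in-flight reads finish against still-valid mappings.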

#### Helpers

* `version_info.go`: defines the on-disk layout for version info records, which track the root node and other metadata
  for each version
* `changeset_info.go`: defines the on-disk layout for the changeset info record, which tracks metadata
  about the entire changeset, including version range and number of orphaned nodes
* `kvlog.go`: code for reading key/value pairs from the `kv.log` file
* `kvlog_writer.go`: code for writing key/value pairs to the `kv.log` file, which can be structured as a write-ahead
  operation log for replay and crash recovery (replay and recovery aren't implemented yet)
* `mmap.go`: the `MmapFile` mmap wrapper
* `writer.go`: `FileWriter` and `StructWriter` wrappers for writing raw bytes and structs to files
* `reader.go`: `StructMap` and `NodeMap` wrappers for representing memory-mapped arrays of structs and nodes

### Multi-tree Management

* `multi_tree.go`: wraps multiple `Tree`s into a `MultiTree` which provides a mutable way to write a tree without
  committing the changes to the persistent tree immediately (can be discarded)
* `commit_multi_tree.go`: wraps multiple `CommitTree`s into a `CommitMultiTree` which provides a way to create mutable
  `MultiTree`s and commit their changes to the underlying persistent trees (or discard them). This can eventually
  implement `RootMultiStore` and replace the SDK's store package. `CommitMultiTree` makes the optimization of running
  `Commit` in parallel across all `CommitTree`s which could improve performance.

### Options

Options are maintained by the `Options` struct in `options.go`. Many options have a getter which uses a default value if
the option is not set.
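
The getter-with-default pattern mentioned above might look roughly like the sketch below. The field type, the zero-means-unset convention, the getter name, and the default of 60 are all assumptions made for illustration; `options.go` may do this differently.

```go
package main

import "fmt"

// defaultMinCompactionSeconds is a made-up default used only in this sketch.
const defaultMinCompactionSeconds = 60

// options is a hypothetical, cut-down version of the Options struct.
type options struct {
	// MinCompactionSeconds is treated as "not set" when zero (an assumption).
	MinCompactionSeconds uint32
}

// getMinCompactionSeconds returns the configured value or the default.
func (o options) getMinCompactionSeconds() uint32 {
	if o.MinCompactionSeconds == 0 {
		return defaultMinCompactionSeconds
	}
	return o.MinCompactionSeconds
}

func main() {
	fmt.Println(options{}.getMinCompactionSeconds())
	fmt.Println(options{MinCompactionSeconds: 5}.getMinCompactionSeconds())
}
```

Routing reads through getters keeps the zero value of the struct usable, so callers can pass an empty options value and still get sensible behavior.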

The main options we're controlling now are:

* `WriteWAL`: whether we write all updates to the kv-log as a replayable write-ahead log (WAL). If this is enabled we
  will fsync the WAL either asynchronously or synchronously (based on the `WalSyncBuffer` option). Enabling WAL could
  actually improve performance because we asynchronously write key/value data in advance of `CommitTree.Commit` being
  called.
* `EvictDepth`: the depth of the tree beyond which we will evict nodes from memory as soon as they are on disk. This is
  the main lever for controlling memory usage. Using more memory could improve performance.
* `RetainVersions`: the number of recent versions to retain when we are compacting. Eventually we also want to enable
  some sort of snapshot-based compaction (retaining full trees every N versions).
* `MinCompactionSeconds`: the minimum number of seconds to wait before starting a new compaction run (note that this
  currently includes the time it takes to compact).
* `CompactWAL`: whether to compact the WAL when we are compacting changesets. In the future, we can distinguish between
  compacting the WAL before our first checkpoint and retaining it after the first checkpoint.
* `ChangesetMaxTarget`: the size of a changeset after which we will roll over to a new changeset for the next version.
* `CompactionMaxTarget`: the target size of a compacted changeset. As long as adding another changeset to the
  compaction keeps the result below this size, we will join multiple changesets into a single compacted changeset.
* `CompactionOrphanRatio`: the ratio of orphaned nodes in a changeset beyond which we will trigger it for early
  compaction (used together with `CompactionOrphanAge`)
* `CompactionOrphanAge`: the average age of orphaned nodes in a changeset beyond which we will trigger it for early
  compaction (used together with `CompactionOrphanRatio`)
* `CompactAfterVersions`: the number of versions after which we will trigger a compaction when any orphans are present,
  measured in versions since the last compaction.
* `ReaderUpdateInterval`: when writing multiple versions to a changeset, the number of versions after which we will open
  the changeset for reading even if it has not been completed, so that readers can access the latest versions sooner and
  flush memory. Set this to a shorter interval to constrain memory usage more tightly, or to a longer one to reduce the
  number of times mmaps are re-opened for reading.

### Utilities

* `dot_graph.go`: code for exporting trees to Graphviz dot graph format for visualization
* `verify.go`: code for verifying tree integrity

### Tests

* `tree_test.go`: the only tests we have so far. These do, however, use property-based testing: we generate
  random operation sets and apply them to both iavlx and iavl/v1 trees. At each step, we confirm that behavior is
  identical, including verification of hashes and verifying that invariants are maintained.
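
The testing approach described above (random operations applied to two implementations, compared at every step) can be sketched in a self-contained way. This is not the code in `tree_test.go`: it uses plain `math/rand` rather than a property-testing library, a hypothetical `sortedStore` in place of iavlx, and a Go map in place of iavl/v1, and it checks only lookups, length, and ordering rather than hashes.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// sortedStore is a toy "system under test": parallel sorted key/value slices.
type sortedStore struct {
	keys, vals []string
}

func (s *sortedStore) find(k string) (int, bool) {
	i := sort.SearchStrings(s.keys, k)
	return i, i < len(s.keys) && s.keys[i] == k
}

func (s *sortedStore) set(k, v string) {
	i, ok := s.find(k)
	if ok {
		s.vals[i] = v
		return
	}
	// Grow by one and shift the tail right to make room at position i.
	s.keys = append(s.keys, "")
	copy(s.keys[i+1:], s.keys[i:])
	s.keys[i] = k
	s.vals = append(s.vals, "")
	copy(s.vals[i+1:], s.vals[i:])
	s.vals[i] = v
}

func (s *sortedStore) del(k string) {
	if i, ok := s.find(k); ok {
		s.keys = append(s.keys[:i], s.keys[i+1:]...)
		s.vals = append(s.vals[:i], s.vals[i+1:]...)
	}
}

func (s *sortedStore) get(k string) (string, bool) {
	if i, ok := s.find(k); ok {
		return s.vals[i], true
	}
	return "", false
}

// checkRandomOps applies random operations to both stores and compares them
// after every step, returning the first divergence it finds.
func checkRandomOps(seed int64, steps int) error {
	rng := rand.New(rand.NewSource(seed))
	sut := &sortedStore{}
	ref := map[string]string{} // reference model
	for step := 0; step < steps; step++ {
		k := fmt.Sprintf("k%02d", rng.Intn(16))
		switch rng.Intn(3) {
		case 0:
			v := fmt.Sprintf("v%d", step)
			sut.set(k, v)
			ref[k] = v
		case 1:
			sut.del(k)
			delete(ref, k)
		}
		got, gotOK := sut.get(k)
		want, wantOK := ref[k]
		if got != want || gotOK != wantOK {
			return fmt.Errorf("step %d: get(%q) = %q,%v; want %q,%v", step, k, got, gotOK, want, wantOK)
		}
		if len(sut.keys) != len(ref) || !sort.StringsAreSorted(sut.keys) {
			return fmt.Errorf("step %d: invariant violated", step)
		}
	}
	return nil
}

func main() {
	for seed := int64(0); seed < 10; seed++ {
		if err := checkRandomOps(seed, 500); err != nil {
			fmt.Println("seed", seed, "failed:", err)
			return
		}
	}
	fmt.Println("all seeds passed")
}
```

Reporting the seed on failure makes a divergence reproducible, which is the main practical benefit of driving the generator from an explicit source.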

iavl/branch_layout.go

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
package iavlx

import (
	"fmt"
	"unsafe"
)

// init panics at startup if the in-memory size of BranchLayout differs from SizeBranch.
func init() {
	if unsafe.Sizeof(BranchLayout{}) != SizeBranch {
		panic(fmt.Sprintf("invalid BranchLayout size: got %d, want %d", unsafe.Sizeof(BranchLayout{}), SizeBranch))
	}
}

const (
	// SizeBranch is the expected size of a BranchLayout record in bytes.
	SizeBranch = 72
)

// BranchLayout is the on-disk layout for branch nodes.
type BranchLayout struct {
	Id            NodeID
	Left          NodeRef
	Right         NodeRef
	KeyOffset     uint32
	Height        uint8
	Size          uint32 // TODO 5 bytes?
	OrphanVersion uint32 // TODO 5 bytes?
	Hash          [32]byte
}

// ID returns the NodeID stored in the layout.
func (b BranchLayout) ID() NodeID {
	return b.Id
}

iavl/branch_persisted.go

Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
package iavlx

import "bytes"

// BranchPersisted wraps a BranchLayout read from a changeset and implements Node.
type BranchPersisted struct {
	store             *Changeset
	selfIdx           uint32
	layout            BranchLayout
	leftPtr, rightPtr *NodePointer
}

func (node *BranchPersisted) ID() NodeID {
	return node.layout.Id
}

func (node *BranchPersisted) Height() uint8 {
	return node.layout.Height
}

func (node *BranchPersisted) IsLeaf() bool {
	return false
}

func (node *BranchPersisted) Size() int64 {
	return int64(node.layout.Size)
}

func (node *BranchPersisted) Version() uint32 {
	return uint32(node.layout.Id.Version())
}

// Key reads the branch key from the changeset's key/value data.
func (node *BranchPersisted) Key() ([]byte, error) {
	return node.store.ReadK(node.layout.Id, node.layout.KeyOffset)
}

// Value always returns nil because branch nodes do not store values.
func (node *BranchPersisted) Value() ([]byte, error) {
	return nil, nil
}

func (node *BranchPersisted) Left() *NodePointer {
	return node.leftPtr
}

func (node *BranchPersisted) Right() *NodePointer {
	return node.rightPtr
}

func (node *BranchPersisted) Hash() []byte {
	return node.layout.Hash[:]
}

func (node *BranchPersisted) SafeHash() []byte {
	return node.layout.Hash[:]
}

// MutateBranch copies this persisted branch into a new MemNode at the given version.
func (node *BranchPersisted) MutateBranch(version uint32) (*MemNode, error) {
	key, err := node.Key()
	if err != nil {
		return nil, err
	}
	memNode := &MemNode{
		height:  node.Height(),
		size:    node.Size(),
		version: version,
		key:     key,
		left:    node.Left(),
		right:   node.Right(),
	}
	return memNode, nil
}

// Get descends into the left or right subtree depending on how key compares
// with this branch's key.
func (node *BranchPersisted) Get(key []byte) (value []byte, index int64, err error) {
	nodeKey, err := node.Key()
	if err != nil {
		return nil, 0, err
	}

	if bytes.Compare(key, nodeKey) < 0 {
		leftNode, err := node.Left().Resolve()
		if err != nil {
			return nil, 0, err
		}

		return leftNode.Get(key)
	}

	rightNode, err := node.Right().Resolve()
	if err != nil {
		return nil, 0, err
	}

	value, index, err = rightNode.Get(key)
	if err != nil {
		return nil, 0, err
	}

	// Shift the index by the left subtree's size (total size minus right size).
	index += node.Size() - rightNode.Size()
	return value, index, nil
}

func (node *BranchPersisted) String() string {
	// TODO: implement me
	panic("implement me")
}

var _ Node = (*BranchPersisted)(nil)
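
The index arithmetic in `Get` can be checked on a small in-memory tree. The sketch below is a simplified model, not the iavlx types: `node` and `leaf` are hypothetical, keys are strings, and the leaf-level behavior (index 0 for a match) is an assumption since the leaf implementation is not shown here. It also assumes, as in `Get` above, that keys smaller than a branch key live in the left subtree and that `size` counts leaves.

```go
package main

import "fmt"

// node is a toy tree node: leaves have no children, branches have two.
type node struct {
	key         string
	value       string
	size        int64 // number of leaves under this node
	left, right *node
}

func leaf(k, v string) *node { return &node{key: k, value: v, size: 1} }

// branch builds a branch whose key is the smallest key in its right subtree.
func branch(key string, l, r *node) *node {
	return &node{key: key, size: l.size + r.size, left: l, right: r}
}

// get returns the value and the leaf's position in key order.
func (n *node) get(key string) (string, int64, bool) {
	if n.left == nil { // leaf
		return n.value, 0, n.key == key
	}
	if key < n.key {
		return n.left.get(key)
	}
	v, idx, ok := n.right.get(key)
	// Everything in the left subtree sorts before the right subtree, so the
	// index shifts by the left size, computed as total minus right.
	return v, idx + n.size - n.right.size, ok
}

func main() {
	// Leaves in key order: a, b, c.
	root := branch("b", leaf("a", "1"), branch("c", leaf("b", "2"), leaf("c", "3")))
	for _, k := range []string{"a", "b", "c"} {
		v, idx, ok := root.get(k)
		fmt.Println(k, v, idx, ok)
	}
}
```

Computing the left size as `size - right.size` avoids resolving the left child at all, which matters when that child would otherwise have to be read from disk.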
