| 1 | +# iavl |
| 2 | + |
| 3 | +## Code Organization |
| 4 | + |
| 5 | +### Node Types, Memory & Disk Layouts |
| 6 | + |
| 7 | +Much of this code was influenced by memiavl and sometimes even copied directly from it. |
| 8 | +The `NodeID` design is mainly from iavl/v2. |
| 9 | +The `NodePointer` design introduces the possibility of doing node eviction similar to iavl/v2, |
| 10 | +but with non-blocking thread safety using `atomic.Pointer` so that eviction can happen in the background without |
| 11 | +blocking reads or writes. |
| 12 | + |
| 13 | +* `node.go`: the `Node` interface which all 3 node types implement (`MemNode`, `BranchPersisted`, `LeafPersisted`) |
| 14 | +* `mem_node.go`: in-memory node structure, new nodes always use the `MemNode` type |
| 15 | +* `node_pointer.go`: all child references are wrapped in `NodePointer` which can point to either an in-memory node or an |
| 16 | + on-disk node, or both (if the node has been written but not yet evicted) |
| 17 | +* `node_id.go`: defines `NodeID` (version + index + leaf) and `NodeRef` (either a `NodeID` or a node offset in the |
| 18 | + changeset file) |
| 19 | +* `branch_layout.go`: defines the on-disk layout for branch nodes |
| 20 | +* `leaf_layout.go`: defines the on-disk layout for leaf nodes |
| 21 | +* `branch_persisted.go`: a wrapper around `BranchLayout` which implements the `Node` interface and also tracks a store |
| 22 | + reference |
| 23 | +* `leaf_persisted.go`: a wrapper around `LeafLayout` which implements the `Node` interface and also tracks a store |
| 24 | + reference |
| 25 | + |
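The non-blocking eviction idea described above can be illustrated with a minimal sketch. This is not the actual `NodePointer` implementation: the `nodePointer` and `memNode` types, the `diskOff` field, and the convention that offset 0 means "not written" are all assumptions made for illustration. It only shows how `atomic.Pointer` lets a background goroutine drop the in-memory copy without taking a lock.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// memNode is a simplified stand-in for an in-memory node (hypothetical).
type memNode struct {
	key, value string
}

// nodePointer is a hypothetical child reference: it may hold an in-memory
// node, an on-disk location, or both.
type nodePointer struct {
	mem     atomic.Pointer[memNode] // nil once the node has been evicted
	diskOff atomic.Int64            // assumed disk location; 0 means "not written"
}

// resolve returns the in-memory node when present and otherwise falls back
// to the supplied loader, which stands in for a disk read.
func (p *nodePointer) resolve(load func(off int64) *memNode) *memNode {
	if n := p.mem.Load(); n != nil {
		return n
	}
	return load(p.diskOff.Load())
}

// evict drops the in-memory copy, but only if the node is already on disk.
// It never blocks concurrent resolve calls.
func (p *nodePointer) evict() bool {
	if p.diskOff.Load() == 0 {
		return false
	}
	p.mem.Store(nil)
	return true
}

func main() {
	p := &nodePointer{}
	p.mem.Store(&memNode{key: "a", value: "1"})
	fmt.Println(p.evict()) // not yet written, so eviction is refused

	p.diskOff.Store(64) // pretend the node was written at offset 64
	fmt.Println(p.evict())

	n := p.resolve(func(off int64) *memNode {
		return &memNode{key: "a", value: "1"} // stand-in for a disk read
	})
	fmt.Println(n.key, n.value)
}
```

A reader that loaded the pointer just before eviction keeps a valid reference, since the garbage collector retains the node until that reader is done with it.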
| 26 | +### Tree Management & Updating |
| 27 | + |
| 28 | +For managing tree state, we define two core types: `Tree` and `CommitTree`. |
| 29 | +We read from and apply updates to `Tree`s directly, but these updates only affect the persistent state of the tree if |
| 30 | +they are applied and committed to a `CommitTree`. |
| 31 | + |
| 32 | +* `tree.go`: a `Tree` struct which implements the Cosmos SDK `KVStore` interface and the key methods (get, set, |
| 33 | + delete, commit, etc.). `Tree`s can be mutated, and changes can either be committed or discarded. |
| 34 | + This is essentially an in-memory reference to a tree at a specific version |
| 35 | + that can be used read-only or mutated ad hoc without affecting |
| 36 | + the underlying persistent tree (for instance, in `CheckTx`). |
| 37 | +* `commit_tree.go`: defines the `CommitTree` structure which manages the persistent tree state. Using `CommitTree` you |
| 38 | + can create new mutable `Tree` instances using `Branch` |
| 39 | + and decide to `Apply` their changes to the persistent tree or discard |
| 40 | + them. Calling `Commit` flushes changes to the underlying `TreeStore` which does all of the on-disk state management |
| 41 | + and cleanup. `CommitTree` also runs an optional asynchronous WAL writing process and maintains a background |
| 42 | + eviction process. |
| 43 | +* `update.go`: types for batching changes which can later be committed or discarded |
| 44 | +* `node_update.go` and : the code for setting and deleting nodes and doing tree rebalancing, adapted from memiavl and |
| 45 | + iavl/v1 |
| 46 | +* `node_hash.go`: code for computing node hashes, adapted from memiavl and iavl/v1 |
| 47 | +* `iterator.go`: implements the Cosmos SDK `Iterator` interface, adapted from memiavl and iavl/v1 |
| 48 | + |
| 49 | +### Disk State Management |
| 50 | + |
| 51 | +#### Central Coordination |
| 52 | + |
| 53 | +These files are the core of managing on-disk state across multiple changesets, which may be in the process of |
| 54 | +being written or compacted. **This is the most complex part of the codebase.** |
| 55 | + |
| 56 | +* `tree_store.go`: code for dispatching read operations to the correct changeset, writing commits to new changesets, |
| 57 | + and coordinating background compaction and cleanup of old changesets |
| 58 | +* `cleanup.go`: the actual background cleanup and compaction thread |
| 59 | + |
| 60 | +#### Changeset Reading, Writing and Compaction |
| 61 | + |
| 62 | +* `changeset_files.go`: `ChangesetFiles` represents the five files which make up a changeset: |
| 63 | + * `kv.log`: all of the key/value pairs in the changeset, and optionally the write-ahead log for replay (this is |
| 64 | + configurable) |
| 65 | + * `leaves.dat`: an array of `LeafLayout` structs |
| 66 | + * `branches.dat`: an array of `BranchLayout` structs |
| 67 | + * `versions.dat`: an array of `VersionInfo` structs, one for each version in the changeset |
| 68 | + * `info.dat`: a single `ChangesetInfo` struct which tracks metadata about the changeset including the range of |
| 69 | + versions |
| 70 | + it contains and the number of orphaned nodes |
| 71 | +* `changeset.go`: the `Changeset` struct wraps mmaps of the five changeset files and provides |
| 72 | + methods for reading nodes from disk and marking them as orphaned. It includes some complex code for safely disposing |
| 73 | + of `Changeset` instances because we need to either 1) reopen the memmap to change its size, or 2) close the |
| 74 | + `Changeset` because it has been compacted and will be deleted. This is managed using pinning, a reference count, and |
| 75 | + atomic booleans to track eviction (the desire to dispose and delete) and disposal (the actual disposal). |
| 76 | +* `changeset_writer.go`: code for iteratively writing changesets to disk node by node in post-order traversal order. |
| 77 | + Node references can either be by |
| 78 | + `NodeID` or offsets (offsets have been disabled due to some unresolved bugs) |
| 79 | +* `compactor.go`: code for rewriting one or more changesets into a new compacted changeset, skipping |
| 80 | + orphaned nodes and updating offsets as needed (this offset rewrite code is currently buggy and disabled) |
| 81 | + |
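The combination of pinning, a reference count, and atomic booleans for eviction and disposal can be sketched as follows. This is a simplified illustration under assumptions, not the code in `changeset.go`: the `pinnedResource` type, its method names, and the `closeFn` callback are hypothetical, and the real implementation also has to handle reopening a memmap at a new size.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// pinnedResource is a hypothetical stand-in for a changeset whose mmaps
// must not be closed while readers are still using them.
type pinnedResource struct {
	refs     atomic.Int32 // number of active pins
	evicted  atomic.Bool  // the desire to dispose
	disposed atomic.Bool  // the actual disposal has happened
	closeFn  func()       // stand-in for unmapping and deleting files
}

// tryPin registers a reader. It returns false when eviction has already
// been requested, in which case the caller must not touch the resource.
func (r *pinnedResource) tryPin() bool {
	r.refs.Add(1)
	if r.evicted.Load() {
		r.unpin()
		return false
	}
	return true
}

// unpin releases a pin; the last reader out performs the disposal if
// eviction was requested in the meantime.
func (r *pinnedResource) unpin() {
	if r.refs.Add(-1) == 0 && r.evicted.Load() {
		r.dispose()
	}
}

// evict requests disposal and disposes immediately if nobody holds a pin.
func (r *pinnedResource) evict() {
	r.evicted.Store(true)
	if r.refs.Load() == 0 {
		r.dispose()
	}
}

// dispose runs closeFn exactly once, even if several goroutines race here.
func (r *pinnedResource) dispose() {
	if r.disposed.CompareAndSwap(false, true) {
		r.closeFn()
	}
}

func main() {
	closes := 0
	r := &pinnedResource{closeFn: func() { closes++ }}

	fmt.Println(r.tryPin()) // a reader pins the resource
	r.evict()               // compaction wants it gone, but a pin is held
	fmt.Println(r.disposed.Load())
	r.unpin() // last reader leaves, disposal happens now
	fmt.Println(r.disposed.Load(), closes)
	fmt.Println(r.tryPin()) // new readers are refused
}
```

Incrementing the count before checking the eviction flag is the design choice that matters here: whichever side observes the other's write last ends up performing the disposal, and the compare-and-swap makes sure it happens only once.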
| 82 | +#### Helpers |
| 83 | + |
| 84 | +* `version_info.go`: defines the on-disk layout for version info records, which track the root node and other metadata |
| 85 | + for |
| 86 | + each version |
| 87 | +* `changeset_info.go`: defines the on-disk layout for the changeset info record, which tracks metadata |
| 88 | + about the entire changeset including version range and number of orphaned nodes |
| 89 | +* `kvlog.go`: code for reading key/value pairs from the `kv.log` file |
| 90 | +* `kvlog_writer.go`: code for writing key/value pairs to the `kv.log` file, which can be structured as a write-ahead |
| 91 | + operation log for replay and crash recovery (replay and recovery aren't implemented yet) |
| 92 | +* `mmap.go`: the `MmapFile` mem-map wrapper |
| 93 | +* `writer.go`: `FileWriter` and `StructWriter` wrappers for writing raw bytes and structs to files |
| 94 | +* `reader.go`: `StructMap` and `NodeMap` wrappers for representing memory-mapped arrays of structs and nodes |
| 95 | + |
| 96 | +### Multi-tree Management |
| 97 | + |
| 98 | +* `multi_tree.go`: wraps multiple `Tree`s into a `MultiTree` which provides a mutable way to write to the trees without |
| 99 | + committing the changes to the persistent trees immediately (the changes can be discarded) |
| 100 | +* `commit_multi_tree.go`: wraps multiple `CommitTree`s into a `CommitMultiTree` which provides a way to create mutable |
| 101 | + `MultiTree`s and commit their changes to the underlying persistent trees (or discard them). This can eventually |
| 102 | + implement `RootMultiStore` and replace the SDK's store package. `CommitMultiTree` makes the optimization of running |
| 103 | + `Commit` in parallel across all `CommitTree`s which could improve performance. |
| 104 | + |
| 105 | +### Options |
| 106 | + |
| 107 | +Options are maintained by the `Options` struct in `options.go`. Many options have a getter which uses a default value if |
| 108 | +the option is not set. |
| 109 | + |
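The getter-with-default pattern mentioned above might look like the following sketch. The `sketchOptions` type, the field type, the getter name, and the default value of 10 are all assumptions for illustration; the real `Options` struct and its defaults live in `options.go`.

```go
package main

import "fmt"

// defaultRetainVersions is a hypothetical default, not the real one.
const defaultRetainVersions uint32 = 10

// sketchOptions is a hypothetical, cut-down options struct. A zero value
// is treated as "not set".
type sketchOptions struct {
	RetainVersions uint32
}

// GetRetainVersions returns the configured value, or the default when the
// option was left unset.
func (o sketchOptions) GetRetainVersions() uint32 {
	if o.RetainVersions == 0 {
		return defaultRetainVersions
	}
	return o.RetainVersions
}

func main() {
	fmt.Println(sketchOptions{}.GetRetainVersions())
	fmt.Println(sketchOptions{RetainVersions: 3}.GetRetainVersions())
}
```

Treating the zero value as "unset" keeps a zero-value options struct usable, at the cost of making an explicit zero impossible to express for that option.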
| 110 | +The main options we're controlling now are: |
| 111 | + |
| 112 | +* `WriteWAL`: whether we write all updates to the kv-log as a replayable write-ahead log (WAL). If this is enabled we |
| 113 | + will fsync the WAL either asynchronously or synchronously (based on the `WalSyncBuffer` option). Enabling WAL could |
| 114 | + actually improve performance because we asynchronously write key/value data in advance of `CommitTree.Commit` being |
| 115 | + called. |
| 116 | +* `EvictDepth`: the depth of the tree beyond which we will evict nodes from memory as soon as they are on disk. This is |
| 117 | + the main lever for controlling memory usage. Using more memory could improve performance. |
| 118 | +* `RetainVersions`: the number of recent versions to retain when we are compacting. Eventually we also want to enable |
| 119 | + some sort of snapshot-based compaction (retaining full trees every N versions). |
| 120 | +* `MinCompactionSeconds`: the minimum number of seconds to wait before starting a new compaction run (note that this |
| 121 | + currently includes the time it takes to compact). |
| 122 | +* `CompactWAL`: whether to compact the WAL when we are compacting changesets. In the future, we can distinguish between |
| 123 | + compacting the WAL before our first checkpoint and retaining it after the first checkpoint. |
| 124 | +* `ChangesetMaxTarget`: the size of a changeset after which we will roll over to a new changeset for the next version. |
| 125 | +* `CompactionMaxTarget`: the target size of a compacted changeset. As long as adding another changeset to the current |
| 126 | + compaction keeps the result below this size, we will join multiple changesets into a single compacted changeset. |
| 127 | +* `CompactionOrphanRatio`: the ratio of orphaned nodes in a changeset beyond which we will trigger it for early |
| 128 | + compaction (used together with `CompactionOrphanAge`) |
| 129 | +* `CompactionOrphanAge`: the average age of orphaned nodes in a changeset beyond which we will trigger it for early |
| 130 | + compaction (used together with `CompactionOrphanRatio`) |
| 131 | +* `CompactAfterVersions`: the number of versions after which we will trigger a compaction when any orphans are present, |
| 132 | + measured in versions since the last compaction. |
| 133 | +* `ReaderUpdateInterval`: when writing multiple versions to a changeset, the number of versions after which we will open |
| 134 | + the changeset for reading even if it has not been completed, so that readers can access the latest versions sooner and |
| 135 | + flush memory. Set this to a shorter interval if we want to constrain memory usage more tightly and longer if we want |
| 136 | + to reduce the number of times memmaps are re-opened for reading. |
| 137 | + |
| 138 | +### Utilities |
| 139 | + |
| 140 | +* `dot_graph.go`: code for exporting trees to Graphviz dot graph format for visualization |
| 141 | +* `verify.go`: code for verifying tree integrity |
| 142 | + |
| 143 | +### Tests |
| 144 | + |
| 145 | +* `tree_test.go`: the only tests we have so far. These do, however, use property-based testing: we generate |
| 146 | + random operation sets and apply them to both iavlx and iavl/v1 trees. At each step, we confirm that behavior is |
| 147 | + identical, including hashes, and verify that invariants are maintained. |
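The differential testing approach described above can be sketched in miniature. This is not the code in `tree_test.go`: it uses a toy `sortedStore` as the implementation under test and a plain Go map as the reference, with a seeded `math/rand` generator instead of a property-testing library. All names here are hypothetical. It only illustrates the shape of the technique: apply the same random operations to both sides and compare after every step.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// sortedStore is a toy key/value store kept in sorted order (the "system
// under test" in this sketch).
type sortedStore struct {
	keys, vals []string
}

func (s *sortedStore) find(k string) (int, bool) {
	i := sort.SearchStrings(s.keys, k)
	return i, i < len(s.keys) && s.keys[i] == k
}

func (s *sortedStore) Set(k, v string) {
	i, ok := s.find(k)
	if ok {
		s.vals[i] = v
		return
	}
	// Grow by one, then shift the tail right to open a slot at index i.
	s.keys = append(s.keys, "")
	copy(s.keys[i+1:], s.keys[i:])
	s.keys[i] = k
	s.vals = append(s.vals, "")
	copy(s.vals[i+1:], s.vals[i:])
	s.vals[i] = v
}

func (s *sortedStore) Delete(k string) {
	if i, ok := s.find(k); ok {
		s.keys = append(s.keys[:i], s.keys[i+1:]...)
		s.vals = append(s.vals[:i], s.vals[i+1:]...)
	}
}

func (s *sortedStore) Get(k string) (string, bool) {
	if i, ok := s.find(k); ok {
		return s.vals[i], true
	}
	return "", false
}

// checkAgainstReference applies n random operations to both the store and
// a reference map, returning an error at the first divergence.
func checkAgainstReference(seed int64, n int) error {
	rng := rand.New(rand.NewSource(seed))
	store, ref := &sortedStore{}, map[string]string{}
	for step := 0; step < n; step++ {
		k := fmt.Sprintf("k%02d", rng.Intn(20))
		if rng.Intn(3) == 0 {
			store.Delete(k)
			delete(ref, k)
		} else {
			v := fmt.Sprintf("v%d", step)
			store.Set(k, v)
			ref[k] = v
		}
		got, ok := store.Get(k)
		want, wantOK := ref[k]
		if ok != wantOK || got != want {
			return fmt.Errorf("step %d: key %s diverged", step, k)
		}
		// Invariants: same size as the reference, and keys stay sorted.
		if len(store.keys) != len(ref) || !sort.StringsAreSorted(store.keys) {
			return fmt.Errorf("step %d: invariant violated", step)
		}
	}
	return nil
}

func main() {
	for seed := int64(0); seed < 5; seed++ {
		if err := checkAgainstReference(seed, 500); err != nil {
			fmt.Println("FAIL:", err)
			return
		}
	}
	fmt.Println("all seeds passed")
}
```

Fixing the seed makes any failure reproducible, which matters when a divergence only shows up after a long operation sequence.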