Commit 741695d

paraseba and dcherian authored
Design notes towards Icechunk 2.0 (#1151)

* Design notes towards Icechunk 2.0
* Update design-docs/010-notes-towards-an-IC-2.md

  Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
* PR feedback

---------

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
1 parent b66f30b commit 741695d

2 files changed: +313 -0 lines changed

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
# Notes Towards an Icechunk 2.0

This is intended to be a living document, receiving updates as we come up with
concrete plans and design documents.

## Current limitations and issues

This is a list of candidate areas to improve in Icechunk 2.0.

### Slow `ancestry`

This is not a major problem for interactive use, because we don't expect users
to navigate ancestry too deeply. But it becomes more of an issue when it's time
to compute expiration, GC, or storage statistics, because those require
navigating all ancestry.

On repos with many versions, this could take minutes, because the traversal is
strictly sequential.

* See [Single Entry Point Object for Refs and Ancestry](./011-ref-and-ancestry-entry-point.md)

### Cannot `amend` instead of `commit`

Users don't really care about having perfect time travel. If they could, they
would `amend` instead of `commit` very frequently. This would create shorter
histories and waste less space.

* See [Single Entry Point Object for Refs and Ancestry](./011-ref-and-ancestry-entry-point.md)

### Cannot squash

* See previous section

### Cannot set repository state (lock)

It would be nice to flag repos as online, read-only, archived, or other states
still to be decided.

Of course this status cannot be enforced, but it would still be useful. Icechunk
can refuse to write to a read-only repo, or to open an archived one. It
gives users an extra level of protection against bugs.

* See [Single Entry Point Object for Refs and Ancestry](./011-ref-and-ancestry-entry-point.md)

### Cannot distribute dirty sessions

A current limitation is that `Session.fork` can only be called in an empty session
(one with no uncommitted changes). To lift this limitation we need a
double-changeset mechanism: the first change-set tracks the main session, the
second change-set tracks the in-session changes. Merges must only merge
in-session changes.

Alternative thought: on fork, if the session is dirty, create an anonymous
snapshot on storage and use it as the base for the spawned sessions.
That snapshot is then skipped when setting `parent_id`, so it will be GCed eventually.

### Two object store implementations

Can we move to `object_store` only? There are trade-offs, particularly because
`object_store` doesn't support everything we need, and it's slow moving.

To make matters more complicated, we are currently unable to upgrade the AWS SDK
because it's hard to build the wheels with newer versions (there are probably
workarounds, but it's definitely not easy).

### Cannot persist partial sessions

It would be great to be able to persist a session as a form of checkpointing
and to free memory. Also, forked sessions could then be combined
via disk, instead of via network and memory.

This can be achieved using anonymous snapshots.

### Cannot `move` efficiently

We currently don't support an efficient `move` operation to change the
structure of the Zarr hierarchy. To implement this nicely, it would be
good to add a new `moved` operation to our transaction logs and conflict
detection.

### Hard to know repo spec version

Currently there is no real way to know the spec version of a repo. The best
option is to go through every branch and tag, looking at the spec version in
each snapshot file.

* See [Single Entry Point Object for Refs and Ancestry](./011-ref-and-ancestry-entry-point.md)

### Support more complex array updates

Examples: backfills, inserts, and upserts. This would require some
changes to the manifest to make it efficient, as well as upstream
API additions to Zarr.

### Is Icechunk efficient in the presence of many thousands of nodes?

### Config changes are not tracked
Lines changed: 220 additions & 0 deletions
@@ -0,0 +1,220 @@
# Single Entry Point Object for Refs and Ancestry

This design tries to solve multiple issues at once,
see [Notes Towards an Icechunk 2.0](./010-notes-towards-an-IC-2.md).

## Design

We add a new object at the root of the repo file structure: `$ROOT/repo`.
The new object is a flatbuffer with the following structure:

```flatbuffer
// ObjectId12 is assumed to be the 12-byte id struct defined elsewhere
// in the Icechunk flatbuffers schema.

table Tag {
  name: string (required);
  snapshot: ObjectId12 (required);
}

table Branch {
  name: string (required);
  snapshot: ObjectId12 (required);
}

table MetadataItem {
  name: string (required);
  value: [uint8] (required);
}

table SnapshotInfo {
  id: ObjectId12 (required);
  // scalar fields cannot be marked (required) in flatbuffers
  parent_offset: uint32;
  flushed_at: uint64;
  message: string (required);
  metadata: [MetadataItem] (required);
}

enum RepoAvailability : ubyte { Online = 0, ReadOnly, Offline }

table RepoStatus {
  // enums are scalars, so no (required) here either
  availability: RepoAvailability;
  limited_availability_reason: string;
  set_at: uint64;
}

table Repo {

  // sorted by name
  tags: [Tag] (required);

  // sorted by name
  branches: [Branch] (required);

  // sorted by name
  deleted_tags: [string] (required);

  // sorted by id
  snapshots: [SnapshotInfo] (required);

  last_updated_at: uint64;

  status: RepoStatus (required);
  spec_version: string (required);
}

root_type Repo;
```

The `repo` object stores:

* The spec version used on the last write to this repo
* The timestamp for the last update
* A status flag that identifies the availability for the repo:
  online, read-only, offline
* The list of all tag names with their SnapshotId, sorted by name to be able to
  binary search them.
* The list of all branch names with their SnapshotId, sorted by name to be able to
  binary search them.
* The list of deleted tags to reject recreation of a deleted tag.
* The list of all snapshots, sorted by id.
  * Each snapshot includes its parent (as `parent_offset`), and the metadata
    needed for `ancestry`
  * We don't include the manifest files for the snapshot; those will have
    to be removed from the `ancestry` result. Not a big deal, we don't imagine
    anybody using them. If we don't want to break the API, we'll have to implement
    it as a function instead of a field.
  * Each snapshot takes ~ 256 bytes:
    * SnapshotId: 12 bytes
    * Parent offset: 4 bytes
    * Flushed at: 8 bytes
    * Message: 200 bytes
    * Metadata: 30 bytes
  * In memory for 10k snapshots ~ 2.5 MB
  * Total storage on 10k snapshots ~ 12 GB ~ $3/year. Compression will help some.

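The size estimates above can be reproduced with a few lines of arithmetic. This is one possible reading of the numbers, not something the notes state explicitly: it assumes the ~12 GB figure counts every version of the `repo` object written by 10k commits (each one slightly larger than the last), and it assumes a storage price of $0.023 per GB-month, which is a placeholder.

```python
# Per-snapshot field sizes taken from the list above.
FIELD_BYTES = {"id": 12, "parent_offset": 4, "flushed_at": 8, "message": 200, "metadata": 30}
SNAPSHOT_BYTES = 256  # the fields sum to 254; the notes round to ~256
N_SNAPSHOTS = 10_000

# Size of a single `repo` object holding all 10k snapshots.
in_memory_bytes = N_SNAPSHOTS * SNAPSHOT_BYTES

# Assumption: every commit writes a full new `repo` object and old versions
# are retained, so version i holds i snapshots (1 + 2 + ... + N).
cumulative_bytes = SNAPSHOT_BYTES * N_SNAPSHOTS * (N_SNAPSHOTS + 1) // 2

# Assumption: placeholder object storage price, in dollars per GB per month.
PRICE_PER_GB_MONTH = 0.023
yearly_cost = cumulative_bytes / 1e9 * PRICE_PER_GB_MONTH * 12

print(f"single object: {in_memory_bytes / 1e6:.2f} MB")
print(f"all versions:  {cumulative_bytes / 1e9:.2f} GB")
print(f"yearly cost:   ${yearly_cost:.2f}")
```

Under these assumptions the results land in the same range as the figures quoted in the list.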
### Repository open

The `repo` object is fetched, decompressed and kept in memory in the `Repository`.
Together with the `repo` object we keep its etag/generation.

The spec version is checked, and an error is generated when trying to read from
a repo written with a newer spec version.

On `writable_session`, fail if the repo was written with an older spec version,
warning that writing can break forward compatibility. The user can pass a flag
to allow the breakage.

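A minimal sketch of the two version checks, assuming spec versions are dotted integer strings. The function names, the `allow_upgrade` flag and the error type are hypothetical; the notes only say that a flag exists.

```python
class SpecVersionError(Exception):
    """Raised when the repo and library spec versions are incompatible."""


def _parse(version: str) -> tuple[int, ...]:
    # Assumption: versions look like "1" or "2.0".
    return tuple(int(part) for part in version.split("."))


def check_readable(repo_version: str, library_version: str) -> None:
    # Reading a repo written with a newer spec is always an error.
    if _parse(repo_version) > _parse(library_version):
        raise SpecVersionError(
            f"repo uses spec {repo_version}, library supports {library_version}"
        )


def check_writable(repo_version: str, library_version: str, allow_upgrade: bool = False) -> None:
    check_readable(repo_version, library_version)
    # Writing to an older repo may make it unreadable for older clients,
    # so it needs an explicit opt-in.
    if _parse(repo_version) < _parse(library_version) and not allow_upgrade:
        raise SpecVersionError(
            f"writing would upgrade the repo from spec {repo_version} to {library_version}"
        )
```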
### Ancestry

`ancestry` becomes much faster because it only needs to look at the in-memory
`repo` flatbuffer. On each new pull from the iterator, the parent offset of
the current snapshot is used to find the next snapshot.

Notice that `snapshot` objects still have a `parent_id`, but it is no longer
used for ancestry, or currently for anything else.

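A minimal sketch of the iterator, assuming `parent_offset` is an index into the `snapshots` list and that the root snapshot is marked with `None`. Both are assumptions: the notes do not say how the root is encoded, and `SnapshotInfo` here is a simplified stand-in for the flatbuffer table.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass(frozen=True)
class SnapshotInfo:
    id: str
    parent_offset: Optional[int]  # index into the snapshots list, None for the root
    message: str


def ancestry(snapshots: list[SnapshotInfo], start: int) -> Iterator[SnapshotInfo]:
    """Yield the snapshot at `start` and then each of its ancestors."""
    index: Optional[int] = start
    while index is not None:
        snapshot = snapshots[index]
        yield snapshot
        # No storage access: the parent is found by a list lookup.
        index = snapshot.parent_offset


# The list is sorted by id, so offsets do not follow commit order.
snapshots = [
    SnapshotInfo("a1", parent_offset=2, message="third commit"),
    SnapshotInfo("b7", parent_offset=None, message="initial commit"),
    SnapshotInfo("c3", parent_offset=1, message="second commit"),
]
print([s.message for s in ancestry(snapshots, start=0)])
```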
### Commit

The process for commit only changes when it's time to update the branch ref.

* A new `repo` object is generated with:
  * the new snapshot added
  * the updated branch pointing to the new snapshot
* Update the `repo` object conditionally on not being modified since the
  repo was opened
* If successful, commit done
* If condition violation, pull the new `repo` object.
* Verify conditions for the commit:
  * Repo is online
  * Spec version matches
  * If tip of branch has not changed, merge the files and save new file conditionally.
    Iterate as needed. The merge process adds the new snapshot to the file
    fetched from storage
  * If tip of branch has changed, rebase is needed

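The steps above can be sketched as a retry loop around a conditional write. Everything here is illustrative: `InMemoryStore` stands in for an object store with etag-based conditional puts, the `repo` object is a plain dictionary mapping snapshot ids to parent ids rather than the flatbuffer, and all names are hypothetical. For simplicity the sketch also verifies the conditions before the first attempt.

```python
import copy

SPEC_VERSION = "2"  # hypothetical


class ConditionFailed(Exception):
    """The stored object changed since the etag we hold was read."""


class RebaseNeeded(Exception):
    """The branch tip moved, so the session must rebase first."""


class InMemoryStore:
    def __init__(self, repo: dict) -> None:
        self._repo, self._etag = copy.deepcopy(repo), 0

    def get(self) -> tuple[dict, int]:
        return copy.deepcopy(self._repo), self._etag

    def put_if_match(self, repo: dict, etag: int) -> int:
        if etag != self._etag:
            raise ConditionFailed
        self._repo, self._etag = copy.deepcopy(repo), self._etag + 1
        return self._etag


def commit(store, repo, etag, branch, base_tip, new_id, max_attempts=5) -> int:
    for _ in range(max_attempts):
        if repo["status"] != "online":
            raise RuntimeError("repo is not online")
        if repo["spec_version"] != SPEC_VERSION:
            raise RuntimeError("spec version mismatch")
        if repo["branches"][branch] != base_tip:
            raise RebaseNeeded(branch)
        # Merge: add our snapshot and move the branch on the latest copy.
        candidate = copy.deepcopy(repo)
        candidate["snapshots"][new_id] = base_tip
        candidate["branches"][branch] = new_id
        try:
            return store.put_if_match(candidate, etag)
        except ConditionFailed:
            repo, etag = store.get()  # pull the new object and try again
    raise RuntimeError("too many concurrent updates")
```

A commit to another branch between open and commit only costs one extra round trip; a commit to the same branch surfaces as `RebaseNeeded`.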
### Rebase

TODO

### Amend

`amend` requires the same commit process as before, but the new snapshot uses
its parent's parent as snapshot parent in the `repo` file.

Then, find any snapshots in `repo` that have the amended commit as parent and make
them point to the new snapshot. Also find any tags/branches pointing to the amended
commit and make them point to the new snapshot.

Finally, the usual commit process is followed. So `amend` only differs in the
preparation of the snapshot object (different parent) and the `repo` object.

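A minimal sketch of the `repo` rewrite for `amend`, assuming the object is modeled as dictionaries mapping snapshot ids to parent ids and ref names to snapshot ids. The `amend` function and this layout are hypothetical. The amended snapshot is left in place, since the notes do not say whether it is removed here or left for GC.

```python
import copy


def amend(repo: dict, amended_id: str, new_id: str) -> dict:
    """Return a new repo dict where `new_id` replaces `amended_id`."""
    new_repo = copy.deepcopy(repo)
    snapshots = new_repo["snapshots"]

    # The new snapshot takes the amended snapshot's parent as its own.
    snapshots[new_id] = snapshots[amended_id]

    # Children of the amended snapshot now point to the new snapshot.
    for snapshot_id, parent in list(snapshots.items()):
        if parent == amended_id and snapshot_id != new_id:
            snapshots[snapshot_id] = new_id

    # Tags and branches that pointed to the amended snapshot follow it.
    for kind in ("branches", "tags"):
        for name, target in new_repo[kind].items():
            if target == amended_id:
                new_repo[kind][name] = new_id

    return new_repo
```

The result would then go through the same conditional update as a normal commit.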
### Expiration

`expire_snapshots` becomes an operation done only on the `repo` file. In particular,
expiration no longer needs to overwrite snapshots.

Once a new `repo` file is generated in memory, it's written using a conditional
update. Failures are recovered by editing the file again.

Expiration can conflict with ongoing sessions, because it modifies
the `repo` file, but those conflicts are recoverable and can be
handled within `commit` without user intervention.

### GC

Not affected, just faster because ancestry is faster.

Should this change the `last_updated_at` field in `repo`? Maybe we need more fields.

### Storage stats

Not affected, just faster because ancestry is faster.

### Branch create

A simple conditional update on the `repo` file. Failures are retried.
The operation is rejected if the branch already exists.

### Branch delete

* Delete the branch and all snapshots that are only accessible from it.
* Conditional update of `repo` file with retries. Rejected if branch changed.

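Finding the snapshots that are only accessible from the deleted branch is a reachability computation over the remaining refs. A minimal sketch, again assuming a dictionary model of the `repo` object (snapshot id to parent id); the function names are hypothetical and the conditional write is left out.

```python
import copy


def reachable(snapshots: dict, tips) -> set:
    """Collect every snapshot reachable from the given ref tips."""
    seen = set()
    for tip in tips:
        current = tip
        # Stop at the root or when joining an already visited history.
        while current is not None and current not in seen:
            seen.add(current)
            current = snapshots[current]
    return seen


def delete_branch(repo: dict, name: str) -> dict:
    new_repo = copy.deepcopy(repo)
    del new_repo["branches"][name]
    tips = [*new_repo["branches"].values(), *new_repo["tags"].values()]
    keep = reachable(new_repo["snapshots"], tips)
    # Drop snapshots that no remaining branch or tag can reach.
    new_repo["snapshots"] = {
        sid: parent for sid, parent in new_repo["snapshots"].items() if sid in keep
    }
    return new_repo
```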
### Branch reset

* Update the branch pointer. Delete any snapshots only accessible from the old branch tip.
* Conditional update of `repo` file with retries. Rejected if branch changed.

### Change repo status

Updates `repo` conditionally. Conflicts with ongoing commits need to be handled,
rejecting writes if needed.

### What to do with SnapshotInfo in snapshots that are duplicated now

* We keep them there for now

### Store extra info in the `repo` file

* We could put more stuff in here: stats, etc.

## Upgrade from 1.x repos

* TODO

## Trade-offs

* Pros:
  * Fast ancestry, fast expiration
  * Can implement amend, which should enable shorter histories
  * More consistency
* Cons:
  * `repo` object can reach ~ 2.5 MB
  * Need to write this larger object on every commit, including when creating
    the repo.
  * Overhead on commit to write the `repo` object
  * Overhead on rebase to read the larger `repo` object, potentially multiple times.
  * More storage overhead.
  * Ref operations are more complex

## Questions

* Should we move the config to this new object?
* What other fields should we add?
  * Arbitrary user properties on the repo?
  * Global stats?
