Conversation
| } | ||
|
|
||
| impl ManifestSplittingConfig { | ||
| pub fn get_split_sizes(&self, node: &NodeSnapshot) -> SessionResult<ManifestSplits> { |
There was a problem hiding this comment.
Only the indent has changed
d5398aa to
e5f3d9e
Compare
e5f3d9e to
52a6e4c
Compare
icechunk/src/change_set.rs
Outdated
| &self, | ||
| node_id: &NodeId, | ||
| node_path: &Path, | ||
| extent: Option<ManifestExtents>, |
There was a problem hiding this comment.
I will have to learn some lifetime things to make this Option<&ManifestExtents>.
EDIT: haven't tackled this yet
| change_set: ChangeSet, | ||
| default_commit_metadata: SnapshotProperties, | ||
| // This is an optimization so that we needn't figure out the split sizes on every set. | ||
| splits: HashMap<NodeId, ManifestSplits>, |
There was a problem hiding this comment.
Stored on Session since we'll need to grab splits for arrays created in a previous session, but being modified in this one & the change_set doesn't have enough info (it receives NodeId, not NodeSnapshot IIRC)
icechunk/src/session.rs
Outdated
| // Q: What happens if we set a chunk, then change a dimension name, so | ||
| // that the split changes. | ||
| // A: We ignore it. splits are set once for a node in a session, and are never changed. | ||
| // self.cache_splits(&node.id, path, &shape, &dimension_names); |
There was a problem hiding this comment.
We could be stricter and error here if the splits change.
icechunk/src/session.rs
Outdated
| snapshot_id: &'a SnapshotId, | ||
| change_set: &'a ChangeSet, | ||
| node: NodeSnapshot, | ||
| extent: Option<ManifestExtents>, |
There was a problem hiding this comment.
Important: All iterators have gain Option<ManifestExtents> to do some filtering to extents=. This is where I'll need to add lifetimes to switch to Option<&ManifestExtents>
There was a problem hiding this comment.
I wonder if for all these cases, instead of using Option it's easier to introduce a (constant) match all ManifestExtents: ManifestExtents::all : ManifestExtents
| ) -> SessionResult<Option<ManifestRef>> { | ||
| let mut from = vec![]; | ||
| let mut to = vec![]; | ||
| let chunks = aggregate_extents(&mut from, &mut to, chunks, |ci| &ci.coord); |
There was a problem hiding this comment.
I had to bring this back.
AFAICT there's no way to update the extents when chunks are deleted without scanning the whole thing. To do so, we'll need to keep track of "how many chunks have chunk index 'i'", decrement that counter when deleting chunks, and then construct a bbox from that histogram when needed. This would be a format change.
| #[allow(clippy::expect_used)] | ||
| let split_sizes = self | ||
| .split_sizes | ||
| pub fn get_split_sizes( |
There was a problem hiding this comment.
no logic changes, just the signature. Arguably, it should've been this way earlier 🤦🏾♂️
| Ok(asset_manager.fetch_manifest(manifest_id, manifest_info.size_bytes).await?) | ||
| } | ||
|
|
||
| /// Map the iterator to accumulate the extents of the chunks traversed |
paraseba
left a comment
There was a problem hiding this comment.
I made a first pass to the easy parts. Still need to look into the manifest generation, but maybe we can chat about the easy parts first.
It looks great, made a few minor comments.
One thing I still need to understand is how will you know if a manifest split became empty after chunk deletes. That's probably in the parts I haven't read yet.
icechunk/src/format/manifest.rs
Outdated
| let ranges = std::iter::zip(self.iter(), other.iter()) | ||
| .map(|(a, b)| max(a.start, b.start)..min(a.end, b.end)) | ||
| .collect::<Vec<_>>(); | ||
| if any(ranges.iter(), |r| r.end < r.start) { None } else { Some(Self(ranges)) } |
There was a problem hiding this comment.
should the condition be r.end <= r.start? Aren't these [) intervals? One empty set gives an empty intersection
There was a problem hiding this comment.
Should we disallow empty ManifestExtents? I thought we were doing that, but apparently not.
There was a problem hiding this comment.
I think it would save us from confusion, yes.
icechunk/src/change_set.rs
Outdated
| splits: &ManifestSplits, | ||
| ) { | ||
| #[allow(clippy::expect_used)] | ||
| let (_, extent) = splits.which_extent_and_index(&coord).expect("logic bug. Trying to set chunk ref but can't find the appropriate split manifest."); |
There was a problem hiding this comment.
I think this is what i meant in my previous comment, isn't the index enough if you saved the SplitManifests in a vector instead of a HashMap?
icechunk/src/session.rs
Outdated
| None, | ||
| } | ||
|
|
||
| /// Important: this is not symmetric. |
There was a problem hiding this comment.
maybe adding it with a different name to the impl would help make the asymmetry more clear?
There was a problem hiding this comment.
Now moved to ManifestExtents.overlap_with(other: &ManifestExtents)
icechunk/src/change_set.rs
Outdated
| // test_cases: increasing size of (multiple) dimensions | ||
| // decreasing size of (multiple) dimensions | ||
| // | ||
| new_splits |
There was a problem hiding this comment.
yeah this is ugly. Since we track extents on the change set, we need to update them for any resize 🤦🏾♂️
6fe675b to
6d52d2c
Compare
| ) | ||
| }); | ||
| for (extent, their_manifest) in other_splits { | ||
| manifests.entry(extent).or_default().extend(their_manifest) |
There was a problem hiding this comment.
this is readable but seems silly for the default case where we can just insert their_manifest.
c775d21 to
ce67942
Compare
| // This map keeps track of any chunk deletes that are | ||
| // outside the domain of the current array shape. This is needed to handle | ||
| // the very unlikely case of multiple resizes in the same session. | ||
| deleted_chunks_outside_bounds: BTreeMap<NodeId, HashSet<ChunkIndices>>, |
There was a problem hiding this comment.
This seems to be the cleanest solution to me.
This commit tried tracking all deleted chunks separately (5af247e) but it's very messy.
ce67942 to
cd750e6
Compare
| - 1 * ((ax > 0) as usize); | ||
| dbg!(&ax, &add); | ||
| total_manifests += add; | ||
| total_manifests += |
There was a problem hiding this comment.
number went down!
50f2a27 to
46d13b6
Compare
900f798 to
b7fb385
Compare
| import functools | ||
|
|
||
|
|
||
| def with_frequency(frequency): |
There was a problem hiding this comment.
this was Claude. looks good!
4de248c to
1020ec1
Compare
1020ec1 to
b179dbc
Compare
2a681ce to
4e68231
Compare
|
Some benchmarks for a repo with 500_000 chunk refs, written to 50 manifests with 10_000 refs each.
After discussion with Seba, I have reverted that async write commit, and this PR should be good to go. I'll address #986 in a new PR. Without async writesWith async writes |
This reverts commit 3b52259.
Closes #604
Closes #332
TODO: