Commit 7a20aec

committed
Add docs EAR-1708
1 parent 3053841 commit 7a20aec

6 files changed

Lines changed: 184 additions & 41 deletions

docs/docs/icechunk-python/performance.md

Lines changed: 114 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -20,13 +20,17 @@ For very large arrays (millions of chunks), these files can get quite large.
2020
By default, Icechunk stores all chunk references in a single manifest file per array.
2121
Requesting even a single chunk requires downloading the entire manifest.
2222
In some cases, this can result in a slow time-to-first-byte or large memory usage.
23+
Similarly, appending a small amount of data to a large array requires
24+
downloading and rewriting the entire manifest.
2325

2426
!!! note
2527

2628
Note that the chunk sizes in the following examples are tiny for demonstration purposes.
2729

28-
To avoid that, Icechunk lets you split the manifest files by specifying a ``ManifestSplittingConfig``.
2930

31+
### Configuring splitting
32+
33+
To solve this issue, Icechunk lets you _split_ the manifest files by specifying a ``ManifestSplittingConfig``.
3034
```python exec="on" session="perf" source="material-block"
3135
import icechunk as ic
3236
from icechunk import ManifestSplitCondition, ManifestSplittingConfig, ManifestSplitDimCondition
@@ -38,14 +42,20 @@ split_config = ManifestSplittingConfig.from_dict(
3842
}
3943
}
4044
)
41-
repo_config = ic.RepositoryConfig(manifest=ic.ManifestConfig(splitting=split_config))
45+
repo_config = ic.RepositoryConfig(
46+
manifest=ic.ManifestConfig(splitting=split_config),
47+
)
4248
```
4349

4450
Then pass the config to `Repository.open` or `Repository.create`:
4551
```python
4652
repo = ic.Repository.open(..., config=repo_config)
4753
```
4854

55+
!!! important
56+
57+
Once you find a splitting configuration you like, remember to persist it on disk using `repo.save_config`.
58+
4959
This particular example splits manifests so that each manifest contains `365 * 24` chunks along the time dimension, and every chunk along every other dimension in a single file.
5060

5161
Options for specifying the arrays whose manifest you want to split are:
@@ -92,3 +102,105 @@ will result in splitting manifests so that each manifest contains (3 longitude c
92102
!!! note
93103

94104
Python dictionaries preserve insertion order, so the first condition encountered takes priority.
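
The first-match-wins rule can be illustrated with a small model in plain Python. This is only a sketch of the ordering behaviour described in the note: the conditions here are ordinary functions rather than Icechunk objects, and `axis_is`, `name_is`, `any_dim`, and `resolve_split_size` are hypothetical names, as are the split sizes.

```python
# Illustrative model only: conditions are plain functions, not Icechunk objects.
def axis_is(n):
    return lambda axis, name: axis == n


def name_is(expected):
    return lambda axis, name: name == expected


def any_dim():
    return lambda axis, name: True


def resolve_split_size(conditions, axis, name):
    """Return the split size of the first condition that matches, else None."""
    # Dicts iterate in insertion order, so earlier entries take priority.
    for matches, split_size in conditions.items():
        if matches(axis, name):
            return split_size
    return None


conditions = {
    name_is("longitude"): 3,
    axis_is(1): 2,
    any_dim(): 1,
}

print(resolve_split_size(conditions, axis=2, name="longitude"))
print(resolve_split_size(conditions, axis=1, name="latitude"))
print(resolve_split_size(conditions, axis=0, name="time"))
```

In this model, moving the catch-all condition to the front of the dictionary would make it match every dimension, so the more specific conditions after it would never apply.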
105+
106+
107+
108+
### Splitting behaviour
109+
110+
By default, Icechunk minimizes the number of chunk refs that are written in a single commit.
111+
112+
Consider this simple example: a 1D array with split size 1 along axis 0.
113+
```python exec="on" session="perf" source="material-block"
114+
import random
115+
116+
import icechunk as ic
117+
from icechunk import (
118+
ManifestSplitCondition,
119+
ManifestSplitDimCondition,
120+
ManifestSplittingConfig,
121+
)
122+
123+
split_config = ManifestSplittingConfig.from_dict(
124+
{ManifestSplitCondition.AnyArray(): {ManifestSplitDimCondition.Any(): 1}}
125+
)
126+
repo_config = ic.RepositoryConfig(manifest=ic.ManifestConfig(splitting=split_config))
127+
128+
storage = ic.local_filesystem_storage(
129+
f"/tmp/splitting-test/{random.randint(100, 20000)}"
130+
)
131+
# Note any config passed to Repository.create is persisted to disk.
132+
repo = ic.Repository.create(storage, config=repo_config)
133+
```
134+
135+
Create an array
136+
```python exec="on" session="perf" source="material-block"
137+
import zarr
138+
139+
session = repo.writable_session("main")
140+
root = zarr.group(session.store)
141+
name = "array"
142+
array = root.create_array(name=name, shape=(10,), dtype=int, chunks=(1,))
143+
```
144+
145+
Now let's write 5 chunk references
146+
```python exec="on" session="perf" source="material-block"
147+
import numpy as np
148+
149+
array[:5] = np.arange(10, 15)
150+
print(session.status())
151+
```
152+
153+
And commit
154+
```python exec="on" session="perf" source="material-block"
155+
snap = session.commit("Add 5 chunks")
156+
```
157+
158+
Use [`repo.lookup_snapshot`](./reference.md#icechunk.Repository.lookup_snapshot) to examine the manifests associated with a Snapshot
159+
```python exec="on" session="perf" source="material-block"
160+
print(repo.lookup_snapshot(snap).manifests)
161+
```
162+
163+
Let's open the Repository again with a different splitting config --- where 5 chunk references are in a single manifest.
164+
```python exec="on" session="perf" source="material-block"
165+
split_config = ManifestSplittingConfig.from_dict(
166+
{ManifestSplitCondition.AnyArray(): {ManifestSplitDimCondition.Any(): 5}}
167+
)
168+
repo_config = ic.RepositoryConfig(manifest=ic.ManifestConfig(splitting=split_config))
169+
new_repo = ic.Repository.open(storage, config=repo_config)
170+
print(new_repo.config.manifest)
171+
```
172+
173+
Now let's append data.
174+
```python exec="on" session="perf" source="material-block"
175+
session = new_repo.writable_session("main")
176+
array = zarr.open_array(session.store, path=name, mode="a")
177+
array[6:9] = [1, 2, 3]
178+
print(session.status())
179+
```
180+
181+
```python exec="on" session="perf" source="material-block"
182+
snap2 = session.commit("appended data")
183+
print(repo.lookup_snapshot(snap2).manifests)
184+
```
185+
186+
Look carefully: only one new manifest, holding the 3 new chunk refs, has been written.
187+
188+
Why?
189+
190+
Icechunk minimizes how many chunk references are rewritten at each commit (to save time and memory). The previous splitting configuration (split size of 1) results in manifests that are _compatible_ with the current configuration (split size of 5) because the bounding box of every existing manifest `[slice(0, 1), slice(1, 2), ...]` is fully contained in the bounding boxes implied by the new configuration `[slice(0, 5), slice(5, 10)]`.
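
The containment rule described above can be sketched for the 1D case in plain Python. This is a model of the stated rule, not Icechunk's implementation. `extents` and `is_compatible` are hypothetical helpers, and the extents are half-open `(start, stop)` chunk-index ranges.

```python
def extents(n_chunks, split_size):
    """Half-open chunk-index ranges covered by each manifest along one axis."""
    return [
        (start, min(start + split_size, n_chunks))
        for start in range(0, n_chunks, split_size)
    ]


def is_compatible(old, new):
    """True if every old extent lies fully inside some new extent."""
    return all(
        any(new_lo <= lo and hi <= new_hi for new_lo, new_hi in new)
        for lo, hi in old
    )


new = extents(10, 5)   # split size 5: [(0, 5), (5, 10)]
old = extents(10, 1)   # split size 1: [(0, 1), (1, 2), ...]

print(is_compatible(old, new))
# A split size of 3 produces (3, 6), which straddles the boundary at 5.
print(is_compatible(extents(10, 3), new))
```

Under this model, existing manifests written with split size 1 can be left untouched, whereas manifests written with split size 3 would not fit inside the new bounding boxes.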
191+
192+
To force Icechunk to rewrite all chunk refs to the current splitting configuration use [`rewrite_manifests`](./reference.md#icechunk.Repository.rewrite_manifests) --- for the current example this will consolidate to two manifests.
193+
```python exec="on" session="perf" source="material-block"
194+
snap3 = new_repo.rewrite_manifests(
195+
f"rewrite_manifests with new config {str(split_config.to_dict())!r}", branch="main"
196+
)
197+
```
198+
199+
`rewrite_manifests` will create a new commit on `branch` with the provided `message`.
200+
```python
201+
print(repo.lookup_snapshot(snap3).manifests)
202+
```
203+
204+
!!! important
205+
206+
Once you find a splitting configuration you like, remember to persist it on disk using `repo.save_config`.

icechunk-python/python/icechunk/__init__.py

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -195,6 +195,14 @@ def from_dict(split_sizes: SplitSizesDict) -> ManifestSplittingConfig:
195195
return ManifestSplittingConfig(unwrapped)
196196

197197

198+
def to_dict(config: ManifestSplittingConfig) -> SplitSizesDict:
199+
return {
200+
split_condition: dict(dim_conditions)
201+
for split_condition, dim_conditions in config.split_sizes
202+
}
203+
204+
198205
ManifestSplittingConfig.from_dict = from_dict # type: ignore[attr-defined]
206+
ManifestSplittingConfig.to_dict = to_dict # type: ignore[attr-defined]
199207

200208
initialize_logs()
Lines changed: 43 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,43 @@
1+
from collections.abc import Iterable
2+
from typing import cast
3+
4+
import hypothesis.strategies as st
5+
6+
import icechunk as ic
7+
import zarr
8+
from zarr.core.metadata import ArrayV3Metadata
9+
10+
11+
@st.composite
12+
def splitting_configs(
13+
draw: st.DrawFn, *, arrays: Iterable[zarr.Array]
14+
) -> ic.ManifestSplittingConfig:
15+
config_dict = {}
16+
for array in arrays:
17+
if draw(st.booleans()):
18+
array_condition = ic.ManifestSplitCondition.name_matches(
19+
array.path.split("/")[-1]
20+
)
21+
else:
22+
array_condition = ic.ManifestSplitCondition.path_matches(array.path)
23+
dimnames = (
24+
cast(ArrayV3Metadata, array.metadata).dimension_names or (None,) * array.ndim
25+
)
26+
dimsize_axis_names = draw(
27+
st.lists(
28+
st.sampled_from(
29+
tuple(zip(array.shape, range(array.ndim), dimnames, strict=False))
30+
),
31+
min_size=1,
32+
unique=True,
33+
)
34+
)
35+
for size, axis, dimname in dimsize_axis_names:
36+
if dimname is None or draw(st.booleans()):
37+
key = ic.ManifestSplitDimCondition.Axis(axis)
38+
else:
39+
key = ic.ManifestSplitDimCondition.DimensionName(dimname) # type: ignore[assignment]
40+
config_dict[array_condition] = {
41+
key: draw(st.integers(min_value=1, max_value=size + 10))
42+
}
43+
return ic.ManifestSplittingConfig.from_dict(config_dict) # type: ignore[attr-defined, no-any-return]

icechunk-python/src/config.rs

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1204,7 +1204,7 @@ impl PyManifestSplitDimCondition {
12041204
match self {
12051205
Axis(axis) => format!("Axis({})", axis),
12061206
DimensionName(name) => format!(r#"DimensionName("{}")"#, name),
1207-
Any() => "Rest".to_string(),
1207+
Any() => "Any".to_string(),
12081208
}
12091209
}
12101210

icechunk-python/tests/test_manifest_splitting.py

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,18 +6,33 @@
66

77
import numpy as np
88
import pytest
9+
from hypothesis import given
10+
from hypothesis import strategies as st
911

1012
import icechunk as ic
1113
import xarray as xr
1214
import zarr
1315
from icechunk import ManifestSplitCondition, ManifestSplitDimCondition
16+
from icechunk.testing.strategies import splitting_configs
1417
from icechunk.xarray import to_icechunk
18+
from zarr.testing.strategies import arrays as zarr_arrays
1519

1620
SHAPE = (3, 4, 17)
1721
CHUNKS = (1, 1, 1)
1822
DIMS = ("time", "latitude", "longitude")
1923

2024

25+
@given(data=st.data())
26+
def test_splitting_config_dict_roundtrip(data):
27+
arrays = data.draw(
28+
st.lists(
29+
zarr_arrays(compressors=st.none(), attrs=st.none(), zarr_formats=st.just(3))
30+
)
31+
)
32+
config = data.draw(splitting_configs(arrays=arrays))
33+
assert ic.ManifestSplittingConfig.from_dict(config.to_dict()) == config
34+
35+
2136
def test_manifest_splitting_appends():
2237
array_condition = ManifestSplitCondition.or_conditions(
2338
[

icechunk-python/tests/test_zarr/test_stateful.py

Lines changed: 3 additions & 38 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
1+
import functools
12
import json
2-
from collections.abc import Iterable
33
from typing import Any
44

55
import hypothesis.extra.numpy as npst
@@ -18,6 +18,7 @@
1818
import icechunk as ic
1919
import zarr
2020
from icechunk import Repository, Storage, in_memory_storage
21+
from icechunk.testing import strategies as icst
2122
from zarr.core.buffer import default_buffer_prototype
2223
from zarr.testing.stateful import ZarrHierarchyStateMachine
2324
from zarr.testing.strategies import (
@@ -37,9 +38,6 @@
3738
# ]
3839

3940

40-
import functools
41-
42-
4341
def with_frequency(frequency):
4442
"""
4543
Decorator to control how frequently a rule runs in Hypothesis stateful tests.
@@ -97,39 +95,6 @@ def chunk_paths(
9795
return "/".join(map(str, blockidx[subset_slicer]))
9896

9997

100-
@st.composite
101-
def splitting_configs(
102-
draw: st.DrawFn, *, arrays: Iterable[zarr.Array]
103-
) -> ic.ManifestSplittingConfig:
104-
config_dict = {}
105-
for array in arrays:
106-
if draw(st.booleans()):
107-
array_condition = ic.ManifestSplitCondition.name_matches(
108-
array.path.split("/")[-1]
109-
)
110-
else:
111-
array_condition = ic.ManifestSplitCondition.path_matches(array.path)
112-
dimnames = array.metadata.dimension_names or (None,) * array.ndim
113-
dimsize_axis_names = draw(
114-
st.lists(
115-
st.sampled_from(
116-
tuple(zip(array.shape, range(array.ndim), dimnames, strict=False))
117-
),
118-
min_size=1,
119-
unique=True,
120-
)
121-
)
122-
for size, axis, dimname in dimsize_axis_names:
123-
if dimname is None or draw(st.booleans()):
124-
key = ic.ManifestSplitDimCondition.Axis(axis)
125-
else:
126-
key = ic.ManifestSplitDimCondition.DimensionName(dimname)
127-
config_dict[array_condition] = {
128-
key: draw(st.integers(min_value=1, max_value=size + 10))
129-
}
130-
return ic.ManifestSplittingConfig.from_dict(config_dict)
131-
132-
13398
# TODO: more before/after commit invariants?
13499
# TODO: add "/" to self.all_groups, deleting "/" seems to be problematic
135100
class ModifiedZarrHierarchyStateMachine(ZarrHierarchyStateMachine):
@@ -149,7 +114,7 @@ def reopen_with_config(self, data):
149114
st.lists(st.sampled_from(sorted(self.all_arrays)), max_size=3, unique=True)
150115
)
151116
arrays = tuple(zarr.open_array(self.model, path=path) for path in array_paths)
152-
sconfig = data.draw(splitting_configs(arrays=arrays))
117+
sconfig = data.draw(icst.splitting_configs(arrays=arrays))
153118
config = ic.RepositoryConfig(
154119
inline_chunk_threshold_bytes=0, manifest=ic.ManifestConfig(splitting=sconfig)
155120
)
