Skip to content

Commit d414d4b

Browse files
committed
Review feedback
1 parent f757b46 commit d414d4b

File tree

4 files changed

+92
-19
lines changed

4 files changed

+92
-19
lines changed

docs/docs/icechunk-python/performance.md

Lines changed: 62 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -16,9 +16,9 @@ Coming Soon.
1616
## Splitting manifests
1717

1818
Icechunk stores chunk references in a chunk manifest file stored in `manifests/`.
19-
For very large arrays (millions of chunks), these files can get quite large.
2019
By default, Icechunk stores all chunk references in a single manifest file per array.
21-
Requesting even a single chunk requires downloading the entire manifest.
20+
For very large arrays (millions of chunks), these files can get quite large.
21+
Requesting even a single chunk will require downloading the entire manifest.
2222
In some cases, this can result in a slow time-to-first-byte or large memory usage.
2323
Similarly, appending a small amount of data to a large array requires
2424
downloading and rewriting the entire manifest.
@@ -47,7 +47,7 @@ repo_config = ic.RepositoryConfig(
4747
)
4848
```
4949

50-
Then pass the config to `Repository.open` or `Repository.create`
50+
Then pass the `config` to `Repository.open` or `Repository.create`
5151
```python
5252
repo = ic.Repository.open(..., config=repo_config)
5353
```
@@ -206,7 +206,13 @@ This ends up rewriting all refs to two new manifests.
206206

207207
### Rewriting manifests
208208

209-
To force Icechunk to rewrite all chunk refs to the current splitting configuration use [`rewrite_manifests`](./reference.md#icechunk.Repository.rewrite_manifests) --- for the current example this will consolidate to two manifests.
209+
Remember, by default Icechunk only writes one manifest per array regardless of size.
210+
For large enough arrays, you might see a relative performance hit while committing a new update (e.g. an append),
211+
or when reading from a Repository object that was just created.
212+
At that point, you will want to experiment with different manifest split configurations.
213+
214+
To force Icechunk to rewrite all chunk refs to the current splitting configuration use [`rewrite_manifests`](./reference.md#icechunk.Repository.rewrite_manifests)
215+
--- for the current example this will consolidate to two manifests.
210216

211217
To illustrate, we will use a split size of 3.
212218
```python exec="on" session="perf" source="material-block"
@@ -219,7 +225,7 @@ repo_config = ic.RepositoryConfig(
219225
new_repo = ic.Repository.open(storage, config=repo_config)
220226

221227
snap4 = new_repo.rewrite_manifests(
222-
f"rewrite_manifests with new config {str(split_config.to_dict())!r}", branch="main"
228+
f"rewrite_manifests with new config", branch="main"
223229
)
224230
```
225231

@@ -228,6 +234,57 @@ snap4 = new_repo.rewrite_manifests(
228234
print(repo.lookup_snapshot(snap4).manifests)
229235
```
230236

237+
The splitting configuration is saved in the snapshot metadata.
238+
```python exec="on" session="perf" source="material-block"
239+
print(repo.lookup_snapshot(snap4).metadata)
240+
```
241+
231242
!!! important
232243

233244
Once you find a splitting configuration you like, remember to persist it on-disk using `repo.save_config`.
245+
246+
247+
### Example workflow
248+
249+
Here is an example workflow for experimenting with splitting
250+
251+
```python exec="on" session="perf" source="material-block"
252+
# first define a new config
253+
split_config = ManifestSplittingConfig.from_dict(
254+
{ManifestSplitCondition.AnyArray(): {ManifestSplitDimCondition.Any(): 5}}
255+
)
256+
repo_config = ic.RepositoryConfig(
257+
manifest=ic.ManifestConfig(splitting=split_config),
258+
)
259+
# open the repo with the new config.
260+
repo = ic.Repository.open(storage, config=repo_config)
261+
```
262+
263+
We will rewrite the manifests on a different branch
264+
```python exec="on" session="perf" source="material-block"
265+
repo.create_branch("split-experiment-1")
266+
snap = repo.rewrite_manifests(
267+
f"rewrite_manifests with new config", branch="split-experiment-1"
268+
)
269+
```
270+
Now benchmark reads on `main` vs `split-experiment-1`
271+
```python exec="on" session="perf" source="material-block"
272+
store = repo.readonly_session("main").store
273+
store_split = repo.readonly_session("split-experiment-1").store
274+
# ...
275+
```
276+
Assume we decided the configuration on `split-experiment-1` was good.
277+
First we persist that configuration to disk
278+
```python exec="on" session="perf" source="material-block"
279+
repo.save_config()
280+
```
281+
282+
Now point main to the commit with rewritten manifests
283+
```python exec="on" session="perf" source="material-block"
284+
repo.reset_branch("main", repo.lookup_branch("split-experiment-1"))
285+
```
286+
287+
Notice that the persisted config is restored when opening a Repository
288+
```python exec="on" session="perf" source="material-block"
289+
print(ic.Repository.open(storage).config.manifest)
290+
```

icechunk-python/python/icechunk/repository.py

Lines changed: 8 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -638,7 +638,14 @@ def rewrite_manifests(
638638
self, message: str, *, branch: str, metadata: dict[str, Any] | None = None
639639
) -> str:
640640
"""
641-
Rewrite manifests for all arrays and commit to the specified branch.
641+
Rewrite manifests for all arrays.
642+
643+
This method will start a new writable session on the specified branch,
644+
rewrite manifests for all arrays, and then commits with the specifeid ``messsage``
645+
and ``metadata``.
646+
647+
A JSON representation of the currently active splitting configuration will be
648+
stored in the commit's metadata under the key `"splitting_config"`.
642649
643650
Parameters
644651
----------

icechunk-python/src/repository.rs

Lines changed: 1 addition & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,6 @@
11
use std::{
22
borrow::Cow,
33
collections::{BTreeMap, BTreeSet, HashMap, HashSet},
4-
ops::Deref,
54
sync::Arc,
65
};
76

@@ -941,7 +940,7 @@ impl PyRepository {
941940
let result =
942941
pyo3_async_runtimes::tokio::get_runtime().block_on(async move {
943942
let lock = self.0.read().await;
944-
rewrite_manifests(lock.deref(), branch, message, metadata)
943+
rewrite_manifests(&lock, branch, message, metadata)
945944
.await
946945
.map_err(PyIcechunkStoreError::ManifestOpsError)
947946
})?;

icechunk/src/session.rs

Lines changed: 21 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -1,13 +1,3 @@
1-
use std::{
2-
cmp::min,
3-
collections::{HashMap, HashSet},
4-
convert::Infallible,
5-
future::{Future, ready},
6-
ops::Range,
7-
pin::Pin,
8-
sync::Arc,
9-
};
10-
111
use async_stream::try_stream;
122
use bytes::Bytes;
133
use chrono::{DateTime, Utc};
@@ -16,6 +6,16 @@ use futures::{FutureExt, Stream, StreamExt, TryStreamExt, future::Either, stream
166
use itertools::{Itertools as _, enumerate, repeat_n};
177
use regex::bytes::Regex;
188
use serde::{Deserialize, Serialize};
9+
use serde_json::Value;
10+
use std::{
11+
cmp::min,
12+
collections::{BTreeMap, HashMap, HashSet},
13+
convert::Infallible,
14+
future::{Future, ready},
15+
ops::Range,
16+
pin::Pin,
17+
sync::Arc,
18+
};
1919
use thiserror::Error;
2020
use tokio::task::JoinError;
2121
use tracing::{Instrument, debug, info, instrument, trace, warn};
@@ -95,6 +95,8 @@ pub enum SessionErrorKind {
9595
Conflict { expected_parent: Option<SnapshotId>, actual_parent: Option<SnapshotId> },
9696
#[error("cannot rebase snapshot {snapshot} on top of the branch")]
9797
RebaseFailed { snapshot: SnapshotId, conflicts: Vec<Conflict> },
98+
#[error("error in serializing config to JSON")]
99+
JsonSerializationError(#[from] serde_json::Error),
98100
#[error("error in session serialization")]
99101
SerializationError(#[from] rmp_serde::encode::Error),
100102
#[error("error in session deserialization")]
@@ -954,6 +956,8 @@ impl Session {
954956
properties: Option<SnapshotProperties>,
955957
) -> SessionResult<SnapshotId> {
956958
let nodes = self.list_nodes().await?.collect::<Vec<_>>();
959+
// We need to populate the `splits` before calling `commit`.
960+
// In the normal chunk setting workflow, that would've been done by `set_chunk_ref`
957961
for node in nodes.into_iter().flatten() {
958962
if let NodeSnapshot {
959963
id,
@@ -965,7 +969,13 @@ impl Session {
965969
self.get_splits(&id, &path, &shape, &dimension_names);
966970
}
967971
}
968-
self._commit(message, properties, true).await
972+
973+
let splitting_config_serialized =
974+
serde_json::to_value(self.config.manifest().splitting())?;
975+
let mut properties =
976+
properties.unwrap_or_else(|| BTreeMap::<String, Value>::new());
977+
properties.insert("splitting_config".to_string(), splitting_config_serialized);
978+
self._commit(message, Some(properties), true).await
969979
}
970980

971981
#[instrument(skip(self, properties))]

0 commit comments

Comments
 (0)