Commit 5754e00

committed
Merge branch 'main' into empty-commits
* main:
  * Error fast when can_write is false and calling gc + expiration (#1014)
  * Document cold buckets and GCS 429 problem (#1009)
  * Add initial commit id as constant (#1008)
  * Fix typo (#1010)
2 parents 6681b2a + 858fdea commit 5754e00

File tree

7 files changed: +58 −7 lines changed

docs/docs/icechunk-python/performance.md

Lines changed: 33 additions & 3 deletions
````diff
@@ -9,6 +9,36 @@
 
 Icechunk is designed to be cloud native, making it able to take advantage of the horizontal scaling of cloud providers. To learn more, check out [this blog post](https://earthmover.io/blog/exploring-icechunk-scalability) which explores just how well Icechunk can perform when matched with AWS S3.
 
+## Cold buckets and repos
+
+Modern object stores usually reshard their buckets on the fly, based on perceived load. The
+strategies they use are unpublished and hard to discover, but the details matter little; the
+important takeaway is that on new buckets, and even on new repositories, the scalability of the
+object store may not be great from the start. You are expected to ramp up load slowly as you
+write data to the repository.
+
+Once you have applied consistently high write/read load to a repository for a few minutes, the
+object store will usually reshard your bucket, allowing for more load. While this resharding
+happens, different object stores respond in different ways. For example, S3 returns 5xx errors
+with a "SlowDown" indication, while GCS returns 429 responses.
+
+Icechunk helps this process along by retrying failed requests with an exponential backoff. In our
+experience, the default configuration is enough to ingest into a fresh bucket using around 100
+machines. If that is not the case for you, you can tune the retry configuration using
+[StorageRetriesSettings](https://icechunk.io/en/latest/icechunk-python/reference/#icechunk.StorageRetriesSettings).
+
+To learn more about how Icechunk manages object store prefixes, read our
+[blog post](https://earthmover.io/blog/exploring-icechunk-scalability)
+on Icechunk scalability.
+
+!!! warning
+
+    Currently, Icechunk's retry logic during resharding is not
+    [working properly](https://github.com/earth-mover/icechunk/issues/954) on GCS.
+    We have a [pull request open](https://github.com/apache/arrow-rs-object-store/pull/410)
+    against one of Icechunk's dependencies that will solve this.
+    In the meantime, if you get 429 errors from your Google bucket, please lower concurrency and
+    try again, increasing concurrency slowly until the errors disappear.
+
 ## Preloading manifests
 
 Coming Soon.
````

````diff
@@ -65,8 +95,8 @@ Options for specifying the arrays whose manifest you want to split are:
 3. [`ManifestSplitCondition.and_conditions`](./reference.md#icechunk.ManifestSplitCondition.and_conditions) to combine (1), (2), and (4) together; and
 4. [`ManifestSplitCondition.or_conditions`](./reference.md#icechunk.ManifestSplitCondition.or_conditions) to combine (1), (2), and (3) together.
 
-
 `And` and `Or` may be used to combine multiple path and/or name matches. For example,
+
 ```python exec="on" session="perf" source="material-block"
 array_condition = ManifestSplitCondition.or_conditions(
     [
````

````diff
@@ -85,8 +115,8 @@ Options for specifying how to split along a specific axis or dimension are:
 2. [`ManifestSplitDimCondition.DimensionName`](./reference.md#icechunk.ManifestSplitDimCondition.DimensionName) takes a regular expression used to match the dimension names of the array;
 3. [`ManifestSplitDimCondition.Any`](./reference.md#icechunk.ManifestSplitDimCondition.Any) matches any _remaining_ dimension name or axis.
 
-
 For example, for an array with dimensions `time, latitude, longitude`, the following config
+
 ```python exec="on" session="perf" source="material-block"
 from icechunk import ManifestSplitDimCondition
 
````

````diff
@@ -96,8 +126,8 @@ from icechunk import ManifestSplitDimCondition
     ManifestSplitDimCondition.Any(): 1,
 }
 ```
-will result in splitting manifests so that each manifest contains (3 longitude chunks x 2 latitude chunks x 1 time chunk) = 6 chunks per manifest file.
 
+will result in splitting manifests so that each manifest contains (3 longitude chunks x 2 latitude chunks x 1 time chunk) = 6 chunks per manifest file.
 
 !!! note
````
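The new documentation describes retrying throttled object-store requests (S3 "SlowDown" 5xx, GCS 429) with exponential backoff. A minimal sketch of that pattern follows; the names, limits, and jitter strategy here are illustrative, not Icechunk's actual retry implementation:

```python
import random
import time


class SlowDown(Exception):
    """Stand-in for a throttling response from the object store."""


def with_backoff(request, max_tries=5, base_delay=0.01, sleep=time.sleep):
    """Call `request` until it succeeds, backing off exponentially on throttling."""
    for attempt in range(max_tries):
        try:
            return request()
        except SlowDown:
            if attempt == max_tries - 1:
                raise  # retries exhausted: surface the error to the caller
            # Exponential backoff with full jitter: wait 0..base * 2^attempt seconds.
            sleep(random.uniform(0, base_delay * 2**attempt))


# A request that is throttled twice while the bucket reshards, then succeeds.
attempts = []

def flaky_put():
    attempts.append(1)
    if len(attempts) < 3:
        raise SlowDown()
    return "ok"

print(with_backoff(flaky_put, sleep=lambda s: None))  # → ok, after 3 attempts
```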

docs/docs/icechunk-python/virtual.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -120,7 +120,7 @@ Now we can read the dataset from the store using xarray to confirm everything we
 
 ```python
 ds = xr.open_zarr(
-    store,
+    session.store,
     zarr_version=3,
     consolidated=False,
     chunks={},
````

icechunk-python/tests/test_timetravel.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -124,6 +124,7 @@ def test_timetravel() -> None:
         "commit 1",
         "Repository initialized",
     ]
+    assert parents[-1].id == "1CECHNKREP0F1RSTCMT0"
     assert [len(snap.manifests) for snap in parents] == [1, 1, 1, 0]
     assert sorted(parents, key=lambda p: p.written_at) == list(reversed(parents))
     assert len(set([snap.id for snap in parents])) == 4
```

icechunk/src/format/mod.rs

Lines changed: 1 addition & 1 deletion
```diff
@@ -98,7 +98,7 @@ impl<const SIZE: usize, T: FileTypeTag> ObjectId<SIZE, T> {
         Self(buf, PhantomData)
     }
 
-    pub fn new(buf: [u8; SIZE]) -> Self {
+    pub const fn new(buf: [u8; SIZE]) -> Self {
         Self(buf, PhantomData)
     }
 
```

icechunk/src/format/snapshot.rs

Lines changed: 5 additions & 1 deletion
```diff
@@ -340,6 +340,10 @@ static ROOT_OPTIONS: VerifierOptions = VerifierOptions {
 
 impl Snapshot {
     pub const INITIAL_COMMIT_MESSAGE: &'static str = "Repository initialized";
+    pub const INITIAL_SNAPSHOT_ID: SnapshotId = SnapshotId::new([
+        0x0b, 0x1c, 0xc8, 0xd6, 0x78, 0x75, 0x80, 0xf0, 0xe3, 0x3a, 0x65,
+        0x34, // Decodes as 1CECHNKREP0F1RSTCMT0
+    ]);
 
     pub fn from_buffer(buffer: Vec<u8>) -> IcechunkResult<Snapshot> {
         let _ = flatbuffers::root_with_opts::<generated::Snapshot>(
@@ -429,7 +433,7 @@ impl Snapshot {
         let properties = [("__root".to_string(), serde_json::Value::from(true))].into();
         let nodes: Vec<Result<NodeSnapshot, Infallible>> = Vec::new();
         Self::from_iter(
-            None,
+            Some(Self::INITIAL_SNAPSHOT_ID),
             None,
             Self::INITIAL_COMMIT_MESSAGE.to_string(),
             Some(properties),
```
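The comment on the new constant says the 12 bytes decode to the human-readable id `1CECHNKREP0F1RSTCMT0`. The sketch below shows one way that decoding works out, assuming (this is an inference from the comment, not something the diff states) that ids are rendered with the Crockford base32 alphabet, reading the 96 bits MSB-first in 5-bit groups with a zero-padded tail:

```python
# Crockford base32 alphabet: digits plus letters, excluding I, L, O, U.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def crockford_base32(data: bytes) -> str:
    """Encode bytes as Crockford base32, MSB-first, zero-padding the last group."""
    bits = "".join(f"{b:08b}" for b in data)
    bits += "0" * (-len(bits) % 5)  # 96 bits -> 100 bits -> 20 characters
    return "".join(CROCKFORD[int(bits[i : i + 5], 2)] for i in range(0, len(bits), 5))


# The byte array from Snapshot::INITIAL_SNAPSHOT_ID in the diff above.
INITIAL_SNAPSHOT_ID = bytes(
    [0x0B, 0x1C, 0xC8, 0xD6, 0x78, 0x75, 0x80, 0xF0, 0xE3, 0x3A, 0x65, 0x34]
)

print(crockford_base32(INITIAL_SNAPSHOT_ID))  # → 1CECHNKREP0F1RSTCMT0
```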

icechunk/src/ops/gc.rs

Lines changed: 15 additions & 1 deletion
```diff
@@ -14,7 +14,7 @@ use crate::{
     },
     ops::pointed_snapshots,
     refs::{Ref, RefError, delete_branch, delete_tag, list_refs},
-    repository::RepositoryError,
+    repository::{RepositoryError, RepositoryErrorKind},
     storage::{self, DeleteObjectsResult, ListInfo},
 };
 
@@ -160,6 +160,13 @@ pub async fn garbage_collect(
     asset_manager: Arc<AssetManager>,
     config: &GCConfig,
 ) -> GCResult<GCSummary> {
+    if !storage.can_write() {
+        return Err(GCError::Repository(
+            RepositoryErrorKind::ReadonlyStorage("Cannot garbage collect".to_string())
+                .into(),
+        ));
+    }
+
     // TODO: this function could have much more parallelism
     if !config.action_needed() {
         tracing::info!("No action requested");
@@ -543,6 +550,13 @@ pub async fn expire(
     expired_branches: ExpiredRefAction,
     expired_tags: ExpiredRefAction,
 ) -> GCResult<ExpireResult> {
+    if !storage.can_write() {
+        return Err(GCError::Repository(
+            RepositoryErrorKind::ReadonlyStorage("Cannot expire snapshots".to_string())
+                .into(),
+        ));
+    }
+
     let all_refs = stream::iter(list_refs(storage, storage_settings).await?);
     let asset_manager = Arc::clone(&asset_manager.clone());
```
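The gc.rs change above makes destructive maintenance operations fail fast on read-only storage, before any deletions begin, rather than erroring partway through. The same guard pattern in miniature (hypothetical names, not Icechunk's API):

```python
class ReadonlyStorageError(Exception):
    """Raised when a write-requiring operation is attempted on read-only storage."""


class Storage:
    def __init__(self, writable: bool):
        self.writable = writable

    def can_write(self) -> bool:
        return self.writable


def garbage_collect(storage: Storage) -> str:
    # Check writability up front so the caller gets a clear error immediately,
    # instead of a partial deletion followed by a confusing failure.
    if not storage.can_write():
        raise ReadonlyStorageError("Cannot garbage collect")
    return "gc done"
```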

icechunk/src/session.rs

Lines changed: 2 additions & 0 deletions
```diff
@@ -3623,6 +3623,7 @@ mod tests {
         assert_eq!(parents[0].message, "second commit");
         assert_eq!(parents[1].message, "first commit");
         assert_eq!(parents[2].message, Snapshot::INITIAL_COMMIT_MESSAGE);
+        assert_eq!(parents[2].id, Snapshot::INITIAL_SNAPSHOT_ID);
         itertools::assert_equal(
             parents.iter().sorted_by_key(|m| m.flushed_at).rev(),
             parents.iter(),
@@ -3692,6 +3693,7 @@ mod tests {
         assert!(msg == "from 1" || msg == "from 2");
 
         assert_eq!(parents[1].message.as_str(), Snapshot::INITIAL_COMMIT_MESSAGE);
+        assert_eq!(parents[1].id, Snapshot::INITIAL_SNAPSHOT_ID);
         Ok(())
     }
```
