* main:
Error fast when can_write is false and calling gc + expiration (#1014)
Document cold buckets and GCS 429 problem (#1009)
Add initial commit id as constant (#1008)
Fix typo (#1010)
docs/docs/icechunk-python/performance.md
Lines changed: 33 additions & 3 deletions
@@ -9,6 +9,36 @@
Icechunk is designed to be cloud native, making it able to take advantage of the horizontal scaling of cloud providers. To learn more, check out [this blog post](https://earthmover.io/blog/exploring-icechunk-scalability) which explores just how well Icechunk can perform when matched with AWS S3.
+## Cold buckets and repos
+
+Modern object stores usually reshard their buckets on the fly, based on perceived load. The
+strategies they use are not published and are very hard to discover. The details matter less than
+the takeaway: on new buckets, and even on new repositories, the object store may not scale well
+from the start. You are expected to ramp up load slowly as you write data to the repository.
+
+Once you have applied consistently high read/write load to a repository for a few minutes, the object
+store will usually reshard your bucket, allowing for more load. While this resharding happens, different
+object stores can respond in different ways. For example, S3 returns 5xx errors with a "SlowDown"
+indication. GCS returns 429 responses.
+
+Icechunk helps this process by retrying failed requests with exponential backoff. In our
+experience, the default configuration is enough to ingest into a fresh bucket using around 100 machines.
+If this is not the case for you, you can tune the retry configuration using [StorageRetriesSettings](https://icechunk.io/en/latest/icechunk-python/reference/#icechunk.StorageRetriesSettings).
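To make the retry behaviour described above concrete, here is a minimal, generic sketch of retrying with exponential backoff. It is not Icechunk's implementation: `TransientError`, `backoff_delays`, `call_with_retries`, and the parameter names `max_tries`, `initial_backoff_ms`, and `max_backoff_ms` are hypothetical names chosen for this illustration, and the random jitter is an assumption of the sketch. Check the linked `StorageRetriesSettings` reference for the options Icechunk actually exposes.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a throttling response (S3 "SlowDown", GCS 429)."""


def backoff_delays(max_tries, initial_backoff_ms, max_backoff_ms):
    """Upper bound (in ms) of the wait before each retry: doubling, then capped."""
    return [min(initial_backoff_ms * 2**i, max_backoff_ms) for i in range(max_tries - 1)]


def call_with_retries(request, max_tries=5, initial_backoff_ms=100,
                      max_backoff_ms=10_000, sleep=time.sleep):
    """Call request() until it succeeds or max_tries attempts have been made."""
    delays = backoff_delays(max_tries, initial_backoff_ms, max_backoff_ms)
    for attempt in range(max_tries):
        try:
            return request()
        except TransientError:
            if attempt == max_tries - 1:
                raise  # out of attempts: surface the error to the caller
            # Assumption of this sketch: sleep a random fraction of the capped
            # delay so that many workers do not retry in lockstep.
            sleep(random.uniform(0, delays[attempt]) / 1000)
```

With many machines writing to a cold bucket, a larger retry budget and a higher backoff cap give the object store more time to reshard before a writer gives up, at the cost of slower failure reporting.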
+
+To learn more about how Icechunk manages object store prefixes, read our
@@ -85,8 +115,8 @@ Options for specifying how to split along a specific axis or dimension are:
2. [`ManifestSplitDimCondition.DimensionName`](./reference.md#icechunk.ManifestSplitDimCondition.DimensionName) takes a regular expression used to match the dimension names of the array;
3. [`ManifestSplitDimCondition.Any`](./reference.md#icechunk.ManifestSplitDimCondition.Any) matches any _remaining_ dimension name or axis.
-
For example, for an array with dimensions `time, latitude, longitude`, the following config
@@ -96,8 +126,8 @@ from icechunk import ManifestSplitDimCondition
ManifestSplitDimCondition.Any(): 1,
}
```
-will result in splitting manifests so that each manifest contains (3 longitude chunks x 2 latitude chunks x 1 time chunk) = 6 chunks per manifest file.

+will result in splitting manifests so that each manifest contains (3 longitude chunks x 2 latitude chunks x 1 time chunk) = 6 chunks per manifest file.
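The arithmetic in that sentence can be checked with a few lines of plain Python. This is only a sketch: the `split_sizes` dictionary restates the numbers from the sentence rather than reading an Icechunk config object, the chunk-grid shape in `chunk_grid` is a made-up example, and the manifest count assumes the splits tile the chunk grid independently along each dimension.

```python
from math import ceil, prod

# Split sizes per dimension, restated from the sentence above (illustrative only).
split_sizes = {"longitude": 3, "latitude": 2, "time": 1}

# Chunks referenced by one manifest: the product of the per-dimension split sizes.
chunks_per_manifest = prod(split_sizes.values())

# Hypothetical chunk grid: number of chunks along each dimension of the array.
chunk_grid = {"longitude": 6, "latitude": 4, "time": 10}

# Assumed tiling: each dimension is cut into ceil(n_chunks / split_size) pieces.
manifest_count = prod(ceil(chunk_grid[d] / split_sizes[d]) for d in split_sizes)

print(chunks_per_manifest, manifest_count)
```

Smaller split sizes mean more, smaller manifest files; larger ones mean fewer, larger files.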