Add faq answer comparing kerchunk format to icechunk format #818
base: main
Changes from all commits
4997a1f
099f730
ddb479a
d9982db
41c2fa2
51440f9
f88dc7f
b3c6438
fd17d62
1ba159c
| Original file line number | Diff line number | Diff line change | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -117,10 +117,16 @@ I'm glad you asked! We can think of the problem of providing virtualized zarr-li | |||||||||||
|
|
||||||||||||
| The above steps could also be performed using the `kerchunk` library alone, but because (3), (4), (5), and (6) are all performed by the `kerchunk.combine.MultiZarrToZarr` function, and no internal abstractions are exposed, kerchunk's design is much less modular, and the use cases are limited by kerchunk's API surface. | ||||||||||||
|
|
||||||||||||
| ## How do VirtualiZarr and Kerchunk compare? | ||||||||||||
| ## How do the VirtualiZarr and Kerchunk libraries compare? | ||||||||||||
|
|
||||||||||||
| You have a choice between using VirtualiZarr and Kerchunk: VirtualiZarr provides almost all the same features as Kerchunk. | ||||||||||||
|
|
||||||||||||
| !!! note | ||||||||||||
|
|
||||||||||||
| "Kerchunk" is really two things: a python library and an on-disk format for storing virtual references. | ||||||||||||
| This question compares the Kerchunk python library to the VirtualiZarr python library. | ||||||||||||
| For a discussion of the pros and cons of serializing into the Kerchunk references format, see the [next question](#which-format-should-i-save-my-virtual-references-as). | ||||||||||||
|
|
||||||||||||
| Users of Kerchunk may find the following comparison table useful, which shows which features of Kerchunk map on to which features of VirtualiZarr. | ||||||||||||
|
|
||||||||||||
| | Component / Feature | Kerchunk | VirtualiZarr | | ||||||||||||
|
|
@@ -160,6 +166,29 @@ Users of Kerchunk may find the following comparison table useful, which shows wh | |||||||||||
| | Kerchunk reference format as parquet | `df.refs_to_dataframe(out_dict, "combined.parq")`, then read using an `fsspec` `ReferenceFileSystem` mapper | `ds.vz.to_kerchunk('combined.parq', format='parquet')`, then read using an `fsspec` `ReferenceFileSystem` mapper | | ||||||||||||
| | [Icechunk](https://icechunk.io/) store | ❌ | `ds.vz.to_icechunk()`, then read back via xarray (requires zarr-python v3). | | ||||||||||||
|
|
||||||||||||
| ### Which format should I save my virtual references as? | ||||||||||||
|
|
||||||||||||
| VirtualiZarr allows you to write virtual references to a few formats: currently Kerchunk JSON, Kerchunk Parquet, and [Icechunk](https://icechunk.io/en/latest/). | ||||||||||||
|
|
||||||||||||
| Icechunk provides several compelling advantages over either Kerchunk format: | ||||||||||||
|
|
||||||||||||
| - **Ensure referenced data has not changed** - An inherent risk of the virtual references approach is someone could overwrite, update, or delete the referenced archival file between the time when the virtual references were parsed and the time that a user attempts to read the data. With Kerchunk this scenario could lead to incorrect data being returned silently. In Icechunk, the last-modified time of each file is also saved, and checked at read-time. Therefore a user will get a clear error if a file has been touched since the virtual references were created. | ||||||||||||
| - **Transactions** - Icechunk stores are updated via commits, each of which is effectively a single database-like transaction. This helps guarantee consistency of the virtual references you write, by making it impossible for someone reading the data to see a half-written state. | ||||||||||||
|
Half written slightly unclear here, what's being written is the references? |
||||||||||||
| - **Version Control and Time Travel** - Icechunk stores a git-like history of all commits, allowing you to roll back to any previous version, or even create multiple branches and tags. See the [Icechunk docs on Version Control](https://icechunk.io/en/latest/version-control/). | ||||||||||||
| - **Read performance** - Reading data from Icechunk is faster than reading from Kerchunk references. This is because reading from Kerchunk references is done using the fsspec python library, whereas reading data from Icechunk (virtual references or native chunks) uses the Icechunk rust library. For this and a number of other reasons, reading data from Icechunk generally provides a much higher throughput. | ||||||||||||
| - **Incremental overwriting** - VirtualiZarr's `.to_icechunk` API allows you to write to a specific region. This is more difficult to do safely when writing to Kerchunk's format because it would generally require editing part of a single file. | ||||||||||||
|
I found this a bit confusing. safely because I might muck up the kerchunk json file? |
||||||||||||
| - **Mix "native" and virtual chunks** - Icechunk's manifests can store any mixture of virtual chunks and "native" zarr chunks. Kerchunk's formats cannot do this ("inlined" chunks are something separate). | ||||||||||||
| - **Scalability** - Kerchunk JSON does not scale well to a large number of virtual references. Note that Kerchunk Parquet is much more scalable than Kerchunk JSON, but in theory the scalability of Icechunk manifests should be similar to that of Kerchunk Parquet because they both have partitioning (Icechunk calls this ["Manifest Splitting"](https://icechunk.io/en/latest/performance/#splitting-manifests)). | ||||||||||||
|
|
||||||||||||
| However a direct head-to-head comparison of the scalability of these formats has yet to be performed. | ||||||||||||
|
|
||||||||||||
| Conversely, the two Kerchunk formats have some advantages over Icechunk: | ||||||||||||
| - **Standard file formats** - JSON and Parquet are very standard formats, readable by many tools, and JSON is even human-readable. Icechunk uses flatbuffers, which are standardized but not human-readable. | ||||||||||||
|
Comment on lines
+185
to
+186
Member
Suggested change
Gap is needed for this to be rendered as a bulleted list |
||||||||||||
| - **Write latency** - In theory writing a single JSON or writing Parquet to object storage can be done with a smaller number of roundtrips. However this time taken will almost always be negligible compared to the time taken to parse the archival file formats. | ||||||||||||
|
"smaller number of roundtrips" than icechunk manifest? |
||||||||||||
|
|
||||||||||||
| (Note that in theory both formats could generalize to store data which does not use the Zarr data model, but in practice neither has ever been used for this purpose.) | ||||||||||||
|
Member
Suggested change
IMO, this doesn't help answer the question "Which format should I save my virtual references as?" |
||||||||||||
|
|
||||||||||||
| Overall we strongly recommend using Icechunk over the Kerchunk formats, though VirtualiZarr will continue to support writing to both. | ||||||||||||
|
I'd put this at the top rather than the bottom.
Also might be nice to clarify the support. support both in perpetuity? both read and write, or eventually just reading the kerchunk format? |
||||||||||||
|
|
||||||||||||
| ## Development | ||||||||||||
|
|
||||||||||||
|
|
||||||||||||
Possible that on some systems a malicious actor could change the file and leave the last-modified time as is. But there is no defense against this, so it may not be worth mentioning.
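To make the last-modified check discussed in this thread concrete, here is a minimal sketch of the idea. It is not Icechunk's implementation: `VirtualRef`, `make_ref`, `read_ref`, and `demo` are hypothetical names invented for illustration, and a local file's modification time stands in for the last-modified metadata an object store would report. It shows a read failing loudly after the file is overwritten, and also the caveat raised above, where the timestamp is restored and the check passes.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualRef:
    """Hypothetical virtual reference: a byte range inside an archival file."""
    path: str
    offset: int
    length: int
    mtime_ns: int  # last-modified time recorded when the reference was created


def make_ref(path: str, offset: int, length: int) -> VirtualRef:
    # Record the file's current modification time alongside the byte range.
    return VirtualRef(path, offset, length, os.stat(path).st_mtime_ns)


def read_ref(ref: VirtualRef) -> bytes:
    # Refuse to read if the file has been touched since the reference was made.
    if os.stat(ref.path).st_mtime_ns != ref.mtime_ns:
        raise RuntimeError(f"{ref.path} was modified after the reference was created")
    with open(ref.path, "rb") as f:
        f.seek(ref.offset)
        return f.read(ref.length)


def demo() -> tuple[bytes, bool, bytes]:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "archive.bin")
        with open(path, "wb") as f:
            f.write(b"headerCHUNKDATAfooter")
        ref = make_ref(path, offset=6, length=9)
        first = read_ref(ref)

        # Overwrite the file and force a later modification time.
        with open(path, "wb") as f:
            f.write(b"headerOTHERDATAfooter")
        later = ref.mtime_ns + 1_000_000_000
        os.utime(path, ns=(later, later))
        try:
            read_ref(ref)
            detected = False
        except RuntimeError:
            detected = True

        # Caveat: if the original timestamp is restored, the check passes
        # and the changed bytes are returned without any error.
        os.utime(path, ns=(ref.mtime_ns, ref.mtime_ns))
        spoofed = read_ref(ref)
        return first, detected, spoofed
```

Calling `demo()` exercises all three cases in a temporary directory: an unmodified read, a detected modification, and an undetected change with a restored timestamp.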