31 changes: 30 additions & 1 deletion docs/faq.md
@@ -117,10 +117,16 @@ I'm glad you asked! We can think of the problem of providing virtualized zarr-li

The above steps could also be performed using the `kerchunk` library alone, but because (3), (4), (5), and (6) are all performed by the `kerchunk.combine.MultiZarrToZarr` function, and no internal abstractions are exposed, kerchunk's design is much less modular, and the use cases are limited by kerchunk's API surface.

## How do VirtualiZarr and Kerchunk compare?
## How do the VirtualiZarr and Kerchunk libraries compare?

You have a choice between using VirtualiZarr and Kerchunk: VirtualiZarr provides almost all the same features as Kerchunk.

!!! note

    "Kerchunk" is really two things: a Python library and an on-disk format for storing virtual references.
    This question compares the Kerchunk Python library to the VirtualiZarr Python library.
    For a discussion of the pros and cons of serializing into the Kerchunk references format, see the [next question](#which-format-should-i-save-my-virtual-references-as).

Users of Kerchunk may find the following comparison table useful, which shows which features of Kerchunk map on to which features of VirtualiZarr.

| Component / Feature | Kerchunk | VirtualiZarr |
@@ -160,6 +166,29 @@ Users of Kerchunk may find the following comparison table useful, which shows wh
| Kerchunk reference format as parquet | `df.refs_to_dataframe(out_dict, "combined.parq")`, then read using an `fsspec` `ReferenceFileSystem` mapper | `ds.vz.to_kerchunk('combined.parq', format='parquet')`, then read using an `fsspec` `ReferenceFileSystem` mapper |
| [Icechunk](https://icechunk.io/) store | ❌ | `ds.vz.to_icechunk()`, then read back via xarray (requires zarr-python v3). |

### Which format should I save my virtual references as?

VirtualiZarr allows you to write virtual references to a few formats: currently Kerchunk JSON, Kerchunk Parquet, and [Icechunk](https://icechunk.io/en/latest/).
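
For orientation, here is a minimal sketch of what a Kerchunk-style JSON references file contains, using only the Python standard library. It assumes the version 1 layout of the Kerchunk reference spec, in which each chunk key maps to `[url, offset, length]` and plain strings hold inlined content. The array name, bucket path, offsets, and the `byte_range` helper are all hypothetical, chosen purely for illustration.

```python
import json

# Hypothetical references for one array ("air") split into two chunks.
# Assumed layout: Kerchunk reference spec version 1, where a list entry is
# [url, offset, length] and a string entry is inlined content (e.g. metadata).
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "air/0": ["s3://example-bucket/air.nc", 8192, 4096],   # made-up values
        "air/1": ["s3://example-bucket/air.nc", 12288, 4096],  # made-up values
    },
}


def byte_range(refs: dict, key: str) -> tuple:
    """Return (url, start, stop) for a virtual chunk key (illustrative helper)."""
    entry = refs["refs"][key]
    if isinstance(entry, str):
        raise ValueError(f"{key!r} is inlined content, not a virtual reference")
    url, offset, length = entry
    return url, offset, offset + length


# The whole reference set serializes to plain, human-readable JSON text.
serialized = json.dumps(refs, indent=2)
roundtripped = json.loads(serialized)

print(byte_range(roundtripped, "air/0"))
```

Every reference lives in one text document, which is why the format is easy to inspect by hand but, as discussed below, harder to update piecemeal.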

Icechunk provides several compelling advantages over either Kerchunk format:

- **Ensure referenced data has not changed** - An inherent risk of the virtual references approach is that someone could overwrite, update, or delete the referenced archival file between the time the virtual references were parsed and the time a user attempts to read the data. With Kerchunk, this scenario could lead to incorrect data being returned silently. In Icechunk, the last-modified time of each file is also saved and checked at read-time, so a user will get a clear error if a file has been touched since the virtual references were created.

Possible that on some systems a malicious actor could change the file and leave the last modified as is. But there is no defense against this, so may not be worth mentioning.

- **Transactions** - Icechunk stores are updated via commits, each of which is effectively a single database-like transaction. This helps guarantee consistency of the virtual references you write, by making it impossible for someone reading the data to see a half-written state.

Half written slightly unclear here, what's being written is the references?

- **Version Control and Time Travel** - Icechunk stores a git-like history of all commits, allowing you to roll back to any previous version, or even create multiple branches and tags. See the [Icechunk docs on Version Control](https://icechunk.io/en/latest/version-control/).
- **Read performance** - Reading data from Icechunk is generally faster than reading from Kerchunk references. Kerchunk references are read through the fsspec Python library, whereas Icechunk data (virtual references or native chunks) is read through the Icechunk Rust library. For this and a number of other reasons, Icechunk generally provides much higher read throughput.
- **Incremental overwriting** - VirtualiZarr's `.to_icechunk` API allows you to write to a specific region. This is more difficult to do safely when writing to Kerchunk's format because it would generally require editing part of a single file.

I found this a bit confusing. safely because I might muck up the kerchunk json file?

- **Mix "native" and virtual chunks** - Icechunk's manifests can store any mixture of virtual chunks and "native" zarr chunks. Kerchunk's formats cannot do this ("inlined" chunks are something separate).
- **Scalability** - Kerchunk JSON does not scale well to a large number of virtual references. Kerchunk Parquet is much more scalable than Kerchunk JSON, and in theory Icechunk manifests should scale similarly to Kerchunk Parquet because both support partitioning (Icechunk calls this ["Manifest Splitting"](https://icechunk.io/en/latest/performance/#splitting-manifests)).

However, a direct head-to-head comparison of the scalability of these formats has yet to be performed.
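
The read-time check described in the first bullet above can be sketched in a few lines of standard-library Python. This is only an illustration of the idea on a local file, not Icechunk's actual implementation; the `VirtualRef` class and the `make_ref` and `read_ref` helpers are hypothetical names.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualRef:
    """Hypothetical virtual reference: a byte range plus the file's recorded mtime."""
    path: str
    offset: int
    length: int
    mtime_ns: int  # last-modified time captured when the reference was created


def make_ref(path: str, offset: int, length: int) -> VirtualRef:
    return VirtualRef(path, offset, length, os.stat(path).st_mtime_ns)


def read_ref(ref: VirtualRef) -> bytes:
    # Refuse to read if the file was touched after the reference was created.
    if os.stat(ref.path).st_mtime_ns != ref.mtime_ns:
        raise RuntimeError(f"{ref.path} was modified after the reference was created")
    with open(ref.path, "rb") as f:
        f.seek(ref.offset)
        return f.read(ref.length)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "archive.bin")
    with open(path, "wb") as f:
        f.write(b"0123456789")

    ref = make_ref(path, offset=2, length=4)
    first_read = read_ref(ref)  # file untouched, so the read succeeds

    # Simulate someone overwriting the archival file.
    with open(path, "wb") as f:
        f.write(b"ABCDEFGHIJ")
    # Bump the mtime explicitly so the demo does not depend on the
    # filesystem's timestamp resolution.
    new_mtime = ref.mtime_ns + 10_000_000_000
    os.utime(path, ns=(new_mtime, new_mtime))

    try:
        read_ref(ref)
        stale_detected = False
    except RuntimeError:
        stale_detected = True

print(first_read, stale_detected)
```

Without the recorded timestamp, the second read would silently return bytes from the overwritten file, which is the failure mode the bullet attributes to Kerchunk references.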

Conversely, the two Kerchunk formats have some advantages over Icechunk:
- **Standard file formats** - JSON and Parquet are very standard formats, readable by many tools, and JSON is even human-readable. Icechunk uses flatbuffers, which are standardized but not human-readable.
Comment on lines +185 to +186

Suggested change
Conversely, the two Kerchunk formats have some advantages over Icechunk:
- **Standard file formats** - JSON and Parquet are very standard formats, readable by many tools, and JSON is even human-readable. Icechunk uses flatbuffers, which are standardized but not human-readable.
Conversely, the two Kerchunk formats have some advantages over Icechunk:

- **Standard file formats** - JSON and Parquet are very standard formats, readable by many tools, and JSON is even human-readable. Icechunk uses flatbuffers, which are standardized but not human-readable.

Gap is needed for this to be rendered as a bulleted list

- **Write latency** - In theory, writing a single JSON file or writing Parquet to object storage can be done with a smaller number of roundtrips. However, the time taken will almost always be negligible compared to the time taken to parse the archival file formats.

"smaller number of roundtrips" than icechunk manifest?


(Note that in theory both formats could generalize to store data which does not use the Zarr data model, but in practice neither has ever been used for this purpose.)

Suggested change
(Note that in theory both formats could generalize to store data which does not use the Zarr data model, but in practice neither has ever been used for this purpose.)

IMO, this doesn't help answer the question "Which format should I save my virtual references as?"


Overall we strongly recommend using Icechunk over the Kerchunk formats, though VirtualiZarr will continue to support writing to both.

I'd put this at the top rather than the bottom.


Also might be nice to clarify the support. support both in perpetuity? both read and write, or eventually just reading the kerchunk format?


## Development

3 changes: 3 additions & 0 deletions docs/releases.md
@@ -10,6 +10,9 @@

### Documentation

- Added FAQ answer comparing the Kerchunk and Icechunk serialization formats ([#818](https://github.com/zarr-developers/VirtualiZarr/pull/818)).
By [Tom Nicholas](https://github.com/TomNicholas).

### Internal changes

## v2.1.2 (3rd September 2025)