Description
TL;DR - I believe modularizing zarr.js further and stripping out unused/unnecessary features will make maintenance and adoption easier in the future. The key is deciding what the "core" features are and prioritizing design around those as package exports.
I love using zarr-python with zarr.js, but in its current state (even with the /core export) zarr.js is a bulky dependency (~150kb unminified). Although it is pure ESM and can be tree-shaken, the primary issue I've come across is that you can't tree-shake classes. As a result, in applications where I just need ZarrArray.getRawChunk, I still need to "pay the price" for get/getRaw/set/setRaw, which contain all the chunk indexing/slicing logic.
Because of this, I know that it's challenging to convince others to adopt zarr for web applications (cc: @kylebarron). For example, it would be awesome to add a zarr tile loader to loaders.gl, but right now it makes more sense to duplicate parts of zarr.js to decrease the code footprint instead of importing exports from zarr.
Proposal 1: fully tree-shakeable submodules
I've been experimenting with an implementation of the zarr v3 spec (zarrita.js) that is structured so that you can progressively opt into more features (and code) if needed for your application. Right now the zarrita/core module only implements ZarrArray.get_chunk, and the top-level export extends the ZarrArray prototype to add indexing, slicing, and writing to the array (e.g. ZarrArray.set, ZarrArray.get). Writing the v3 module in this way, I've been able to reuse the core/indexing logic in zarr-lite, which is a minimal v2 implementation for reading chunks and optionally array selections.
Core zarr features
Specifically, I've been thinking about what the core features of zarr are. More motivation can be found in this Twitter thread. To me, there are tiers of use that I'd like a zarr implementation to support on the web:
1.) Traverse hierarchy (open nodes of hierarchy, e.g. groups and arrays) + read chunks by key (most minimal set of features useful for the web – getRawChunk)
2.) (1) + read slices of array using array selection (typically how you think of using zarr in python – getRaw)
3.) (1) + (2) + create (write) to hierarchy (generally unused on the web, but most feature "complete")
Right now, zarr.js is (3), but there is no way to get just the behavior of 1 or 2. In zarr-lite I've implemented 1 and 2 as separate exports:
- (1) https://observablehq.com/@manzt/using-zarr-lite
- (2) https://observablehq.com/@manzt/using-zarr-lite-indexing
This works by using Object.defineProperties and extending the ZarrArray prototype with additional methods in the submodule: https://github.com/manzt/zarr-lite/blob/main/src/indexing.js.
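The prototype-extension pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual zarr-lite code: the `registerIndexing` helper, the fixed chunk key "0.0", and the stand-in `getRaw` body are all assumptions made for the example.

```javascript
// core (sketch): the class only knows how to fetch a chunk by key.
class ZarrArray {
  constructor(store) {
    this.store = store;
  }
  async getRawChunk(key) {
    return this.store.get(key);
  }
}

// indexing submodule (sketch): opt-in code that adds methods to the prototype.
// `registerIndexing` is a hypothetical name. The `getRaw` body is a stand-in
// that always reads chunk "0.0"; real slicing logic would live here.
function registerIndexing(ArrayClass) {
  Object.defineProperties(ArrayClass.prototype, {
    getRaw: {
      value: async function getRaw(selection) {
        const chunk = await this.getRawChunk("0.0");
        return { data: chunk, selection };
      },
      writable: true,
      configurable: true,
      enumerable: false,
    },
  });
  return ArrayClass;
}
```

An application that only imports the core never loads the indexing code, while one that calls `registerIndexing(ZarrArray)` gets `getRaw` on every instance.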
Benefits of isolating core exports
The main benefit of isolating modules is a reduction in code size. You can import from the top-level export and get a "batteries included" zarr, but those writing other libraries or wanting to use core functionality can just import the submodules that make sense for their application. This means that the code for decoding chunks from an array won't be duplicated outside of zarr.js and instead folks will just add zarr as a dependency.
I also think this has the added benefit of standardizing how features like #16 could be implemented without creating a cost for "core" library users. Win/win.
Proposal 2: remove/move RawArray and NestedArray from zarr.js
In zarrita.js, I experimented with using a minimal object to represent an in-memory array: { data, shape, stride }. As a result, the arrays returned from ZarrArray.get can be composed into other JS ndarray libraries.
import ndarray from 'ndarray';
// `z` is an already-opened ZarrArray.
const selection = [slice(1, 4), slice(2, 7)];
const { data, shape, stride } = await z.get(selection);
const arr = ndarray(data, shape, stride);
/* perform some array operations */
await z.set(selection, arr); // set using ndarray since arr has properties { data, shape, stride }.
This removes the need for any array class implementation, and zarrita/indexing only needs to implement a set function to fill an out + out_selection from a chunk + chunk_selection: https://github.com/manzt/zarrita.js/blob/main/src/ops.js
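To make the idea concrete, here is a rough sketch of what such a set function could look like for plain { data, shape, stride } objects. It is not the code in zarrita's ops.js: the names `getStrides` and `setFromChunk` are hypothetical, and selections are simplified to [start, stop) pairs with a step of 1.

```javascript
// Row-major (C-order) strides for a given shape, counted in elements.
function getStrides(shape) {
  const stride = new Array(shape.length);
  let step = 1;
  for (let i = shape.length - 1; i >= 0; i--) {
    stride[i] = step;
    step *= shape[i];
  }
  return stride;
}

// Copy the region `chunkSel` of `chunk` into the region `outSel` of `out`.
// Each selection is a list of [start, stop) pairs, one per dimension, and
// both selections are assumed to cover regions of the same extent.
function setFromChunk(out, outSel, chunk, chunkSel) {
  const ndim = out.shape.length;
  const extent = outSel.map(([start, stop]) => stop - start);
  const total = extent.reduce((a, b) => a * b, 1);
  const idx = new Array(ndim).fill(0);
  for (let n = 0; n < total; n++) {
    let o = 0;
    let c = 0;
    for (let d = 0; d < ndim; d++) {
      o += (outSel[d][0] + idx[d]) * out.stride[d];
      c += (chunkSel[d][0] + idx[d]) * chunk.stride[d];
    }
    out.data[o] = chunk.data[c];
    // Advance the multi-index, last dimension fastest.
    for (let d = ndim - 1; d >= 0; d--) {
      if (++idx[d] < extent[d]) break;
      idx[d] = 0;
    }
  }
  return out;
}
```

Because the function only touches `data`, `shape`, and `stride`, any object with those three properties (including an ndarray instance) can be passed as `out` or `chunk`.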
The big change is that zarr.js would serve as an interface to read slices of massive datasets into memory, but if you want to do something fancy with those arrays, you need to choose a numpy-like library. This would mean there is also much less code we need to maintain, and we can punt on creating something numpy-like for the web. It doesn't need to be a part of zarr.js at all, but a layer on top that we don't need to be responsible for.
Proposal 3: reduce code for metadata validation
I think that metadata validation should be very minimal. We control how the metadata is written (if writing), and zarr-python also takes care of writing metadata, so I don't think we need the overhead of something like a runtime TypeScript interface checker. Instead, a small function that checks a few fields for compatibility in JavaScript should be enough.
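As a rough illustration of how small such a check could be, here is a sketch for v2 array metadata. The function name `validateArrayMetadata` is hypothetical, and which fields to check (and how strictly) is an assumption for the example; the field names themselves come from the zarr v2 `.zarray` layout.

```javascript
// Hypothetical helper: checks only a few fields a reader needs from
// v2 `.zarray` metadata and throws on anything it cannot handle.
function validateArrayMetadata(meta) {
  if (meta === null || typeof meta !== "object") {
    throw new TypeError("metadata must be an object");
  }
  if (meta.zarr_format !== 2) {
    throw new Error(`unsupported zarr_format: ${meta.zarr_format}`);
  }
  const isDims = (v) =>
    Array.isArray(v) && v.every((n) => Number.isInteger(n) && n >= 0);
  if (
    !isDims(meta.shape) ||
    !isDims(meta.chunks) ||
    meta.shape.length !== meta.chunks.length
  ) {
    throw new Error("shape and chunks must be integer arrays of equal length");
  }
  if (typeof meta.dtype !== "string") {
    throw new Error("only simple string dtypes are supported");
  }
  if (meta.order !== "C" && meta.order !== "F") {
    throw new Error(`invalid order: ${meta.order}`);
  }
  return meta;
}
```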
Proposal 4: Remove python-isms
We should think deeply about translating native python-isms to native JavaScript alternatives. For example, to implement a valid store, currently we need to throw a custom KeyError from zarr if a chunk is missing. This makes any store implementation require zarr.js as a dependency, and adds the additional (confusing) requirement that you can't bundle that store as a separate dependency: an error that is an instanceof myCustomStore.KeyError is not an instanceof zarr.KeyError if I bundle myCustomStore independently. I'd like to have the ability to:
import { openArray } from 'https://cdn.skypack.dev/zarr';
import MyCustomStore from 'https://cdn.skypack.dev/@manzt/store'; // no dependency on `zarr`
const store = new MyCustomStore();
const z = await openArray({ store });
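The instanceof problem described above can be reproduced without a bundler. In this sketch, `makeBundle` is a hypothetical stand-in for two separately bundled copies of the same error class:

```javascript
// Simulate two bundles that each contain their own copy of the error class.
function makeBundle() {
  class KeyError extends Error {}
  return { KeyError };
}

const zarrBundle = makeBundle(); // stands in for the copy inside `zarr`
const storeBundle = makeBundle(); // stands in for the copy bundled with a custom store

const err = new storeBundle.KeyError("missing chunk");
const recognizedByStore = err instanceof storeBundle.KeyError;
const recognizedByZarr = err instanceof zarrBundle.KeyError;
```

Here `recognizedByStore` is true but `recognizedByZarr` is false, because the two classes are distinct objects even though their source is identical.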
A KeyError is a built-in Python error that's thrown when a key is missing from a MutableMapping. In JavaScript, the equivalent is to just return undefined. This would remove a lot of the try/catch blocks from the core implementation and instead we would just call:
// Store returns 'undefined' instead of throwing custom error if chunk is missing.
// Otherwise the store can throw any other error and it will propagate up.
const chunk = await this.store.get(ckey);
if (!chunk) {
  /* create the missing chunk */
} else {
  /* decode the chunk */
}
As an aside, I also feel that we should standardize store interfaces to be an extension of the ES6 Map, but where the functions are optionally async.
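A minimal sketch of what "Map-like, optionally async" could mean in practice. `AsyncMemoryStore` and `readChunk` are hypothetical names used only for illustration; the point is that awaiting the result of `get` works for a plain Map and for a promise-returning store alike.

```javascript
// Hypothetical async store: same `get`/`has` names as Map, but returning promises.
class AsyncMemoryStore {
  constructor(entries) {
    this.map = new Map(entries);
  }
  async get(key) {
    return this.map.get(key);
  }
  async has(key) {
    return this.map.has(key);
  }
}

// Consumer: `await` accepts both plain values and promises, so one code path
// handles sync and async stores. A missing key is signalled by `undefined`,
// and any error the store throws simply propagates to the caller.
async function readChunk(store, key, makeFill) {
  const chunk = await store.get(key);
  return chunk === undefined ? makeFill() : chunk;
}
```

With this shape, `readChunk(new Map(), "0.0", makeFill)` and `readChunk(new AsyncMemoryStore(), "0.0", makeFill)` behave the same way, and neither store needs to import anything from zarr.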
Final thoughts
My apologies for the length of this issue, but these are some thoughts that I've had after using zarr.js on the web for some time. I wanted to note these ideas and my experiments with zarrita.js and zarr-lite here so that we could maybe incorporate some of those lessons learned into a leaner, easier-to-maintain zarr.js. This is something I might have the time to work on in the future, but it would mean substantial changes to this repo in its current state, so I don't want to proceed with PRs until we've had a longer discussion.
Excited to hear your thoughts.