Thoughts towards a leaner zarr.js

**TL;DR - I believe modularizing `zarr.js` further and stripping out unused/unnecessary features will make maintenance and adoption easier in the future. The key is deciding what are "core" features and prioritizing design around those as package exports.**

## Description

I love using `zarr-python` with `zarr.js`, but in its current state (even with the `/core` export) `zarr.js` is a bulky dependency (~150kb unminified). Although it is pure ESM and can be tree-shaken, the primary issue I've come across is that you [can't treeshake classes](https://www.youtube.com/watch?v=lsd2-TCgHEs&t=307s). As a result, in applications where I just need `ZarrArray.getRawChunk`, I still "pay the price" for `get`/`getRaw`/`set`/`setRaw`, which contain all the chunk indexing/slicing logic.

Because of this, I know that it's challenging to convince others to adopt `zarr` for web applications (cc: @kylebarron). For example, it would be awesome to add a zarr tile loader to [`loaders.gl`](https://loaders.gl/), but right now it makes more sense to duplicate parts of `zarr.js` to decrease the code footprint than to import from `zarr`.


## Proposal 1: fully tree-shakeable submodules

I've been experimenting with an implementation of the zarr v3 spec ([`zarrita.js`](https://github.com/manzt/zarrita.js)) that is structured so that you can progressively opt into more features (and code) if needed for your application. Right now the `zarrita/core` module only implements `ZarrArray.get_chunk`, and the top-level export [extends the ZarrArray prototype](https://github.com/manzt/zarrita.js/blob/aa865808f6a1188e5bc4f9eb8c9e8c5228537a48/src/zarrita.js#L7-L32) to add indexing, slicing, and writing to the array (e.g. `ZarrArray.set`, `ZarrArray.get`). Writing the v3 module in this way, I've been able to reuse the core/indexing logic in [`zarr-lite`](https://github.com/manzt/zarr-lite), which is a minimal v2 implementation for reading chunks and, optionally, array selections.

### Core zarr features

Specifically, I've been thinking about **what are the core features of zarr**? More motivation can be found on this [twitter thread](https://twitter.com/trevmanz/status/1326647318028890112). To me, there are tiers of use that I'd like a zarr implementation to have on the web:

1.) Traverse hierarchy (open nodes of hierarchy, e.g. groups and arrays) + **read** chunks by key (most minimal set of features useful for the web – `getRawChunk`)
2.) (1) + **read** slices of array using array selection (typically how you think of using zarr in python – `getRaw`)
3.) (1) + (2) + create (**write**) to hierarchy (generally unused on the web, but most feature-"complete")

Right now, `zarr.js` is (3), but there is no way to get just the behavior of 1 or 2. In `zarr-lite` I've implemented 1 and 2 as separate exports:

- (1) https://observablehq.com/@manzt/using-zarr-lite
- (2) https://observablehq.com/@manzt/using-zarr-lite-indexing

This works by using `Object.defineProperties` and extending the `ZarrArray` prototype with additional methods in the submodule: https://github.com/manzt/zarr-lite/blob/main/src/indexing.js. 
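
As a rough sketch of this pattern (the names `registerIndexing` and `chunkSize`, the in-memory `Map` store, and the toy 1D chunk layout are all hypothetical and not taken from the `zarr-lite` source):

```javascript
// core.js (hypothetical): only knows how to read a chunk by key.
class ZarrArray {
  constructor(store, chunkSize) {
    this.store = store;
    this.chunkSize = chunkSize;
  }

  async getRawChunk(key) {
    return this.store.get(key);
  }
}

// indexing.js (hypothetical): opt-in submodule that adds slicing to the prototype.
function registerIndexing(ArrayClass) {
  Object.defineProperties(ArrayClass.prototype, {
    getRaw: {
      // Toy 1D selection [start, stop) assembled from fixed-size chunks.
      async value(start, stop) {
        const out = new Float32Array(stop - start);
        for (let i = start; i < stop; i++) {
          const key = String(Math.floor(i / this.chunkSize));
          const chunk = await this.getRawChunk(key);
          out[i - start] = chunk[i % this.chunkSize];
        }
        return out;
      },
      writable: true,
      configurable: true,
    },
  });
}
```

An application that only reads chunks never calls `registerIndexing`, so a bundler can drop the indexing module entirely. An application that wants slicing calls it once and then uses `z.getRaw(...)` as a regular method.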

### Benefits of isolating core exports

The main benefit of isolating modules is a reduction in code size. You can import from the top-level export and get a "batteries included" zarr, but those writing other libraries or wanting to use core functionality can import just the submodules that make sense for their application. This means that the code for decoding chunks from an array won't be duplicated outside of `zarr.js`; instead, folks will just add zarr as a dependency.

I also think this has the added benefit of standardizing how features like #16 could be implemented without creating a cost for "core" library users. Win/win.

## Proposal 2: remove/move `RawArray` and `NestedArray` from `zarr.js`

In `zarrita.js`, I experimented with using a minimal object to represent an in-memory array: `{ data, shape, stride }`. As a result, the arrays returned from `ZarrArray.get` can be composed into [other JS ndarray libraries](https://github.com/manzt/zarrita.js#compatibility-with-ndarray).

```javascript
import ndarray from 'ndarray';

// Assumes `z` is an already opened ZarrArray and `slice` is the selection helper in scope.
const selection = [slice(1, 4), slice(2, 7)];
const { data, shape, stride } = await z.get(selection);
const arr = ndarray(data, shape, stride);

/* perform some array operations */
await z.set(selection, arr); // set using ndarray since arr has properties { data, shape, stride }.
```

This removes the need for any array class implementation, and `zarrita/indexing` only needs to implement a `set` function to fill an `out` + `out_selection` from a `chunk` + `chunk_selection`: https://github.com/manzt/zarrita.js/blob/main/src/ops.js
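
To make the idea concrete, here is a minimal sketch of such a function. It is not the code from `ops.js`: the name `setFromChunk` is hypothetical, and selections are simplified to a list of `[start, stop)` ranges per dimension with a step of 1.

```javascript
// Hypothetical helper: copy `chunk[chunkSel]` into `out[outSel]`.
// Arrays are plain objects { data, shape, stride }. Each selection is a list
// of [start, stop) ranges, one per dimension (step 1 only, for brevity).
function setFromChunk(out, outSel, chunk, chunkSel) {
  const ndim = outSel.length;

  const copy = (dim, outOffset, chunkOffset) => {
    const [outStart, outStop] = outSel[dim];
    const [chunkStart] = chunkSel[dim];
    for (let i = 0; i < outStop - outStart; i++) {
      // Advance both offsets along this dimension using each array's stride.
      const o = outOffset + (outStart + i) * out.stride[dim];
      const c = chunkOffset + (chunkStart + i) * chunk.stride[dim];
      if (dim === ndim - 1) {
        out.data[o] = chunk.data[c];
      } else {
        copy(dim + 1, o, c);
      }
    }
  };

  copy(0, 0, 0);
}

// Usage: place a 2x2 chunk into rows 1-2, columns 2-3 of a 4x4 output.
const out = { data: new Float32Array(16), shape: [4, 4], stride: [4, 1] };
const chunk = { data: Float32Array.of(1, 2, 3, 4), shape: [2, 2], stride: [2, 1] };
setFromChunk(out, [[1, 3], [2, 4]], chunk, [[0, 2], [0, 2]]);
```

Because the function only touches `data` and `stride`, anything shaped like `{ data, shape, stride }` (including an `ndarray` instance) can be passed as either argument.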

The big change is that `zarr.js` would serve as an interface to read slices of massive datasets into memory, but if you want to do something fancy with those arrays, you need to choose a numpy-like library. This would mean there is also much less code we need to maintain, and we can punt on creating something numpy-like for the web. It doesn't need to be a part of zarr.js at all, but a layer on top that we don't need to be responsible for.

## Proposal 3: reduce code for metadata validation

I think that metadata validation should be very minimal. We control how the metadata is written (if writing), and `zarr-python` also takes care of writing metadata. I don't think we need the overhead of something like a runtime TypeScript interface checker; a small function that checks a few fields for [compatibility in javascript](https://github.com/manzt/zarr-lite/blob/2d258e65bdf9ab40b0e3158d100d9e8f351cb89c/src/index.js#L56-L70) should be enough.
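
Something along these lines is what I have in mind. This is a sketch, not the linked `zarr-lite` code: the function name `assertSupported`, the particular dtype subset, and the C-order-only restriction are assumptions chosen for illustration.

```javascript
// Hypothetical minimal check for v2 `.zarray` metadata: only verify what a
// JavaScript reader cannot handle, and trust the writer for everything else.
const DTYPE_TO_TYPED_ARRAY = {
  '|u1': Uint8Array,
  '|i1': Int8Array,
  '<u2': Uint16Array,
  '<i2': Int16Array,
  '<u4': Uint32Array,
  '<i4': Int32Array,
  '<f4': Float32Array,
  '<f8': Float64Array,
};

function assertSupported(meta) {
  if (meta.zarr_format !== 2) {
    throw new Error(`Unsupported zarr_format: ${meta.zarr_format}`);
  }
  if (!(meta.dtype in DTYPE_TO_TYPED_ARRAY)) {
    throw new Error(`Unsupported dtype: ${meta.dtype}`);
  }
  // Assumption for this sketch: the reader only handles C-ordered chunks.
  if (meta.order !== 'C') {
    throw new Error(`Unsupported order: ${meta.order}`);
  }
  if (meta.shape.length !== meta.chunks.length) {
    throw new Error('shape and chunks must have the same number of dimensions');
  }
  return meta;
}
```

Usage would be a single call after parsing, e.g. `const meta = assertSupported(JSON.parse(text));`, with no schema library involved.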


## Proposal 4: Remove python-isms 

We should think deeply about translating native python-isms into native JavaScript alternatives. For example, to implement a valid store, we currently need to throw a custom `KeyError` from `zarr` if a chunk is missing. This makes any store implementation require `zarr.js` as a dependency. It also adds an additional (and confusing) requirement: the store can't be bundled as a separate dependency, because if I bundle `myCustomStore` independently, its `KeyError` is a different class from `zarr.KeyError` and an `instanceof zarr.KeyError` check on the thrown error fails. I'd like to have the ability to:

```javascript
import { openArray } from 'https://cdn.skypack.dev/zarr';
import MyCustomStore from 'https://cdn.skypack.dev/@manzt/store'; // no dependency on `zarr`

const store = MyCustomStore();
const z = await openArray({ store });
```
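
The `instanceof` problem can be reproduced without any bundler. In this sketch, `zarrBundle` and `storeBundle` are hypothetical stand-ins for two independently bundled copies of the same error class, and `isMissingKey` is an assumed shape for the library's internal check:

```javascript
// Two independently bundled copies of the "same" error class.
const zarrBundle = { KeyError: class KeyError extends Error {} };
const storeBundle = { KeyError: class KeyError extends Error {} };

// Assumed shape of the library-side check: identity against its own class.
function isMissingKey(err) {
  return err instanceof zarrBundle.KeyError;
}

const err = new storeBundle.KeyError('0.0');
console.log(err.constructor.name); // "KeyError"
console.log(isMissingKey(err)); // false
```

The names match, but the classes are different objects, so the library treats the store's "missing chunk" signal as an unexpected error.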

A `KeyError` is a built-in Python error that's thrown when a key is missing from a `MutableMapping`. In JavaScript, the equivalent is to just return `undefined`. This would remove a lot of the `try`/`catch` blocks from the core implementation; instead we would just call:

```javascript
// Store returns 'undefined' instead of throwing custom error if chunk is missing.
// Otherwise the store can throw any other error and it will propagate up.
const chunk = await this.store.get(ckey);
if (!chunk) {
  /* create the missing chunk */ 
} else {
  /* decode the chunk */
}
``` 

As an aside, I also feel that we should standardize store interfaces as an extension of the ES6 `Map`, but one where all the functions are optionally async.
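
A rough sketch of what that could look like. `readChunk` and `HTTPStore` are hypothetical names, and the injectable `fetchFn` parameter is only there so the class can be exercised without a network:

```javascript
// `await` passes plain values through unchanged, so a synchronous Map and an
// async store can be used interchangeably by the reader.
async function readChunk(store, key) {
  return await store.get(key); // `undefined` means the chunk is missing
}

// A native Map already satisfies the synchronous version of the interface.
const memoryStore = new Map([['0.0', new Uint8Array([1, 2, 3])]]);

// Hypothetical async store exposing the same method names as Map.
class HTTPStore {
  constructor(root, fetchFn = globalThis.fetch) {
    this.root = root;
    this.fetchFn = fetchFn;
  }

  async get(key) {
    const res = await this.fetchFn(`${this.root}/${key}`);
    if (res.status === 404) return undefined; // missing chunk, not an error
    if (!res.ok) throw new Error(`Unexpected response ${res.status} for ${key}`);
    return new Uint8Array(await res.arrayBuffer());
  }

  async has(key) {
    return (await this.get(key)) !== undefined;
  }
}
```

Neither store imports anything from `zarr`, and any other failure (e.g. a 500 response) still propagates as a normal error.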

## Final thoughts

My apologies for the length of this issue, but these are some thoughts that I've had after using `zarr.js` on the web for some time. I wanted to note these ideas and my experiments with `zarrita.js` and `zarr-lite` here so that we could maybe incorporate some of those lessons learned into a leaner, easier-to-maintain `zarr.js`. This is something I might have the time to work on in the future, but it would mean substantial changes to this repo in its current state, so I don't want to proceed with PRs until we've had a longer discussion.

 Excited to hear your thoughts.
