67 changes: 67 additions & 0 deletions .claude/skills/bench-report.md
@@ -0,0 +1,67 @@
---
name: bench-report
description: Generate an HTML benchmark comparison report from pytest-benchmark JSON files
user_invocable: true
argument_prompt: "Directory containing benchmark JSON files (default: .benchmarks/)"
---

Generate a self-contained HTML benchmark report with two views (Tables and Box Plots) from pytest-benchmark JSON files.

## Steps

1. Glob `{directory}/*.json` to find all benchmark result files. If the user provided no directory, use `.benchmarks/`.

2. Parse each filename to extract the store and version. Filenames follow the pattern `{store}_{version}_{version}_{timestamp}.json` where `{version}` appears twice. The store portion is everything before the first `pypi-` token. For example `s3_pypi-nightly_pypi-nightly_20260320-2224.json` has store=`s3` and version=`nightly`; `s3_ob_pypi-v1_pypi-v1_20260323-1959.json` has store=`s3-object-store` and version=`v1`. Rename `s3_ob` → `s3-object-store` for display. Also handle filenames from pytest-benchmark's `--benchmark-save` which look like `Linux-CPython-3.14-64bit/0001_s3_pypi-nightly_pypi-nightly.json`.
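The parsing rules above can be sketched as a small helper (the function name is illustrative; the skill performs this step ad hoc rather than through a fixed API):

```python
import re

def parse_benchmark_filename(name: str) -> tuple[str, str]:
    """Return (store, version) from a benchmark result filename."""
    stem = name.rsplit("/", 1)[-1].removesuffix(".json")
    # --benchmark-save files carry a numeric counter prefix like "0001_".
    stem = re.sub(r"^\d+_", "", stem)
    # The store is everything before the first "pypi-" token.
    store, _, rest = stem.partition("_pypi-")
    # rest is "{version}_pypi-{version}" with an optional "_{timestamp}" tail.
    version = rest.split("_", 1)[0]
    if store == "s3_ob":
        store = "s3-object-store"  # display-name rename from step 2
    return store, version
```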

3. For each unique `{store}_{version}` combination, extract the benchmark data, including the full box-plot statistics:
```bash
jq -s '[.[] | .benchmarks[] | {name: .name, group: .group, min: .stats.min, q1: .stats.q1, median: .stats.median, q3: .stats.q3, max: .stats.max, rounds: .stats.rounds}]'
```

4. Combine all extracted data into a single JSON object keyed by `{store}_{version}` (e.g. `s3_nightly`, `gcs_v1`, `s3-object-store_nightly`).

5. Read the HTML template from `.benchmarks/report.html`. Find the existing `const DATA = ...;` block and replace it with the new combined JSON. If the template doesn't exist, inform the user it needs to be created first.

6. Write the updated HTML back to `.benchmarks/report.html`.

7. Open the report in the browser with `open .benchmarks/report.html`.

## Report layout

The report is a single-page self-contained HTML file (no external dependencies). It has two top-level view modes toggled by prominent buttons: **Tables** and **Box Plots**.

### Shared controls (always visible)

- **Reads / Writes toggle** — splits benchmarks into reads (`test_time_*`) and writes (`test_write_*`, `test_set_*`). Everything re-renders based on selection.
- **Text search** filter
- **Group sections** — benchmarks grouped by their `group` field. For reads: Zarr (`zarr-read`), Xarray (`xarray-read`), Other (null group). For writes: Refs (`refs-write`), Other (null group). Each group gets a colored section/header row.

### Tables view

Contains sub-tabs:

- **Summary cards** with geometric mean ratios across stores and versions (scoped to current read/write selection)
- **v1 vs nightly tab** — compares versions per store with ratio bars. Group header rows.
- **S3 vs S3-object-store vs GCS tab** — compares stores with S3 as baseline (ob/s3 and gcs/s3 ratios). Group header rows.
- **All Results tab** — flat table with best values highlighted. Group header rows.
- **"Hide results within 15%" checkbox**
- **Ratio toggle**: flip between nightly/v1 and v1/nightly direction
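The geometric-mean computation behind the summary cards can be sketched as follows (the report itself does this in inline JavaScript; the function name is illustrative):

```python
import math

def geomean_ratio(baseline: dict[str, float], candidate: dict[str, float]) -> float:
    """Geometric mean of candidate/baseline medians over shared benchmarks."""
    names = baseline.keys() & candidate.keys()
    # Averaging in log space makes reciprocal speedups/slowdowns cancel out.
    logs = [math.log(candidate[n] / baseline[n]) for n in names]
    return math.exp(sum(logs) / len(logs))
```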

### Box Plots view

- **Color-coded legend** at top — one color per `{store}_{version}` key. Fixed color map:
- `s3_v1`: blue, `s3_nightly`: green
- `s3-object-store_v1`: orange, `s3-object-store_nightly`: pink
- `gcs_v1`: purple, `gcs_nightly`: yellow
- **One card per benchmark** containing:
- Benchmark function name + labeled pills
- Horizontal box & whisker canvas plot with all store/version combos stacked vertically
- Shared x-axis with auto-scaled units (μs, ms, s) and grid lines
- Each box: whisker min→Q1, box Q1→Q3, thick median line, whisker Q3→max, caps
- Labels on left show `{store}_{version}` key

### Labeled colored pills (both views)

Each fixture parameter is shown as a `fixture: value` pill (e.g. `dataset: gb-8mb`, `preload: default`, `commit: True`), with a distinct color per fixture type. Compound tokens are kept as single pills using a greedy longest-match against the known tokens.

Token-to-fixture mapping: datasets (`gb-8mb`, `gb-128mb`, `large-manifest-*`, `simple-1d`, `large-1d`, `pancake-writes`), preload (`default`, `off`), selector (`single-chunk`, `full-read`), splitting (`no-splitting`, `split-size-10_000`), config (`default-inlined`, `not-inlined`), commit (`True`, `False`), executor (`threads`).
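The greedy longest-match splitting can be sketched with an assumed token list (the `large-manifest-*` entries here are expanded to the two fixture ids used elsewhere in this PR):

```python
KNOWN_TOKENS = [
    "large-manifest-no-split", "large-manifest-split", "split-size-10_000",
    "default-inlined", "not-inlined", "single-chunk", "full-read",
    "no-splitting", "pancake-writes", "gb-128mb", "gb-8mb",
    "simple-1d", "large-1d", "default", "threads", "True", "False", "off",
]

def split_pills(param_string: str) -> list[str]:
    """Split a '-'-joined pytest param id into pills, longest match first."""
    pills = []
    rest = param_string
    while rest:
        # Try known compound tokens first, longest to shortest.
        for tok in sorted(KNOWN_TOKENS, key=len, reverse=True):
            if rest == tok or rest.startswith(tok + "-"):
                pills.append(tok)
                rest = rest[len(tok):].lstrip("-")
                break
        else:
            # Unknown token: take up to the next dash.
            head, _, rest = rest.partition("-")
            pills.append(head)
    return pills
```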
11 changes: 11 additions & 0 deletions Justfile
@@ -100,6 +100,17 @@ samply *args:
chrome-trace *args:
ICECHUNK_TRACE=chrome cargo bench --features logs --bench main -- {{args}} --test

# install benchmark deps and build icechunk in release mode
bench-build:
cd icechunk-python && uv sync --group benchmark && env -u CONDA_PREFIX uv run maturin develop --uv --release

# create/refresh benchmark datasets (run once, or after format changes)
bench-setup *args='':
    cd icechunk-python && uv run pytest -nauto -m setup_benchmarks --benchmark-disable benchmarks/ {{args}}

# run benchmarks (pass extra pytest args, e.g.: just bench "-k getsize")
bench *args='':
cd icechunk-python && uv run pytest --benchmark-autosave benchmarks/ {{args}}
[doc("Compare pytest-benchmark results")]
bench-compare *args:
    pytest-benchmark compare --group=group,func,param --sort=fullname --columns=median --name=short {{args}}
143 changes: 143 additions & 0 deletions icechunk-python/benchmarks/.claude/CLAUDE.md
@@ -0,0 +1,143 @@
# Benchmarks CLAUDE.md

## Overview

Integration benchmarks exercising the Xarray/Zarr/Icechunk stack end-to-end, built on `pytest-benchmark`.

## Quick Reference

```bash
just bench-build # uv sync + maturin develop --release
just bench-setup # create datasets (once, ~3 min)
just bench # run all benchmarks
just bench "-k getsize" # run specific benchmarks
just bench-compare 0020 0021 # compare saved runs
```

All commands run from the repo root. Under the hood they `cd icechunk-python` and use `uv run`.

## File Map

| File | Purpose |
|---|---|
| `test_benchmark_reads.py` | Read benchmarks (store open, getsize, zarr/xarray open, chunk reads, first-byte) |
| `test_benchmark_writes.py` | Write benchmarks (task-based writes, 1D writes, chunk refs, virtual refs, split manifests) |
| `datasets.py` | Dataset definitions (`BenchmarkReadDataset`, `BenchmarkWriteDataset`, `IngestDataset`) and setup functions |
| `conftest.py` | Pytest fixtures, custom options (`--where`, `--icechunk-prefix`, `--force-setup`), markers |
| `runner.py` | Multi-version orchestration: clones repo, builds each ref, runs setup + benchmarks, compares |
| `tasks.py` | Low-level concurrent write tasks using `ForkSession` with thread/process pool executors |
| `helpers.py` | Utilities: logger, coiled kwargs, git commit resolution, `repo_config_with()`, splitting config |
| `lib.py` | Math/timing: `stats()`, `slices_from_chunks()`, `normalize_chunks()`, `Timer` context manager |
| `create_era5.py` | ERA5 dataset creation using Coiled + Dask (separate from `setup_benchmarks` due to cost) |
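As one example of what `lib.py` provides, the idea behind `slices_from_chunks()` can be sketched as follows (a minimal sketch; the real helper may differ in signature and edge-case handling):

```python
import itertools

def slices_from_chunks(shape: tuple[int, ...], chunks: tuple[int, ...]):
    """Yield tuple-of-slice regions covering `shape` chunk by chunk."""
    per_dim = [
        # Trailing chunks are clipped to the array boundary.
        [slice(start, min(start + step, size)) for start in range(0, size, step)]
        for size, step in zip(shape, chunks)
    ]
    yield from itertools.product(*per_dim)
```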

## Architecture

### Datasets (`datasets.py`)

`StorageConfig` wraps bucket/prefix/region and constructs `ic.Storage` objects. `Dataset` wraps a `StorageConfig` and provides `create()` (with optional `clear`) and a `.store` property.

Two specialized subclasses:
- **`BenchmarkReadDataset`** — adds `load_variables`, `chunk_selector`, `full_load_selector`, `first_byte_variable`, `setupfn`
- **`BenchmarkWriteDataset`** — adds `num_arrays`, `shape`, `chunks`

Predefined datasets:

| Name | Type | Description |
|---|---|---|
| `ERA5_SINGLE` | Read | Single NCAR ERA5 netCDF (~17k chunks) |
| `ERA5_ARCO` | Read | ARCO-ERA5 from GCP (metadata only, no data arrays written) |
| `GB_8MB_CHUNKS` | Read | 512^3 int64 array, 4x512x512 chunks |
| `GB_128MB_CHUNKS` | Read | 512^3 int64 array, 64x512x512 chunks |
| `LARGE_MANIFEST_UNSHARDED` | Read | 500M x 1000 array, no manifest splitting |
| `LARGE_MANIFEST_SHARDED` | Read | 500M x 1000 array, split_size=100k |
| `PANCAKE_WRITES` | Write | 320x720x1441, chunks=(1,-1,-1) |
| `SIMPLE_1D` | Write | 2M elements, chunks=1000 |
| `LARGE_1D` | Write | 500M elements, chunks=1000 |

### Storage Targets

Controlled by `--where` flag. Buckets defined in `TEST_BUCKETS` dict:

| Store | Bucket | Region |
|---|---|---|
| `local` | platformdirs cache | - |
| `s3` | `icechunk-ci` | us-east-1 |
| `s3_ob` | (same as s3, uses `s3_object_store_storage`) | us-east-1 |
| `gcs` | `icechunk-test-gcp` | us-east1 |
| `tigris` | `icechunk-test` | iad |
| `r2` | `icechunk-test-r2` | us-east-1 |

### Pytest Markers

- `@pytest.mark.setup_benchmarks` — dataset creation (run with `-m setup_benchmarks`)
- `@pytest.mark.read_benchmark` — all read tests
- `@pytest.mark.write_benchmark` — all write tests

### Fixtures (`conftest.py`)

- `synth_dataset` — parameterized over read datasets (currently large-manifest-no-split, large-manifest-split)
- `synth_write_dataset` — PANCAKE_WRITES
- `simple_write_dataset` — SIMPLE_1D
- `large_write_dataset` — LARGE_1D
- `repo` — local tmpdir repo with virtual chunk container configured

`request_to_dataset()` applies `--where` and `--icechunk-prefix` to any dataset fixture.

### Runner (`runner.py`)

Multi-version benchmarking orchestrator. Two runner classes:

- **`LocalRunner`** — clones repo to `/tmp/icechunk-bench-{commit}`, runs `uv sync --group benchmark`, `maturin develop --release`, copies benchmarks/ from CWD, executes via `uv run`
- **`CoiledRunner`** — creates Coiled software environments, runs on cloud VMs (m5.4xlarge / n2-standard-16), syncs results via S3

Usage: `python benchmarks/runner.py [--where local|s3|gcs] [--setup force|skip] [--pytest "-k pattern"] ref1 ref2 ...`

Datasets written to `{bucket}/benchmarks/{ref}_{shortcommit}/`.

### Task-Based Writes (`tasks.py`)

Uses `ForkSession` for concurrent writes:
1. Create tasks with `ForkSession` + region slices
2. Submit to thread/process pool
3. Each worker writes a chunk region via zarr
4. Merge all `ForkSession`s back into parent session
5. Commit
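The five steps can be illustrated with a stand-in session object; this is deliberately not icechunk's real `ForkSession` API, only the shape of the control flow in `tasks.py`:

```python
from concurrent.futures import ThreadPoolExecutor

class FakeSession:
    """Stand-in mimicking the fork -> write -> merge -> commit flow."""
    def __init__(self):
        self.writes = []
    def fork(self):
        return FakeSession()
    def merge(self, *forks):
        for f in forks:
            self.writes.extend(f.writes)
    def commit(self, message):
        return f"{message}: {len(self.writes)} chunks"

def write_region(fork, region):
    fork.writes.append(region)  # real workers write a zarr chunk region here
    return fork

session = FakeSession()
forks = [session.fork() for _ in range(4)]                 # 1. fork per task
with ThreadPoolExecutor() as pool:                          # 2. submit to pool
    done = list(pool.map(write_region, forks, range(4)))    # 3. parallel writes
session.merge(*done)                                        # 4. merge forks back
result = session.commit("benchmark write")                  # 5. commit
```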

## Read Benchmarks (`test_benchmark_reads.py`)

| Test | What It Measures |
|---|---|
| `test_time_create_store` | Repository.open + readonly_session + store creation |
| `test_time_getsize_key` | `store.getsize(key)` for zarr.json metadata keys |
| `test_time_getsize_prefix` | `array.nbytes_stored()` (prefix-based size aggregation) |
| `test_time_zarr_open` | Cold `zarr.open_group` (re-downloads snapshot each round) |
| `test_time_zarr_members` | `group.members()` enumeration |
| `test_time_xarray_open` | `xr.open_zarr` with `chunks=None, consolidated=False` |
| `test_time_xarray_read_chunks_cold_cache` | Full open + isel + compute (parameterized: single-chunk vs full-read, with preload fixture) |
| `test_time_xarray_read_chunks_hot_cache` | Repeated compute on pre-opened dataset (measures chunk fetch, not metadata) |
| `test_time_first_bytes` | Open group + read coordinate array (sensitive to manifest splitting) |

The `preload` fixture parameterizes `ManifestPreloadConfig` (default vs off).

## Write Benchmarks (`test_benchmark_writes.py`)

| Test | What It Measures |
|---|---|
| `test_write_chunks_with_tasks` | Concurrent task-based writes (ThreadPool/ProcessPool), captures per-task timings |
| `test_write_simple_1d` | Simple array write + commit cycle (good for comparing S3 vs GCS latency) |
| `test_write_many_chunk_refs` | Writing 10k chunk refs, parameterized: inlined vs not, committed vs not |
| `test_set_many_virtual_chunk_refs` | Setting 100k virtual chunk refs via `store.set_virtual_ref()` |
| `test_write_split_manifest_refs_full_rewrite` | Commit time for 500k virtual refs (full rewrite), parameterized by splitting |
| `test_write_split_manifest_refs_append` | Commit time for incremental appends of virtual refs, 10 rounds |

The `splitting` fixture parameterizes `ManifestSplittingConfig` (None vs split_size=10000).

## Key Patterns

- **Benchmark results** saved to `.benchmarks/` as JSON via `--benchmark-autosave` or `--benchmark-save=NAME`
- **Comparing runs**: `pytest-benchmark compare 0020 0021 --group=func,param --columns=median --name=short`
- **Cold vs hot cache**: cold benchmarks re-create store/repo inside the benchmarked function; hot benchmarks create once outside
- **`pedantic` mode**: used in task writes and split-manifest tests for finer control (`setup=` callback, explicit `rounds`/`iterations`)
- **`benchmark.extra_info`**: task-based writes record per-task timing statistics in the JSON output
- **zarr async concurrency**: set to 64 globally in read benchmarks, configurable in writes
- **Version compatibility**: `conftest.py` uses `pytest_configure` hook instead of `pyproject.toml` markers to support older icechunk versions
7 changes: 7 additions & 0 deletions icechunk-python/benchmarks/.claude/settings.local.json
@@ -0,0 +1,7 @@
{
"permissions": {
"allow": [
"Bash(just bench-build:*)"
]
}
}