67 changes: 67 additions & 0 deletions .claude/skills/bench-report.md
@@ -0,0 +1,67 @@
---
name: bench-report
description: Generate an HTML benchmark comparison report from pytest-benchmark JSON files
user_invocable: true
argument_prompt: "Directory containing benchmark JSON files (default: .benchmarks/)"
---

Generate a self-contained HTML benchmark report with two views (Tables and Box Plots) from pytest-benchmark JSON files.

## Steps

1. Glob `{directory}/*.json` to find all benchmark result files. If the user provided no directory, use `.benchmarks/`.

2. Parse each filename to extract the store and version. Filenames follow the pattern `{store}_{version}_{version}_{timestamp}.json` where `{version}` appears twice. The store portion is everything before the first `pypi-` token. For example `s3_pypi-nightly_pypi-nightly_20260320-2224.json` has store=`s3` and version=`nightly`; `s3_ob_pypi-v1_pypi-v1_20260323-1959.json` has store=`s3-object-store` and version=`v1`. Rename `s3_ob` → `s3-object-store` for display. Also handle filenames from pytest-benchmark's `--benchmark-save` which look like `Linux-CPython-3.14-64bit/0001_s3_pypi-nightly_pypi-nightly.json`.
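The parsing rules above can be sketched as a small helper (the function name is illustrative; the skill performs this step ad hoc rather than through a fixed API):

```python
import re

def parse_benchmark_filename(name: str) -> tuple[str, str]:
    """Return (store, version) from a benchmark result filename."""
    stem = name.rsplit("/", 1)[-1].removesuffix(".json")
    # --benchmark-save files carry a numeric counter prefix like "0001_".
    stem = re.sub(r"^\d+_", "", stem)
    # The store is everything before the first "pypi-" token.
    store, _, rest = stem.partition("_pypi-")
    # rest is "{version}_pypi-{version}" with an optional "_{timestamp}" tail.
    version = rest.split("_", 1)[0]
    if store == "s3_ob":
        store = "s3-object-store"  # display-name rename from step 2
    return store, version
```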

3. For each unique `{store}_{version}` combination, extract the benchmark data, including the full box-plot statistics:
```bash
jq -s '[.[] | .benchmarks[] | {name: .name, group: .group, min: .stats.min, q1: .stats.q1, median: .stats.median, q3: .stats.q3, max: .stats.max, rounds: .stats.rounds}]'
```

4. Combine all extracted data into a single JSON object keyed by `{store}_{version}` (e.g. `s3_nightly`, `gcs_v1`, `s3-object-store_nightly`).

5. Read the HTML template from `.benchmarks/report.html`. Find the existing `const DATA = ...;` block and replace it with the new combined JSON. If the template doesn't exist, inform the user it needs to be created first.

6. Write the updated HTML back to `.benchmarks/report.html`.

7. Open the report in the browser with `open .benchmarks/report.html`.

## Report layout

The report is a single-page self-contained HTML file (no external dependencies). It has two top-level view modes toggled by prominent buttons: **Tables** and **Box Plots**.

### Shared controls (always visible)

- **Reads / Writes toggle** — splits benchmarks into reads (`test_time_*`) and writes (`test_write_*`, `test_set_*`). Everything re-renders based on selection.
- **Text search** filter
- **Group sections** — benchmarks grouped by their `group` field. For reads: Zarr (`zarr-read`), Xarray (`xarray-read`), Other (null group). For writes: Refs (`refs-write`), Other (null group). Each group gets a colored section/header row.

### Tables view

Contains sub-tabs:

- **Summary cards** with geometric mean ratios across stores and versions (scoped to current read/write selection)
- **v1 vs nightly tab** — compares versions per store with ratio bars. Group header rows.
- **S3 vs S3-object-store vs GCS tab** — compares stores with S3 as baseline (ob/s3 and gcs/s3 ratios). Group header rows.
- **All Results tab** — flat table with best values highlighted. Group header rows.
- **"Hide results within 15%" checkbox**
- **Ratio toggle**: flip between nightly/v1 and v1/nightly direction
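The geometric-mean computation behind the summary cards can be sketched as follows (the report itself does this in inline JavaScript; the function name is illustrative):

```python
import math

def geomean_ratio(baseline: dict[str, float], candidate: dict[str, float]) -> float:
    """Geometric mean of candidate/baseline medians over shared benchmarks."""
    names = baseline.keys() & candidate.keys()
    # Averaging in log space makes reciprocal speedups/slowdowns cancel out.
    logs = [math.log(candidate[n] / baseline[n]) for n in names]
    return math.exp(sum(logs) / len(logs))
```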

### Box Plots view

- **Color-coded legend** at top — one color per `{store}_{version}` key. Fixed color map:
- `s3_v1`: blue, `s3_nightly`: green
- `s3-object-store_v1`: orange, `s3-object-store_nightly`: pink
- `gcs_v1`: purple, `gcs_nightly`: yellow
- **One card per benchmark** containing:
- Benchmark function name + labeled pills
- Horizontal box & whisker canvas plot with all store/version combos stacked vertically
- Shared x-axis with auto-scaled units (μs, ms, s) and grid lines
- Each box: whisker min→Q1, box Q1→Q3, thick median line, whisker Q3→max, caps
- Labels on left show `{store}_{version}` key

### Labeled colored pills (both views)

Each fixture parameter is shown as a `fixture: value` pill (e.g. `dataset: gb-8mb`, `preload: default`, `commit: True`), with a distinct color per fixture type. Compound tokens are kept as single pills using a greedy longest-match against the known tokens.

Token-to-fixture mapping: datasets (`gb-8mb`, `gb-128mb`, `large-manifest-*`, `simple-1d`, `large-1d`, `pancake-writes`), preload (`default`, `off`), selector (`single-chunk`, `full-read`), splitting (`no-splitting`, `split-size-10_000`), config (`default-inlined`, `not-inlined`), commit (`True`, `False`), executor (`threads`).
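The greedy longest-match splitting can be sketched with an assumed token list (the `large-manifest-*` entries here are expanded to the two fixture ids used elsewhere in this PR):

```python
KNOWN_TOKENS = [
    "large-manifest-no-split", "large-manifest-split", "split-size-10_000",
    "default-inlined", "not-inlined", "single-chunk", "full-read",
    "no-splitting", "pancake-writes", "gb-128mb", "gb-8mb",
    "simple-1d", "large-1d", "default", "threads", "True", "False", "off",
]

def split_pills(param_string: str) -> list[str]:
    """Split a '-'-joined pytest param id into pills, longest match first."""
    pills = []
    rest = param_string
    while rest:
        # Try known compound tokens first, longest to shortest.
        for tok in sorted(KNOWN_TOKENS, key=len, reverse=True):
            if rest == tok or rest.startswith(tok + "-"):
                pills.append(tok)
                rest = rest[len(tok):].lstrip("-")
                break
        else:
            # Unknown token: take up to the next dash.
            head, _, rest = rest.partition("-")
            pills.append(head)
    return pills
```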
11 changes: 11 additions & 0 deletions Justfile
@@ -100,6 +100,17 @@ samply *args:
chrome-trace *args:
ICECHUNK_TRACE=chrome cargo bench --features logs --bench main -- {{args}} --test

# install benchmark deps and build icechunk in release mode
bench-build:
cd icechunk-python && uv sync --group benchmark && env -u CONDA_PREFIX uv run maturin develop --uv --release

# create/refresh benchmark datasets (run once, or after format changes)
bench-setup *args='':
    cd icechunk-python && uv run pytest -nauto -m setup_benchmarks --benchmark-disable benchmarks/ {{args}}

# run benchmarks (pass extra pytest args, e.g.: just bench "-k getsize")
bench *args='':
cd icechunk-python && uv run pytest --benchmark-autosave benchmarks/ {{args}}
[doc("Compare pytest-benchmark results")]
bench-compare *args:
    pytest-benchmark compare --group=group,func,param --sort=fullname --columns=median --name=short {{args}}
143 changes: 143 additions & 0 deletions icechunk-python/benchmarks/.claude/CLAUDE.md
@@ -0,0 +1,143 @@
# Benchmarks CLAUDE.md

## Overview

Integration benchmarks exercising the Xarray/Zarr/Icechunk stack end-to-end, built on `pytest-benchmark`.

## Quick Reference

```bash
just bench-build # uv sync + maturin develop --release
just bench-setup # create datasets (once, ~3 min)
just bench # run all benchmarks
just bench "-k getsize" # run specific benchmarks
just bench-compare 0020 0021 # compare saved runs
```

All commands run from the repo root. Under the hood they `cd icechunk-python` and use `uv run`.

## File Map

| File | Purpose |
|---|---|
| `test_benchmark_reads.py` | Read benchmarks (store open, getsize, zarr/xarray open, chunk reads, first-byte) |
| `test_benchmark_writes.py` | Write benchmarks (task-based writes, 1D writes, chunk refs, virtual refs, split manifests) |
| `datasets.py` | Dataset definitions (`BenchmarkReadDataset`, `BenchmarkWriteDataset`, `IngestDataset`) and setup functions |
| `conftest.py` | Pytest fixtures, custom options (`--where`, `--icechunk-prefix`, `--force-setup`), markers |
| `runner.py` | Multi-version orchestration: clones repo, builds each ref, runs setup + benchmarks, compares |
| `tasks.py` | Low-level concurrent write tasks using `ForkSession` with thread/process pool executors |
| `helpers.py` | Utilities: logger, coiled kwargs, git commit resolution, `repo_config_with()`, splitting config |
| `lib.py` | Math/timing: `stats()`, `slices_from_chunks()`, `normalize_chunks()`, `Timer` context manager |
| `create_era5.py` | ERA5 dataset creation using Coiled + Dask (separate from `setup_benchmarks` due to cost) |
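As one example of what `lib.py` provides, the idea behind `slices_from_chunks()` can be sketched as follows (a minimal sketch; the real helper may differ in signature and edge-case handling):

```python
import itertools

def slices_from_chunks(shape: tuple[int, ...], chunks: tuple[int, ...]):
    """Yield tuple-of-slice regions covering `shape` chunk by chunk."""
    per_dim = [
        # Trailing chunks are clipped to the array boundary.
        [slice(start, min(start + step, size)) for start in range(0, size, step)]
        for size, step in zip(shape, chunks)
    ]
    yield from itertools.product(*per_dim)
```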

## Architecture

### Datasets (`datasets.py`)

`StorageConfig` wraps bucket/prefix/region and constructs `ic.Storage` objects. `Dataset` wraps a `StorageConfig` and provides `create()` (with optional `clear`) and a `.store` property.

Two specialized subclasses:
- **`BenchmarkReadDataset`** — adds `load_variables`, `chunk_selector`, `full_load_selector`, `first_byte_variable`, `setupfn`
- **`BenchmarkWriteDataset`** — adds `num_arrays`, `shape`, `chunks`

Predefined datasets:

| Name | Type | Description |
|---|---|---|
| `ERA5_SINGLE` | Read | Single NCAR ERA5 netCDF (~17k chunks) |
| `ERA5_ARCO` | Read | ARCO-ERA5 from GCP (metadata only, no data arrays written) |
| `GB_8MB_CHUNKS` | Read | 512^3 int64 array, 4x512x512 chunks |
| `GB_128MB_CHUNKS` | Read | 512^3 int64 array, 64x512x512 chunks |
| `LARGE_MANIFEST_UNSHARDED` | Read | 500M x 1000 array, no manifest splitting |
| `LARGE_MANIFEST_SHARDED` | Read | 500M x 1000 array, split_size=100k |
| `PANCAKE_WRITES` | Write | 320x720x1441, chunks=(1,-1,-1) |
| `SIMPLE_1D` | Write | 2M elements, chunks=1000 |
| `LARGE_1D` | Write | 500M elements, chunks=1000 |

### Storage Targets

Controlled by `--where` flag. Buckets defined in `TEST_BUCKETS` dict:

| Store | Bucket | Region |
|---|---|---|
| `local` | platformdirs cache | - |
| `s3` | `icechunk-ci` | us-east-1 |
| `s3_ob` | (same as s3, uses `s3_object_store_storage`) | us-east-1 |
| `gcs` | `icechunk-test-gcp` | us-east1 |
| `tigris` | `icechunk-test` | iad |
| `r2` | `icechunk-test-r2` | us-east-1 |

### Pytest Markers

- `@pytest.mark.setup_benchmarks` — dataset creation (run with `-m setup_benchmarks`)
- `@pytest.mark.read_benchmark` — all read tests
- `@pytest.mark.write_benchmark` — all write tests

### Fixtures (`conftest.py`)

- `synth_dataset` — parameterized over read datasets (currently large-manifest-no-split, large-manifest-split)
- `synth_write_dataset` — PANCAKE_WRITES
- `simple_write_dataset` — SIMPLE_1D
- `large_write_dataset` — LARGE_1D
- `repo` — local tmpdir repo with virtual chunk container configured

`request_to_dataset()` applies `--where` and `--icechunk-prefix` to any dataset fixture.

### Runner (`runner.py`)

Multi-version benchmarking orchestrator. Two runner classes:

- **`LocalRunner`** — clones repo to `/tmp/icechunk-bench-{commit}`, runs `uv sync --group benchmark`, `maturin develop --release`, copies benchmarks/ from CWD, executes via `uv run`
- **`CoiledRunner`** — creates Coiled software environments, runs on cloud VMs (m5.4xlarge / n2-standard-16), syncs results via S3

Usage: `python benchmarks/runner.py [--where local|s3|gcs] [--setup force|skip] [--pytest "-k pattern"] ref1 ref2 ...`

Datasets written to `{bucket}/benchmarks/{ref}_{shortcommit}/`.

### Task-Based Writes (`tasks.py`)

Uses `ForkSession` for concurrent writes:
1. Create tasks with `ForkSession` + region slices
2. Submit to thread/process pool
3. Each worker writes a chunk region via zarr
4. Merge all `ForkSession`s back into parent session
5. Commit
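The five steps can be illustrated with a stand-in session object; this is deliberately not icechunk's real `ForkSession` API, only the shape of the control flow in `tasks.py`:

```python
from concurrent.futures import ThreadPoolExecutor

class FakeSession:
    """Stand-in mimicking the fork -> write -> merge -> commit flow."""
    def __init__(self):
        self.writes = []
    def fork(self):
        return FakeSession()
    def merge(self, *forks):
        for f in forks:
            self.writes.extend(f.writes)
    def commit(self, message):
        return f"{message}: {len(self.writes)} chunks"

def write_region(fork, region):
    fork.writes.append(region)  # real workers write a zarr chunk region here
    return fork

session = FakeSession()
forks = [session.fork() for _ in range(4)]                 # 1. fork per task
with ThreadPoolExecutor() as pool:                          # 2. submit to pool
    done = list(pool.map(write_region, forks, range(4)))    # 3. parallel writes
session.merge(*done)                                        # 4. merge forks back
result = session.commit("benchmark write")                  # 5. commit
```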

## Read Benchmarks (`test_benchmark_reads.py`)

| Test | What It Measures |
|---|---|
| `test_time_create_store` | Repository.open + readonly_session + store creation |
| `test_time_getsize_key` | `store.getsize(key)` for zarr.json metadata keys |
| `test_time_getsize_prefix` | `array.nbytes_stored()` (prefix-based size aggregation) |
| `test_time_zarr_open` | Cold `zarr.open_group` (re-downloads snapshot each round) |
| `test_time_zarr_members` | `group.members()` enumeration |
| `test_time_xarray_open` | `xr.open_zarr` with `chunks=None, consolidated=False` |
| `test_time_xarray_read_chunks_cold_cache` | Full open + isel + compute (parameterized: single-chunk vs full-read, with preload fixture) |
| `test_time_xarray_read_chunks_hot_cache` | Repeated compute on pre-opened dataset (measures chunk fetch, not metadata) |
| `test_time_first_bytes` | Open group + read coordinate array (sensitive to manifest splitting) |

The `preload` fixture parameterizes `ManifestPreloadConfig` (default vs off).

## Write Benchmarks (`test_benchmark_writes.py`)

| Test | What It Measures |
|---|---|
| `test_write_chunks_with_tasks` | Concurrent task-based writes (ThreadPool/ProcessPool), captures per-task timings |
| `test_write_simple_1d` | Simple array write + commit cycle (good for comparing S3 vs GCS latency) |
| `test_write_many_chunk_refs` | Writing 10k chunk refs, parameterized: inlined vs not, committed vs not |
| `test_set_many_virtual_chunk_refs` | Setting 100k virtual chunk refs via `store.set_virtual_ref()` |
| `test_write_split_manifest_refs_full_rewrite` | Commit time for 500k virtual refs (full rewrite), parameterized by splitting |
| `test_write_split_manifest_refs_append` | Commit time for incremental appends of virtual refs, 10 rounds |

The `splitting` fixture parameterizes `ManifestSplittingConfig` (None vs split_size=10000).

## Key Patterns

- **Benchmark results** saved to `.benchmarks/` as JSON via `--benchmark-autosave` or `--benchmark-save=NAME`
- **Comparing runs**: `pytest-benchmark compare 0020 0021 --group=func,param --columns=median --name=short`
- **Cold vs hot cache**: cold benchmarks re-create store/repo inside the benchmarked function; hot benchmarks create once outside
- **`pedantic` mode**: used in task writes and split-manifest tests for finer control (`setup=` callback, explicit `rounds`/`iterations`)
- **`benchmark.extra_info`**: task-based writes record per-task timing statistics in the JSON output
- **zarr async concurrency**: set to 64 globally in read benchmarks, configurable in writes
- **Version compatibility**: `conftest.py` uses `pytest_configure` hook instead of `pyproject.toml` markers to support older icechunk versions
7 changes: 7 additions & 0 deletions icechunk-python/benchmarks/.claude/settings.local.json
@@ -0,0 +1,7 @@
{
"permissions": {
"allow": [
"Bash(just bench-build:*)"
]
}
}