
Commit 897914b

lazappi, keller-mark, Artur-man, and rcannood authored
Add Zarr to benchmarks (#446)
* WIP
* More tests passing
* Fix df read bug
* More tests passing after fixing zero-dimensional get bug in pizzarr
* WIP: writing
* Fix more tests
* Zarr df writing
* WIP: ZarrAnnData class
* Tests passing
* Tests that compare h5ad to zarr
* Use Rarr to read full numeric arrays
* Fix bugs. Add test for from_SingleCellExperiment with Zarr
* Add a to_dense param to ZarrAnnData constructor. Add overwrite params internally.
* Update
* Backwards dense/sparse
* Simplify how obs and var names handled in ZarrAnnData (similar to #171)
* update extdata and documentation
* fix set/get zarr _index, update text example.zarr and update tests similar to HDF5AnnData
* Fix test
* Revert unnecessary changes
* Formatting
* Add comments
* remove unnecessary example zarr store
* lintr and R check for zarr related utilities and functions, updated some documentation
* add pizzarr to Suggests and README
* proj
* add keller-mark/pizzarr to Remotes
* zip example.zarr
* adapt read_zarr to Rarr
* adapt write_zarr to Rarr
* remove old scripts
* update write_zarr
* initial update to ZarrAnnData
* update ZarrAnnData, documentation, and implement read_zarr_rec_array
* review read zarr helpers, and update tests
* update read_zarr, read tests pass
* some updates for writing zarr
* update write_empty_zarr
* remove pizzarr, update documentation
* remove pizzarr from tests
* fix test-ZarrAnnData
* update ZarrAnnData to imitate HDF5AnnData
* check redundant files, correct lines
* update example_h5ad.py, add zarr and change to example_files.py
* add new test example
* some linting changes
* remove read/write_zattrs since implemented in Rarr
* access read/write_zarr_attr
* add some missing tests
* update readers, update tests
* correct nullable string zarr array write/read, introduce ordering in categorical zarr array
* do some linting, fix commented out code
* update some zarr writers and classes
* fix documentation
* fix compression interface for zarr
* full lint check
* fix examples
* check, biocheck and lintr
* fix development status
* air format
* air format test
* update example.zarr.zip, skip some test (waiting for Rarr)
* update example.zarr, fix some read_zarr_
* fix examples
* remove overwrite
* R code styling
* fixes from @lazappi
* air format
* update some documentation
* fix some tests
* more fixes on anndata-zarr integration
* update ZarrAnnData$initialize
* update zarr compression
* fix column-order here, C based ordering for arrays
* implement roundtrip tests for anndata-zarr
* add zarr to vignettes
* update README and software_design.rmd
* update AnnData-usage
* update write_zarr documentation
* update write_zarr_null
* fix rec_array, update tests and example datasets
* fix duplicate chunks in Rmd
* add write_zarr_null
* update write string array (zarr), air and lint
* implement writing empty zarr elements
* update tests for rec_array conformance of h5ad and zarr
* update mapping conformance test for h5ad and zarr
* implement H5_ITER like ordering and fix h5ad vs zarr testing
* air and lint
* fix test bug
* do not call expect_equal outside of test
* implement examples, test and datasets for zarr v3
* fix issues, lint and update example datasets to new anndata version
* lint and merge
* revert some lines
* small changes
* revert small changes
* air format some tests
* Set v2 in write_zarr_* helpers
* Fix stop message in write_zarr_element()
* Fix roxygen comment in write_zarr_element()
* Expand compression list in as_ZarrAnnData()
* Fix H5_ITER_INC_ORDERING docs
* Fix as_ZarrAnnData() compression docs
* Fix Zarr varm roundtrip test
* Review duplicate entry in README
* Fix typo in AnnData-usage docs
* Fix comma in software design vignette
* Adjust class descriptions in software design vignette
* Roxygenise
* Minor text fixes
* Document .get_compressor
* Minor fixes to function docs
* Comment logic in create_zarr_group()
* Fix indentation in read_zarr_sparse_array()
* Add construct sparse matrix helper
* Add ZARR_METADATA_FILES vector
* Eval Zarr chunks in vignettes
* Merge test-Zarrv3-read.R into test-Zarr-read.R
* Combine roundtrip tests
* Add roundtrip test helpers
* Refactor test-h5ad-zarr.R to use helper
* Refactor example files script
  - Use uv comments for dependencies
  - Update dependencies to latest stable versions
  - Add progress messages
  - Use Ruff for formatting
* Remove H5_ITER_INC_ORDERING()
* Remove Zarr compression comment
* Fix factor creation in read_zarr_categorical()
* Pin Rarr version
* Delete existing Zarr path before writing
* Add helper functions for accessing Zarr keys
* Update read_zarr_element() error message
* Add dimname warnings to ZarrAnnData
* Add Zarr writeability checks/tests
* Roxygenise, lint, style
* Use setup-bioc for all GHA
* Update WORDLIST
* Add .venv to .Rbuildignore
* Clean up test output
* Add Zarr to benchmarks
* run air format

---------

Co-authored-by: Mark Keller <7525285+keller-mark@users.noreply.github.com>
Co-authored-by: Artur-man <artur-man@hotmail.com>
Co-authored-by: Robrecht Cannoodt <rcannood@gmail.com>
1 parent 4269af5 commit 897914b

8 files changed

Lines changed: 214 additions & 24 deletions


benchmarks/lib/helpers.R

Lines changed: 17 additions & 0 deletions
@@ -83,6 +83,23 @@ generate_bench_h5ad <- function(x_type, n_obs, n_vars, cache_dir) {
   path
 }

+#' Convert an H5AD bench file to a Zarr store and cache it
+#'
+#' @param x_type Matrix type key (matches h5ad_paths names)
+#' @param h5ad_path Path to the corresponding H5AD file
+#' @param cache_dir Directory to cache generated stores
+#' @return Path to the generated Zarr store directory
+generate_bench_zarr <- function(x_type, h5ad_path, cache_dir) {
+  path <- file.path(cache_dir, paste0("bench_", x_type, ".zarr"))
+  if (dir.exists(path)) {
+    return(path)
+  }
+  ad <- reticulate::import("anndata", convert = FALSE)
+  adata_py <- ad$read_h5ad(h5ad_path)
+  adata_py$write_zarr(path)
+  path
+}
+
 # ---------------------------------------------------------------------------
 # bench::mark → BMF JSON conversion
 # ---------------------------------------------------------------------------
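
Note on the caching in `generate_bench_zarr()`: the early return treats any existing directory at the target path as a finished store. The following is a minimal base-R sketch of that build-once pattern. The `cached_dir()` helper and the stand-in `build_store()` builder are hypothetical and are not part of this commit; the real function converts through `reticulate` and `anndata`.

```r
# Hypothetical helper (not in the commit): build a directory once and reuse
# it on later calls, mirroring the dir.exists() early return shown above.
cached_dir <- function(path, build) {
  if (dir.exists(path)) {
    return(path)
  }
  build(path)
  path
}

# Stand-in builder that only creates the directory and counts its calls.
n_builds <- 0
build_store <- function(path) {
  n_builds <<- n_builds + 1
  dir.create(path)
}

cache_dir <- tempfile("bench_cache_")
dir.create(cache_dir)
store <- file.path(cache_dir, "bench_demo.zarr")

first <- cached_dir(store, build_store)   # builds the directory
second <- cached_dir(store, build_store)  # reuses it, no second build
```

One consequence of checking only `dir.exists()` is that a store left half-written by an interrupted run would also be reused, so clearing the cache directory after a failed conversion is probably worthwhile.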

benchmarks/run_benchmarks.R

Lines changed: 22 additions & 6 deletions
@@ -73,6 +73,22 @@ h5ad_paths <- setNames(
 )
 cat("\n")

+cat("Generating Zarr test data (converting from H5AD)...\n")
+zarr_paths <- setNames(
+  vapply(
+    x_types,
+    function(xt) {
+      cat(sprintf(" %s... ", xt))
+      path <- generate_bench_zarr(xt, h5ad_paths[[xt]], cache_dir)
+      cat("done\n")
+      path
+    },
+    character(1)
+  ),
+  x_types
+)
+cat("\n")
+
 # ---------------------------------------------------------------------------
 # Run selected suites
 # ---------------------------------------------------------------------------
@@ -88,12 +104,12 @@ for (suite in suites_to_run) {

   suite_results <- switch(
     suite,
-    read = bench_read(h5ad_paths, opts$iterations, x_types),
-    write = bench_write(h5ad_paths, opts$iterations, x_types),
-    get = bench_get(h5ad_paths, opts$iterations),
-    set = bench_set(h5ad_paths, opts$iterations),
-    convert = bench_convert(h5ad_paths, opts$iterations, x_types),
-    subset = bench_subset(h5ad_paths, opts$iterations),
+    read = bench_read(h5ad_paths, opts$iterations, x_types, zarr_paths),
+    write = bench_write(h5ad_paths, opts$iterations, x_types, zarr_paths),
+    get = bench_get(h5ad_paths, opts$iterations, zarr_paths),
+    set = bench_set(h5ad_paths, opts$iterations, zarr_paths),
+    convert = bench_convert(h5ad_paths, opts$iterations, x_types, zarr_paths),
+    subset = bench_subset(h5ad_paths, opts$iterations, zarr_paths),
     {
       warning("Unknown suite: ", suite)
       list()
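
Note on `zarr_paths`: the suites look paths up by matrix type key (`zarr_paths[[xt]]`), so the vector has to be named by `x_types`. Below is a small base-R sketch of the same `setNames(vapply(...))` construction. The `fake_generate()` function and the `float_matrix` key are placeholders chosen for illustration; only `float_csparse` appears in this diff.

```r
# Placeholder keys; "float_matrix" is illustrative, not taken from the commit.
x_types <- c("float_matrix", "float_csparse")

# Stand-in for generate_bench_zarr(): returns a path without touching disk.
fake_generate <- function(xt) {
  file.path("cache", paste0("bench_", xt, ".zarr"))
}

# Same shape as the commit: one character(1) result per key, named by key.
zarr_paths <- setNames(vapply(x_types, fake_generate, character(1)), x_types)

# vapply() already names its result when X is an unnamed character vector.
bare <- vapply(x_types, fake_generate, character(1))

csparse_path <- zarr_paths[["float_csparse"]]
```

Because `vapply()` names its result from an unnamed character input by default, the `setNames()` wrapper mostly serves as an explicit guarantee that the lookup keys are the `x_types` values.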

benchmarks/suites/bench_convert.R

Lines changed: 38 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,7 @@
55
# format conversions (InMemory↔SCE, InMemory↔Seurat).
66
# =============================================================================
77

8-
bench_convert <- function(h5ad_paths, iterations, x_types) {
8+
bench_convert <- function(h5ad_paths, iterations, x_types, zarr_paths) {
99
results <- list()
1010

1111
# --- Backend conversions (per X type) ---
@@ -47,6 +47,43 @@ bench_convert <- function(h5ad_paths, iterations, x_types) {
4747
)
4848
}
4949

50+
# --- Zarr ↔ InMemory conversions (per X type) ---
51+
for (xt in x_types) {
52+
zarr_path <- zarr_paths[[xt]]
53+
54+
# Zarr → InMemory
55+
env <- new.env(parent = globalenv())
56+
env$.ad <- read_zarr(zarr_path, as = "ZarrAnnData")
57+
58+
results <- c(
59+
results,
60+
run_one_benchmark(
61+
name = paste0("convert_Zarr_to_InMemory_", xt),
62+
expr = quote(.ad$as_InMemoryAnnData()),
63+
iterations = iterations,
64+
env = env
65+
)
66+
)
67+
68+
# InMemory → Zarr
69+
env2 <- new.env(parent = globalenv())
70+
env2$.ad <- read_zarr(zarr_path, as = "InMemoryAnnData")
71+
72+
results <- c(
73+
results,
74+
run_one_benchmark(
75+
name = paste0("convert_InMemory_to_Zarr_", xt),
76+
expr = quote({
77+
.tmp <- tempfile()
78+
.result <- .ad$as_ZarrAnnData(.tmp)
79+
unlink(.tmp, recursive = TRUE)
80+
}),
81+
iterations = iterations,
82+
env = env2
83+
)
84+
)
85+
}
86+
5087
# --- Format conversions (using float_csparse as representative) ---
5188
path <- h5ad_paths[["float_csparse"]]
5289
ad <- read_h5ad(path, as = "InMemoryAnnData")

benchmarks/suites/bench_get.R

Lines changed: 13 additions & 4 deletions
@@ -42,15 +42,24 @@
   colnames = quote(colnames(.ad))
 )

-bench_get <- function(h5ad_paths, iterations) {
+bench_get <- function(h5ad_paths, iterations, zarr_paths) {
   results <- list()
   path <- h5ad_paths[["float_csparse"]]

-  for (backend in c("InMemoryAnnData", "HDF5AnnData")) {
-    short <- if (backend == "InMemoryAnnData") "InMemory" else "HDF5"
+  for (backend in c("InMemoryAnnData", "HDF5AnnData", "ZarrAnnData")) {
+    short <- switch(
+      backend,
+      InMemoryAnnData = "InMemory",
+      HDF5AnnData = "HDF5",
+      ZarrAnnData = "Zarr"
+    )

     # Open the AnnData
-    ad <- read_h5ad(path, as = backend)
+    ad <- if (backend == "ZarrAnnData") {
+      read_zarr(zarr_paths[["float_csparse"]], as = "ZarrAnnData")
+    } else {
+      read_h5ad(path, as = backend)
+    }

     # --- Slot getters ---
     for (slot in .bench_slots) {
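
Note on the backend labels: the same `switch()` mapping from class name to short label appears in `bench_get.R`, `bench_set.R`, and `bench_subset.R`. As a sketch only, it could be factored into one helper. The `backend_short()` name is hypothetical and does not exist in this commit, and the explicit `stop()` default is an added assumption, since a `switch()` without a default returns `NULL` invisibly for an unmatched value.

```r
# Hypothetical helper (not in the commit): map a backend class name to the
# short label used in benchmark names, failing loudly on unknown backends.
backend_short <- function(backend) {
  switch(
    backend,
    InMemoryAnnData = "InMemory",
    HDF5AnnData = "HDF5",
    ZarrAnnData = "Zarr",
    stop("Unknown backend: ", backend)
  )
}

backends <- c("InMemoryAnnData", "HDF5AnnData", "ZarrAnnData")
labels <- vapply(backends, backend_short, character(1), USE.NAMES = FALSE)

# An unmatched name raises an error instead of silently yielding NULL.
unknown <- try(backend_short("OtherAnnData"), silent = TRUE)
```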

benchmarks/suites/bench_read.R

Lines changed: 28 additions & 1 deletion
@@ -5,7 +5,7 @@
 # across different X matrix types.
 # =============================================================================

-bench_read <- function(h5ad_paths, iterations, x_types) {
+bench_read <- function(h5ad_paths, iterations, x_types, zarr_paths) {
   results <- list()

   for (xt in x_types) {
@@ -37,5 +37,32 @@ bench_read <- function(h5ad_paths, iterations, x_types) {
     )
   }

+  # Read from Zarr store
+  for (xt in x_types) {
+    path <- zarr_paths[[xt]]
+
+    # Read Zarr → InMemoryAnnData
+    results <- c(
+      results,
+      run_one_benchmark(
+        name = paste0("read_zarr_InMemory_", xt),
+        expr = quote(read_zarr(.path, as = "InMemoryAnnData")),
+        setup = bquote(.path <- .(path)),
+        iterations = iterations
+      )
+    )
+
+    # Open Zarr lazily → ZarrAnnData
+    results <- c(
+      results,
+      run_one_benchmark(
+        name = paste0("read_zarr_Zarr_", xt),
+        expr = quote(read_zarr(.path, as = "ZarrAnnData")),
+        setup = bquote(.path <- .(path)),
+        iterations = iterations
+      )
+    )
+  }
+
   results
 }
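
Note on `quote()` and `bquote()` in these calls: `expr` is passed unevaluated, while `setup` uses `bquote()` so that the current value of `path` is substituted into the expression at construction time. The internals of `run_one_benchmark()` are not part of this diff, so the sketch below only assumes that it evaluates `setup` and then `expr` in some environment; `nchar()` stands in for `read_zarr()` so that nothing is read from disk.

```r
path <- "bench_demo.zarr"  # placeholder path, never opened

# bquote() splices the value of `path` into the call; quote() would keep
# the symbol `path` and depend on it still being in scope later.
setup <- bquote(.path <- .(path))
expr <- quote(nchar(.path))  # stand-in for read_zarr(.path, ...)

# Assumed evaluation order: setup first, then the timed expression.
env <- new.env(parent = globalenv())
eval(setup, envir = env)
result <- eval(expr, envir = env)
```

After `eval(setup, envir = env)`, `.path` exists only inside `env`, which keeps the benchmark variables out of the caller's workspace.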

benchmarks/suites/bench_set.R

Lines changed: 21 additions & 6 deletions
@@ -4,7 +4,7 @@
 # Benchmarks setting every AnnData slot on both InMemory and HDF5 backends.
 # =============================================================================

-bench_set <- function(h5ad_paths, iterations) {
+bench_set <- function(h5ad_paths, iterations, zarr_paths) {
   results <- list()
   path <- h5ad_paths[["float_csparse"]]

@@ -22,12 +22,25 @@ bench_set <- function(h5ad_paths, iterations) {
     "uns"
   )

-  for (backend in c("InMemoryAnnData", "HDF5AnnData")) {
-    short <- if (backend == "InMemoryAnnData") "InMemory" else "HDF5"
+  for (backend in c("InMemoryAnnData", "HDF5AnnData", "ZarrAnnData")) {
+    short <- switch(
+      backend,
+      InMemoryAnnData = "InMemory",
+      HDF5AnnData = "HDF5",
+      ZarrAnnData = "Zarr"
+    )

     for (slot in slots) {
-      # For HDF5, we need a fresh writable copy for each slot
-      if (backend == "HDF5AnnData") {
+      # Each backend needs a fresh writable instance per slot
+      if (backend == "ZarrAnnData") {
+        # Copy Zarr store directory so each slot gets a fresh writable copy
+        zarr_path <- zarr_paths[["float_csparse"]]
+        tmp_parent <- tempfile()
+        dir.create(tmp_parent, recursive = TRUE)
+        file.copy(zarr_path, tmp_parent, recursive = TRUE)
+        tmp <- file.path(tmp_parent, basename(zarr_path))
+        ad <- read_zarr(tmp, as = "ZarrAnnData", mode = "r+")
+      } else if (backend == "HDF5AnnData") {
         tmp <- tempfile(fileext = ".h5ad")
         file.copy(path, tmp)
         ad <- suppressWarnings(
@@ -55,7 +68,9 @@ bench_set <- function(h5ad_paths, iterations) {
         )
       )

-      if (backend == "ZarrAnnData") {
+      if (backend == "ZarrAnnData") {
+        unlink(tmp, recursive = TRUE)
+      } else if (backend == "HDF5AnnData") {
         ad$close()
         unlink(tmp)
       }
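
Note on the Zarr copy in `bench_set()`: a Zarr store is a directory, so the writable copy is made with `file.copy(..., recursive = TRUE)` into a fresh parent directory, and the copy ends up at `file.path(tmp_parent, basename(zarr_path))`. The base-R sketch below reproduces that step with a dummy directory; the `demo.zarr` name and its single file are made up for illustration.

```r
# Dummy "store": a directory holding one metadata-like file.
src <- file.path(tempfile("src_"), "demo.zarr")
dir.create(src, recursive = TRUE)
writeLines("{}", file.path(src, "zarr.json"))

# Copy the directory into a fresh parent, as the benchmark does.
tmp_parent <- tempfile("copy_")
dir.create(tmp_parent)
ok <- file.copy(src, tmp_parent, recursive = TRUE)
copy <- file.path(tmp_parent, basename(src))
copied_files <- list.files(copy)

# Removing the parent also removes the copied store inside it.
unlink(tmp_parent, recursive = TRUE)
```

In the diff, cleanup calls `unlink(tmp, recursive = TRUE)` on the copied store only, which appears to leave the empty `tmp_parent` directory in the session's temporary folder; unlinking `tmp_parent` instead, as sketched here, would remove both.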

benchmarks/suites/bench_subset.R

Lines changed: 29 additions & 5 deletions
@@ -5,13 +5,22 @@
 # materialization back to concrete implementations.
 # =============================================================================

-bench_subset <- function(h5ad_paths, iterations) {
+bench_subset <- function(h5ad_paths, iterations, zarr_paths) {
   results <- list()
   path <- h5ad_paths[["float_csparse"]]

-  for (backend in c("InMemoryAnnData", "HDF5AnnData")) {
-    short <- if (backend == "InMemoryAnnData") "InMemory" else "HDF5"
-    ad <- read_h5ad(path, as = backend)
+  for (backend in c("InMemoryAnnData", "HDF5AnnData", "ZarrAnnData")) {
+    short <- switch(
+      backend,
+      InMemoryAnnData = "InMemory",
+      HDF5AnnData = "HDF5",
+      ZarrAnnData = "Zarr"
+    )
+    ad <- if (backend == "ZarrAnnData") {
+      read_zarr(zarr_paths[["float_csparse"]], as = "ZarrAnnData")
+    } else {
+      read_h5ad(path, as = backend)
+    }

     n_obs <- ad$n_obs()
     n_vars <- ad$n_vars()
@@ -123,7 +132,22 @@ bench_subset <- function(h5ad_paths, iterations) {
       )
     )

-    # Clean up
+    # --- Materialize view → Zarr ---
+    results <- c(
+      results,
+      run_one_benchmark(
+        name = paste0("materialize_to_Zarr_", short),
+        expr = quote({
+          .tmp <- tempfile()
+          .result <- .view$as_ZarrAnnData(.tmp)
+          unlink(.tmp, recursive = TRUE)
+        }),
+        iterations = iterations,
+        env = env4
+      )
+    )
+
+    # Clean up (ZarrAnnData holds no persistent file handles)
     if (backend == "HDF5AnnData") {
       ad$close()
     }

benchmarks/suites/bench_write.R

Lines changed: 46 additions & 1 deletion
@@ -5,7 +5,7 @@
 # with different compression settings and X matrix types.
 # =============================================================================

-bench_write <- function(h5ad_paths, iterations, x_types) {
+bench_write <- function(h5ad_paths, iterations, x_types, zarr_paths) {
   results <- list()

   compressions <- c("none", "gzip")
@@ -57,5 +57,50 @@ bench_write <- function(h5ad_paths, iterations, x_types) {
     }
   }

+  # Write to Zarr store
+  for (xt in x_types) {
+    path <- zarr_paths[[xt]]
+
+    for (compression in compressions) {
+      # Write from InMemoryAnnData → Zarr
+      env <- new.env(parent = globalenv())
+      env$.ad <- read_zarr(path, as = "InMemoryAnnData")
+      env$.compression <- compression
+
+      results <- c(
+        results,
+        run_one_benchmark(
+          name = paste0("write_zarr_InMemory_", xt, "_", compression),
+          expr = quote({
+            .tmp <- tempfile()
+            .ad$as_ZarrAnnData(.tmp, compression = .compression)
+            unlink(.tmp, recursive = TRUE)
+          }),
+          iterations = iterations,
+          env = env
+        )
+      )
+
+      # Write from ZarrAnnData → Zarr
+      env2 <- new.env(parent = globalenv())
+      env2$.ad <- read_zarr(path, as = "ZarrAnnData")
+      env2$.compression <- compression
+
+      results <- c(
+        results,
+        run_one_benchmark(
+          name = paste0("write_zarr_Zarr_", xt, "_", compression),
+          expr = quote({
+            .tmp <- tempfile()
+            .ad$as_ZarrAnnData(.tmp, compression = .compression)
+            unlink(.tmp, recursive = TRUE)
+          }),
+          iterations = iterations,
+          env = env2
+        )
+      )
+    }
+  }
+
   results
 }
