feat: replace DuckDB with in-memory AnnDataStore; faithful h5ad round-trip export #39

Open
patcon wants to merge 6 commits into main from feat/anndata-store-replace-duckdb

patcon commented May 22, 2026

Summary

  • Drops DuckDB-WASM entirely (~100 MB of WASM assets + ~10 MB JS bundle removed) and replaces all vote queries with synchronous typed-array operations over a dense Float32Array vote matrix (AnnDataStore)
  • Normalizes all import modes at load time: h5ad, Kedro, and local-file paths all produce the same AnnDataStore singleton, so the rest of the app is mode-agnostic
  • Faithful h5ad round-trip export: the download now starts from the original file bytes and patches in only what changed — preserving X, varm, varp, obsp, all var columns, all uns fields, full-dimensional obsm embeddings (e.g. PCA), and HDF5 group attributes
  • User-recomputed projections included in export: DruidJS results (e.g. X_umap_recomputed) are written into obsm on download
  • calculateRepresentativeStatements and calculateStatementVoteStats are now synchronous — no async DuckDB connection needed
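
To illustrate why the vote queries no longer need an async connection, here is a minimal sketch of a per-statement tally over a dense, row-major Float32Array. The `VoteMatrix` and `StatementVoteStats` shapes, the function name, and the encoding (+1 agree, -1 disagree, 0 pass, NaN for no vote) are assumptions for illustration, not the actual AnnDataStore API.

```typescript
// Hypothetical shapes: the real AnnDataStore fields may differ.
interface VoteMatrix {
  nObs: number; // participants (rows)
  nVar: number; // statements (columns)
  data: Float32Array; // row-major, length nObs * nVar; NaN means "no vote" (assumed)
}

interface StatementVoteStats {
  agree: number;
  disagree: number;
  pass: number;
  total: number;
}

// Synchronous column scan: one strided pass over the typed array.
function statementVoteStats(m: VoteMatrix, col: number): StatementVoteStats {
  if (col < 0 || col >= m.nVar) {
    throw new RangeError(`column ${col} out of range`);
  }
  const stats: StatementVoteStats = { agree: 0, disagree: 0, pass: 0, total: 0 };
  for (let row = 0; row < m.nObs; row++) {
    const v = m.data[row * m.nVar + col];
    if (Number.isNaN(v)) continue; // participant did not see or vote on this statement
    if (v > 0) stats.agree++;
    else if (v < 0) stats.disagree++;
    else stats.pass++;
    stats.total++;
  }
  return stats;
}
```

Because the whole matrix is already in memory, a caller can compute stats for every statement in a plain loop without awaiting anything.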

Key files

| File | Change |
| --- | --- |
| src/lib/anndata-store.ts | New: central in-memory store, toH5adBytes() with raw-bytes round-trip |
| src/lib/parquet-reader.ts | New: hyparquet-based Parquet reader replacing DuckDB's read_parquet |
| src/lib/h5ad-loader.ts | Returns rawBytes for lossless export |
| src/lib/duckdb.ts | Deleted |
| public/duckdb/ | Deleted (7 WASM/worker files, ~100 MB) |
| packages/reddwarf-ts/src/db.ts | Pure functional, no DuckDB |
| packages/reddwarf-ts/src/representative-statements.ts | Synchronous |

Test plan

  • Load an h5ad file → download it → verify downloaded file contains X, varm, varp, obsp, all original var columns, and obs index uses the original column name (e.g. voter-id)
  • Paint groups → download → verify obs/manual_painted reflects current painting
  • Run Recompute (DruidJS) → download → verify obsm/X_<algo>_recomputed is present
  • Vote heatmap, representative statements, and metrics layer all work after h5ad import
  • Kedro and local-file modes still load and display correctly
  • Bundle size: main JS chunk ~33 kB smaller; no public/duckdb/ directory in build output

🤖 Generated with Claude Code (code and ~200 words of PR description from ~120 words of human prompts across this session)

patcon and others added 6 commits May 21, 2026 22:38
All import modes (h5ad, Kedro, local files) now normalize into a
singleton AnnDataStore at load time. Vote queries become synchronous
typed-array operations over a dense Float32Array vote matrix.

Adds:
- src/lib/anndata-store.ts: central store with vote matrix + h5ad export
- src/lib/parquet-reader.ts: hyparquet-based Parquet loader for Kedro/local votes
- "Download h5ad" button (merges painted groups into obs/manual_painted)
- hyparquet + hyparquet-compressors dependencies

Removes DuckDB from the vote-query hot path; reddwarf-ts/db.ts and
calculateRepresentativeStatements are now synchronous pure functions.
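
As a rough sketch of the normalization step, the function below turns long-format vote rows (the kind of records a Parquet reader would yield for the Kedro and local-file modes) into a dense matrix. The `VoteRow` field names and the NaN fill for missing votes are assumptions; the actual column names and store layout in this PR may differ.

```typescript
// Hypothetical row shape for long-format votes.
interface VoteRow {
  voterId: string;
  commentId: string;
  vote: number;
}

interface DenseVotes {
  obsNames: string[]; // row labels, in first-seen order
  varNames: string[]; // column labels, in first-seen order
  data: Float32Array; // row-major, obsNames.length * varNames.length
}

function buildDenseVotes(rows: VoteRow[]): DenseVotes {
  const obsIndex = new Map<string, number>();
  const varIndex = new Map<string, number>();
  // First pass: assign a stable row/column position to each id.
  for (const r of rows) {
    if (!obsIndex.has(r.voterId)) obsIndex.set(r.voterId, obsIndex.size);
    if (!varIndex.has(r.commentId)) varIndex.set(r.commentId, varIndex.size);
  }
  const nVar = varIndex.size;
  // Second pass: fill the matrix; cells never written stay NaN (no vote).
  const data = new Float32Array(obsIndex.size * nVar).fill(NaN);
  for (const r of rows) {
    const row = obsIndex.get(r.voterId)!;
    const col = varIndex.get(r.commentId)!;
    data[row * nVar + col] = r.vote;
  }
  return {
    obsNames: Array.from(obsIndex.keys()),
    varNames: Array.from(varIndex.keys()),
    data,
  };
}
```

If the same voter and comment appear more than once, the last row wins in this sketch; a real loader would need an explicit policy for duplicates.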

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Deletes duckdb.ts, the @duckdb/duckdb-wasm npm dep, the 100 MB public/duckdb/
WASM assets, and the now-unused kedroBaseUrl/pipelineId props in MapOverlay
and StatementExplorerDrawer. All vote queries are now served by AnnDataStore.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…round-trip

AnnDataStore now stores rawH5adBytes from the original import. toH5adBytes()
uses them as the source: it opens the raw file, copies all var columns (not
just content/moderation_state), all uns fields, full-dimensional obsm embeddings
(PCA etc.), and HDF5 group attributes to a new output file — then patches in the
current manual_painted state and any new user-computed obsm projections.

Previously the download silently dropped extra var columns, uns metadata, and
high-dimensional obsm — only the tiny subset AnnDataStore explicitly tracked
was written back.
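
The copy-then-patch idea can be sketched without any HDF5 library by modelling the file as a tree of plain objects. Everything here (`H5Node`, `patchTree`, the path syntax) is a hypothetical stand-in for the h5wasm-based toH5adBytes(); it only shows that untouched nodes are carried over as-is while patched paths are replaced or created.

```typescript
// Hypothetical in-memory stand-in for an HDF5 file tree.
type H5Dataset = { kind: "dataset"; values: number[] | string[] };
type H5Group = { kind: "group"; children: Record<string, H5Node> };
type H5Node = H5Dataset | H5Group;

// Returns a new tree: everything from `source` is kept, and each "a/b/c"
// path in `patches` is overwritten or created. `source` is never mutated.
function patchTree(source: H5Group, patches: Record<string, H5Dataset>): H5Group {
  const out: H5Group = { kind: "group", children: { ...source.children } };
  for (const path of Object.keys(patches)) {
    const parts = path.split("/");
    const leaf = parts.pop()!;
    let cursor = out;
    for (const part of parts) {
      const existing = cursor.children[part];
      // Copy the group on the way down so the source stays untouched.
      const next: H5Group =
        existing !== undefined && existing.kind === "group"
          ? { kind: "group", children: { ...existing.children } }
          : { kind: "group", children: {} };
      cursor.children[part] = next;
      cursor = next;
    }
    cursor.children[leaf] = patches[path];
  }
  return out;
}
```

With this shape, groups the app never tracked (for example extra var columns or uns fields) survive simply because nothing patches them.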

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

X, varm, varp, obsp, and any other top-level groups not explicitly
handled by the app are now copied from the source file. Previously only
obs/var/obsm/layers/uns were written.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…export

Two bugs in the round-trip copy:
- copyGroupContents was calling create_dataset without shape, so any 2D
  dataset (notably X) was written as a flat 1D array. Now passes
  child.metadata.shape so dimensions are preserved.
- obs and var index datasets were always named '_index' regardless of the
  original column name (e.g. 'voter-id', 'comment-id'). Now reads the
  source group's _index attribute to use the original name.
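
Both fixes come down to carrying metadata across the copy. The helpers below are a hedged sketch of that logic in isolation; `resolveIndexName` and `checkedShape` are hypothetical names, and the real copyGroupContents works against h5wasm objects rather than plain values.

```typescript
// Pick the index dataset name from a group's attributes, falling back to
// the default '_index' when the attribute is missing or not a string.
function resolveIndexName(attrs: Record<string, unknown>): string {
  const value = attrs["_index"];
  return typeof value === "string" && value.length > 0 ? value : "_index";
}

// Validate that a source shape matches the flat data length before reusing it.
// Without a shape, the data can only be written as a 1D array, which is the
// behaviour the first bug produced for X.
function checkedShape(flatLength: number, shape: number[] | null | undefined): number[] {
  if (shape === null || shape === undefined) return [flatLength];
  const expected = shape.reduce((acc, dim) => acc * dim, 1);
  if (expected !== flatLength) {
    throw new Error(`shape [${shape.join(", ")}] does not match ${flatLength} elements`);
  }
  return shape.slice();
}
```

For a 3 x 2 matrix stored as 6 floats, `checkedShape(6, [3, 2])` returns `[3, 2]`, while omitting the shape yields `[6]`.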

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…suffix

toH5adBytes now accepts an extraObsm param for projections that live only in
React state (not yet in AnnDataStore). handleDownloadH5ad passes
recomputedProjections so any in-browser DruidJS run shows up as X_<algo>_recomputed
in the downloaded file's obsm group.

Also renames the suffix separator from '-' to '_' so the h5wasm key is valid
as an obsm name (X_umap_recomputed rather than X_umap-recomputed).
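
A small sketch of the naming and merge step follows. The `Projection` shape and both helper names are assumptions for illustration; only the `X_<algo>_recomputed` key pattern and the idea of an extra obsm map come from the commit message above.

```typescript
// Hypothetical projection shape: nObs rows by nDims columns, row-major.
interface Projection {
  nObs: number;
  nDims: number;
  data: Float32Array;
}

// Build the obsm key using '_' as the suffix separator.
function recomputedObsmKey(algo: string): string {
  return `X_${algo}_recomputed`;
}

// Combine projections already held by the store with ones that only live in
// UI state. Extra entries win on key collisions; every entry must have the
// expected number of rows and a buffer of matching length.
function mergeObsm(
  stored: Record<string, Projection>,
  extra: Record<string, Projection>,
  nObs: number,
): Record<string, Projection> {
  const merged: Record<string, Projection> = { ...stored, ...extra };
  for (const key of Object.keys(merged)) {
    const p = merged[key];
    if (p.nObs !== nObs || p.data.length !== p.nObs * p.nDims) {
      throw new Error(`obsm entry ${key} has inconsistent dimensions`);
    }
  }
  return merged;
}
```

Validating row counts at merge time keeps a stale projection (computed before the data changed) from being written into the export.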

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>