…sets
Add `@cbioportal-cell-explorer/cli` package that reuses the zarrstore library to query AnnData-formatted zarr datasets from the command line.
Commands:
- `info`: dataset overview (shape, columns, embeddings)
- `obs [column]`: list obs columns or show value counts
- `var [--search pattern]`: list or search gene names by symbol or ID
- `expression <gene> [--group-by col]`: expression stats per group
- `embedding [key]`: list or dump embedding coordinates
Also adds a Claude Code plugin with a `/zarr` skill that translates natural language queries into CLI commands (e.g. "what is the expression of EGFR across cell types?").
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
This PR introduces a new @cbioportal-cell-explorer/cli package to query AnnData zarr datasets over HTTP (via the existing @cbioportal-cell-explorer/zarrstore), and adds a Claude Code plugin skill (/zarr) that maps natural-language questions to CLI commands.
Changes:
- Added a Node/TypeScript CLI (cbioportal-cell-explorer) with commands for dataset info, obs/var inspection, gene expression stats, and embeddings.
- Added a Claude Code skill definition (/zarr) documenting how to run the CLI and how to map user intents to commands.
- Updated repo docs and lockfile to include the new package and dependencies.
Reviewed changes
Copilot reviewed 15 out of 16 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| skills/zarr/SKILL.md | Adds Claude Code /zarr skill instructions and command mappings |
| .claude-plugin/plugin.json | Declares the Claude plugin metadata |
| packages/cli/package.json | Defines the new CLI package, bin entry, and dependencies |
| packages/cli/tsconfig.json | Adds TS config for the CLI package |
| packages/cli/src/main.ts | CLI entrypoint and command dispatch |
| packages/cli/src/commands/*.ts | Implements info, obs, var, expression, embedding commands |
| packages/cli/src/util/*.ts | Adds argument parsing, formatting, gene resolution, and stats utilities |
| README.md | Documents CLI usage and the Claude /zarr skill |
| pnpm-lock.yaml | Locks new dependencies and workspace importer for packages/cli |
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
| }); | ||
|
|
||
| const search = values.search as string | undefined; | ||
| const limit = values.limit ? parseInt(values.limit as string, 10) : undefined; |
limit is parsed with parseInt(...) but the result isn’t validated. A non-numeric value yields NaN, while --limit 0 or a negative value passes through unchanged; in either case slice(0, cap) can return an empty or unexpectedly truncated list with no clear error. Validate limit (finite integer > 0) and fail with a helpful message when invalid.
| const limit = values.limit ? parseInt(values.limit as string, 10) : undefined; | |
| const rawLimit = values.limit as string | undefined; | |
| let limit: number | undefined; | |
| if (rawLimit !== undefined) { | |
| const parsedLimit = Number.parseInt(rawLimit, 10); | |
| if (!Number.isFinite(parsedLimit) || parsedLimit <= 0) { | |
| console.error( | |
| `Invalid --limit value "${rawLimit}". Please provide a positive integer.`, | |
| ); | |
| process.exitCode = 1; | |
| return; | |
| } | |
| limit = parsedLimit; | |
| } |
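One caveat on the suggestion above: Number.parseInt accepts trailing characters, so "5abc" still parses to 5. A stricter check could be shared by the var and embedding commands. The following is a minimal sketch, assuming a hypothetical helper named parsePositiveInt (the name, return type, and error wording are illustrative and not part of this PR):

```typescript
// Hypothetical helper, not from the PR: strict parsing for flags like --limit.
type ParseResult =
  | { ok: true; value: number | undefined }
  | { ok: false; error: string };

export function parsePositiveInt(raw: string | undefined, flag: string): ParseResult {
  // Flag omitted: report success and let the caller apply its own default.
  if (raw === undefined) return { ok: true, value: undefined };

  const text = raw.trim();
  const error = `Invalid ${flag} value "${raw}". Please provide a positive integer.`;

  // Accept plain decimal digits only, so "5abc", "1.5", "-3" and "" are rejected.
  if (!/^\d+$/.test(text)) return { ok: false, error };

  const value = Number(text);
  // Reject zero and values too large to represent exactly.
  if (!Number.isSafeInteger(value) || value <= 0) return { ok: false, error };

  return { ok: true, value };
}
```

Returning a result object instead of calling process.exit inside the helper keeps it easy to unit test; each command handler can then decide whether to print the error and set process.exitCode.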
| }); | ||
|
|
||
| const key = positionals[0]; | ||
| const limit = values.limit ? parseInt(values.limit as string, 10) : undefined; |
limit is parsed with parseInt(...) but not validated; invalid values can produce NaN, which makes display become NaN and results in no rows being printed/returned. Validate that limit is a finite positive integer and error out (or fall back to default) when it isn’t.
| const limit = values.limit ? parseInt(values.limit as string, 10) : undefined; | |
| const rawLimit = values.limit as string | undefined; | |
| let limit: number | undefined; | |
| if (rawLimit !== undefined) { | |
| const parsed = Number.parseInt(rawLimit, 10); | |
| if (!Number.isFinite(parsed) || parsed <= 0) { | |
| console.error("`--limit` must be a positive integer."); | |
| process.exit(1); | |
| } | |
| limit = parsed; | |
| } |
| export function computeGroupedStats( | ||
| values: ArrayLike<number>, | ||
| labels: ArrayLike<string | number | null>, | ||
| ): GroupStats[] { | ||
| const groups = new Map<string, number[]>(); | ||
|
|
||
| for (let i = 0; i < values.length; i++) { | ||
| const label = String(labels[i] ?? "null"); | ||
| let arr = groups.get(label); | ||
| if (!arr) { | ||
| arr = []; | ||
| groups.set(label, arr); | ||
| } | ||
| arr.push(values[i]); | ||
| } | ||
|
|
||
| const results: GroupStats[] = []; | ||
| for (const [group, vals] of groups) { | ||
| const s = computeStats(vals); | ||
| results.push({ group, ...s }); | ||
| } | ||
|
|
||
| // Sort by count descending | ||
| results.sort((a, b) => b.count - a.count); | ||
| return results; |
computeGroupedStats builds per-group number[] arrays and then computeStats copies/sorts those values again to compute medians. For large datasets (hundreds of thousands of cells) this can be slow and memory-heavy. Consider computing per-group stats in a streaming/2-pass way (count/sum/sumSq/min/max/nonzero without storing all values) and making median optional/approximate for grouped stats to keep the CLI responsive.
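As a rough illustration of the streaming approach, the sketch below keeps one small accumulator per group and never stores the raw values. It uses Welford's online update for the variance rather than raw sum/sumSq, which is a substitution made here for numerical stability. The function name computeGroupedStatsStreaming and the StreamingGroupStats type are hypothetical, and median is left out entirely rather than approximated:

```typescript
// Sketch only: per-group stats in a single pass, without a median.
interface StreamingGroupStats {
  group: string;
  count: number;
  mean: number;
  std: number; // population std (divide by n), matching computeStats in this PR
  min: number;
  max: number;
  nonzeroCount: number;
  nonzeroPct: number;
}

interface Acc {
  count: number;
  mean: number;
  m2: number; // running sum of squared deviations from the mean
  min: number;
  max: number;
  nonzero: number;
}

export function computeGroupedStatsStreaming(
  values: ArrayLike<number>,
  labels: ArrayLike<string | number | null>,
): StreamingGroupStats[] {
  const accs = new Map<string, Acc>();

  for (let i = 0; i < values.length; i++) {
    const label = String(labels[i] ?? "null");
    let a = accs.get(label);
    if (!a) {
      a = { count: 0, mean: 0, m2: 0, min: Infinity, max: -Infinity, nonzero: 0 };
      accs.set(label, a);
    }
    const v = values[i];
    // Welford's online update of mean and squared deviations.
    a.count++;
    const delta = v - a.mean;
    a.mean += delta / a.count;
    a.m2 += delta * (v - a.mean);
    if (v < a.min) a.min = v;
    if (v > a.max) a.max = v;
    if (v !== 0) a.nonzero++;
  }

  const results: StreamingGroupStats[] = [];
  for (const [group, a] of accs) {
    results.push({
      group,
      count: a.count,
      mean: a.mean,
      std: Math.sqrt(a.m2 / a.count),
      min: a.min,
      max: a.max,
      nonzeroCount: a.nonzero,
      nonzeroPct: (a.nonzero / a.count) * 100,
    });
  }

  // Sort by count descending, as the existing implementation does.
  results.sort((x, y) => y.count - x.count);
  return results;
}
```

Memory use here is proportional to the number of groups rather than the number of cells. If an exact median is still wanted, it could be computed on demand for a single group behind a flag.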
| export function computeStats(values: ArrayLike<number>): Omit<GroupStats, "group"> { | ||
| const n = values.length; | ||
| if (n === 0) { | ||
| return { count: 0, mean: 0, median: 0, std: 0, min: 0, max: 0, nonzeroCount: 0, nonzeroPct: 0 }; | ||
| } | ||
|
|
||
| let sum = 0; | ||
| let min = Infinity; | ||
| let max = -Infinity; | ||
| let nonzero = 0; | ||
|
|
||
| for (let i = 0; i < n; i++) { | ||
| const v = values[i]; | ||
| sum += v; | ||
| if (v < min) min = v; | ||
| if (v > max) max = v; | ||
| if (v !== 0) nonzero++; | ||
| } | ||
|
|
||
| const mean = sum / n; | ||
|
|
||
| let sumSqDiff = 0; | ||
| for (let i = 0; i < n; i++) { | ||
| const d = values[i] - mean; | ||
| sumSqDiff += d * d; | ||
| } | ||
| const std = Math.sqrt(sumSqDiff / n); | ||
|
|
||
| // Sort a copy for median | ||
| const sorted = Array.from(values as ArrayLike<number>).sort((a, b) => a - b); | ||
|
|
||
| return { | ||
| count: n, | ||
| mean, | ||
| median: median(sorted), | ||
| std, | ||
| min, | ||
| max, | ||
| nonzeroCount: nonzero, | ||
| nonzeroPct: (nonzero / n) * 100, | ||
| }; | ||
| } |
This new CLI package adds non-trivial logic (arg parsing, gene resolution, stats computation) but there are currently no automated tests under packages/cli. Since the repo already uses Vitest elsewhere, consider adding basic unit tests for computeStats/computeGroupedStats and resolveGeneName (including not-found and case-insensitive column matching) to prevent regressions.
| async function main() { | ||
| const opts = parseGlobalArgs(process.argv); | ||
|
|
||
| if (opts.help && !opts.command) { | ||
| console.log(USAGE); | ||
| process.exit(0); | ||
| } | ||
|
|
||
| const handler = COMMANDS[opts.command]; | ||
| if (!handler) { | ||
| if (opts.command) { | ||
| console.error(`Unknown command: ${opts.command}`); | ||
| } | ||
| console.error(USAGE); | ||
| process.exit(1); | ||
| } | ||
|
|
||
| await handler(opts); |
--help only prints the top-level usage when no subcommand is provided. When users run cbioportal-cell-explorer <command> --help, the flag is parsed but ignored, so the command executes instead of showing help. Handle opts.help when a command is present (either in main() before dispatch, or inside each command handler) and exit 0 after printing the relevant usage.
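One way to handle this in main() before dispatch is sketched below. The per-command usage strings and the resolveHelpText name are assumptions made for illustration; the real text would presumably live next to each command handler:

```typescript
// Hypothetical usage strings; wording is illustrative, not taken from the PR.
const USAGE = "Usage: cbioportal-cell-explorer <command> [options]";

const COMMAND_USAGE = new Map<string, string>([
  ["info", "Usage: cbioportal-cell-explorer info"],
  ["obs", "Usage: cbioportal-cell-explorer obs [column]"],
  ["var", "Usage: cbioportal-cell-explorer var [--search <pattern>]"],
  ["expression", "Usage: cbioportal-cell-explorer expression <gene> [--group-by <obs_column>]"],
  ["embedding", "Usage: cbioportal-cell-explorer embedding [key]"],
]);

interface HelpOpts {
  command?: string;
  help: boolean;
}

// Returns the text to print for --help, or undefined when the command should run.
export function resolveHelpText(opts: HelpOpts): string | undefined {
  if (!opts.help) return undefined;
  if (!opts.command) return USAGE;
  // Unknown commands fall back to the top-level usage text.
  return COMMAND_USAGE.get(opts.command) ?? USAGE;
}

// Possible wiring in main(), before looking up the handler:
//   const help = resolveHelpText(opts);
//   if (help !== undefined) {
//     console.log(help);
//     process.exit(0);
//   }
```

Keeping the lookup in a pure function means the help behaviour can be unit tested without spawning the CLI.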
| import { parseGlobalArgs } from "./util/args.ts"; | ||
| import { info } from "./commands/info.ts"; | ||
| import { obs } from "./commands/obs.ts"; | ||
| import { varCmd } from "./commands/var.ts"; | ||
| import { expression } from "./commands/expression.ts"; | ||
| import { embedding } from "./commands/embedding.ts"; |
Relative imports include the .ts extension (e.g. ./util/args.ts), but the rest of the repo uses extensionless TS imports. Aligning with the existing convention avoids inconsistent module specifiers and reduces friction if any code is later bundled/compiled with different resolution rules.
| import { parseGlobalArgs } from "./util/args.ts"; | |
| import { info } from "./commands/info.ts"; | |
| import { obs } from "./commands/obs.ts"; | |
| import { varCmd } from "./commands/var.ts"; | |
| import { expression } from "./commands/expression.ts"; | |
| import { embedding } from "./commands/embedding.ts"; | |
| import { parseGlobalArgs } from "./util/args"; | |
| import { info } from "./commands/info"; | |
| import { obs } from "./commands/obs"; | |
| import { varCmd } from "./commands/var"; | |
| import { expression } from "./commands/expression"; | |
| import { embedding } from "./commands/embedding"; |
|
|
||
| const geneInput = positionals[0]; | ||
| if (!geneInput) { | ||
| console.error("Usage: cell-explorer expression <gene> [--group-by <obs_column>]"); |
The usage error message refers to cell-explorer instead of the actual binary name cbioportal-cell-explorer, which is confusing and inconsistent with the README/USAGE text. Update the message to match the real command name.
| console.error("Usage: cell-explorer expression <gene> [--group-by <obs_column>]"); | |
| console.error("Usage: cbioportal-cell-explorer expression <gene> [--group-by <obs_column>]"); |
| 2. **Via npx** (if the CLI is published to npm): | ||
| ```bash | ||
| npx @cbioportal-cell-explorer/cli <command> [options] | ||
| ``` |
The SKILL.md suggests trying npx @cbioportal-cell-explorer/cli ... as the second option, but the current CLI entrypoint requires the tsx runtime. Unless the published package ships compiled JS or includes tsx as a runtime dependency, this npx invocation will fail. Update the instructions to either use the tsx fallback directly or note the packaging requirement.
| "bin": { | ||
| "cbioportal-cell-explorer": "./src/main.ts" | ||
| }, | ||
| "exports": "./src/main.ts", | ||
| "scripts": { | ||
| "typecheck": "tsc --noEmit" | ||
| }, | ||
| "dependencies": { | ||
| "@cbioportal-cell-explorer/zarrstore": "workspace:*", | ||
| "@petamoriken/float16": "^3.9.3" | ||
| }, | ||
| "devDependencies": { | ||
| "@types/node": "^25.3.3", | ||
| "tsx": "^4.21.0", | ||
| "typescript": "^5.9.3" | ||
| } |
The bin points directly at a TypeScript file (./src/main.ts) with a #!/usr/bin/env tsx shebang. When installed via npx @cbioportal-cell-explorer/cli, tsx won’t reliably be on the user’s PATH (devDependencies aren’t installed, and even a nested dependency’s bin typically isn’t exposed). To make npx/global installs work, publish a JS entrypoint (e.g. build to dist/ and point bin/exports there) or use a small JS shim that invokes an internal TS runner in a way that doesn’t depend on tsx being globally available.
Summary
- `@cbioportal-cell-explorer/cli` package: a Node CLI tool for querying AnnData zarr datasets over HTTP, reusing the existing `@cbioportal-cell-explorer/zarrstore` library
- `/zarr` slash skill for natural language queries against zarr datasets
CLI Commands
- `info`
- `obs [column]`
- `var [--search pattern]`
- `expression <gene> [--group-by col]`
- `embedding [key]`
All commands support `--url <zarr_url>` to override the default dataset and `--json` for machine-readable output. Gene symbols (e.g. `EGFR`) are resolved automatically via the `feature_name` var column.
Claude Code Skill
The `/zarr` skill translates natural language into CLI commands.
Usage: `claude --plugin-dir .` from the repo root.
Test plan
- `pnpm install` succeeds
- `pnpm exec cbioportal-cell-explorer info` returns dataset overview
- `pnpm exec cbioportal-cell-explorer obs cell_type` shows value counts
- `pnpm exec cbioportal-cell-explorer var --search EGFR` finds EGFR
- `pnpm exec cbioportal-cell-explorer expression EGFR --group-by cell_type` shows grouped stats
- `pnpm exec cbioportal-cell-explorer embedding X_umap50 --limit 5` dumps coordinates
- `--json` flag produces valid JSON for all commands
- `claude --plugin-dir .` discovers the `/zarr` skill
🤖 Generated with Claude Code