IGVF-DACC · ottojolanki · Mar 20, 2026 · Mar 23, 2026 · Apr 2, 2026 · Apr 2, 2026
diff --git a/clickhouserewrite/README.md b/clickhouserewrite/README.md
@@ -0,0 +1,30 @@
+# ClickHouse rewrite
+
+This folder tracks the prototype migration of the IGVF Catalog API from ArangoDB to ClickHouse. The goal is to evaluate ClickHouse as a backend for the catalog's node and edge queries, focusing on human variants data.
+
+## At a glance
+
+- **ClickHouse server**: EC2 instance at `35.85.61.200:8123` (HTTP interface)
+- **Data source**: S3 bucket `s3://igvf-catalog-parsed-collections/` containing JSONL exports from ArangoDB
+- **Node.js driver**: `@clickhouse/client`
+- **Schema source of truth**: `data/db/generated_schemas/*.sql` — one `CREATE TABLE` file per collection
+
+For the full status of every API endpoint (port state, backing router file, OpenAPI excerpt) see [`endpoints/README.md`](endpoints/README.md). Re-run [`scripts/generate_endpoint_docs.py`](../scripts/generate_endpoint_docs.py) to refresh it after porting.
+
+## Navigation
+
+- [`infrastructure.md`](infrastructure.md) — `src/database.ts`, `src/env.ts`, `config/development.json`
+- [`data-loading.md`](data-loading.md) — `data/db/generate_import.py` + `clickhouse_import.yaml`
+- [`collections.md`](collections.md) — ClickHouse table inventory, projections, materialized views
+- [`design-decisions/`](design-decisions/) — patterns and rationale (parameterized queries, two-step enrichment, lean projections, region pushdown, …)
+- [`routers/`](routers/) — per-router architecture notes
+- [`endpoints/`](endpoints/) — one file per OpenAPI endpoint, with status and spec excerpt
+- [`testing.md`](testing.md) — endpoint test results and latency observations
+- [`limitations.md`](limitations.md) — known limitations
+- [`conventions.md`](conventions.md) — porting conventions for new routers
+
+## Read this if you are…
+
+- **…porting a new router**: read [`conventions.md`](conventions.md) and skim the relevant pages under [`design-decisions/`](design-decisions/). Cross-check the target endpoint's stub under [`endpoints/`](endpoints/).
+- **…debugging a specific endpoint**: open its file under [`endpoints/`](endpoints/) — the OpenAPI excerpt and status line tell you most of what you need.
+- **…doing capacity / latency work**: start with [`testing.md`](testing.md) and the per-design-decision latency tables.
diff --git a/clickhouserewrite/collections.md b/clickhouserewrite/collections.md
@@ -0,0 +1,15 @@
+# Collections loaded into ClickHouse
+
+| Table | Row count (approx) or load status | Needed by | Notes |
+|---|---|---|---|
+| `variants` | ~1.2 billion | `/variants`, `/variants/phenotypes` (verbose), `/phenotypes/variants` (verbose) | Human variants only (FAVOR + IGVF). Primary key is `id` (SPDI-like identifier). Has lean projections on `spdi`, `ca_id`, `hgvs` for fast two-step lookups. See [design-decisions/06-lean-projections.md](design-decisions/06-lean-projections.md). |
+| `rsid_to_variant` | — | `/variants?rsid=...` | Auto-updating lookup table (materialized view). Unnests `Array(String)` rsid column into `(rsid, variant_id)` pairs sorted by `rsid`. See [design-decisions/07-rsid-materialized-view.md](design-decisions/07-rsid-materialized-view.md). |
+| `variants_phenotypes` | Loaded | `/phenotypes/variants`, `/variants/phenotypes` | Edge table linking variants to ontology terms. FK columns: `variants_id`, `ontology_terms_id`. |
+| `ontology_terms` | Loaded | `/phenotypes/variants` (phenotype name resolution) | Joined to resolve phenotype names from IDs. |
+| `studies` | Loaded | `/phenotypes/variants` (verbose GWAS), `/variants/phenotypes` (verbose GWAS) | GWAS study metadata, joined in verbose mode. |
+| `variants_phenotypes_studies` | Loaded | `/phenotypes/variants` (GWAS path), `/variants/phenotypes` (GWAS path) | Hyperedge table connecting variant-phenotype pairs to studies. Contains GWAS statistics (`log10pvalue`, `beta`, `p_val`, etc.). |
+| `motifs` | Loaded | Not yet used by ported endpoints | Loaded as part of the import tooling validation. |
+| `coding_variants` | ~1.56B rows | `/genes/coding-variants/scores`, `/genes/coding-variants/all-scores` | Protein-level coding variant records. `id` = `{gene_name}_{transcript}_{hgvsp}_{hgvsc}` — gene name is the id prefix, enabling implicit PK clustering per gene. |
+| `coding_variants_phenotypes` | ~1.1B rows | `/genes/coding-variants/scores`, `/genes/coding-variants/all-scores` | Edge table from coding variants to ontology terms (phenotypes). `id` = `{coding_variants_id}_{ontology_term}_{fileset}`. The `variants` column stores the linked genomic variant FK for assay types where the phenotype is tied to a specific nucleotide change (SGE). Has `proj_by_cv_id` lean projection `(SELECT coding_variants_id, id ORDER BY coding_variants_id)` for efficient two-step lookup by `coding_variants_id`. See [design-decisions/gcv-02-cvp-two-step.md](design-decisions/gcv-02-cvp-two-step.md). |
+| `variants_coding_variants` | ~1.56B rows | `/genes/coding-variants/scores` | Edge table from genomic variants to coding variants. `id` = `{variants_id}_{coding_variants_id}`. Has `proj_by_cv_id` lean projection `(SELECT coding_variants_id, variants_id ORDER BY coding_variants_id)` which fully satisfies the Step D query by primary key. See [design-decisions/gcv-04-vcv-projection.md](design-decisions/gcv-04-vcv-projection.md). |
+| `variants_variants` | ~12B rows (post-symmetrization; ~6B unique LD pairs) | `/variants/variant-ld` | TopLD linkage-disequilibrium pairs across AFR/EAS/EUR/SAS ancestries. Each LD pair stored twice (once per direction) so that `WHERE variants_1_id = ?` is a single-column equality on the PK prefix. `ORDER BY (variants_1_id, ancestry, variants_2_id)`, `r2`/`d_prime` are Float32, `id` column dropped. Largest table in the catalog. See [design-decisions/09-symmetrize-edge-tables.md](design-decisions/09-symmetrize-edge-tables.md). |
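The symmetrization described for `variants_variants` in `collections.md` can be illustrated in memory. This is only a sketch: the diff does not show how rows are duplicated at load time, and the `LdPair` interface and `symmetrize` function are assumptions built from the column names in the table.

```typescript
// Hypothetical row shape based on the columns named in collections.md.
interface LdPair {
  variants_1_id: string
  variants_2_id: string
  ancestry: string
  r2: number
  d_prime: number
}

// Emit every LD pair twice, once per direction, so that a lookup by either
// variant becomes an equality filter on variants_1_id.
function symmetrize(pairs: LdPair[]): LdPair[] {
  const out: LdPair[] = []
  for (const pair of pairs) {
    out.push(pair)
    out.push({ ...pair, variants_1_id: pair.variants_2_id, variants_2_id: pair.variants_1_id })
  }
  return out
}

// Placeholder IDs for illustration only.
const ldRows = symmetrize([
  { variants_1_id: 'vA', variants_2_id: 'vB', ancestry: 'EUR', r2: 0.9, d_prime: 1 }
])
const partnersOfB = ldRows.filter((row) => row.variants_1_id === 'vB').map((row) => row.variants_2_id)
```

Doubling the rows is what makes the row count twice the number of unique pairs, in exchange for never needing an `OR variants_2_id = ?` branch in the query.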
diff --git a/clickhouserewrite/conventions.md b/clickhouserewrite/conventions.md
@@ -0,0 +1,17 @@
+# Porting conventions
+
+All other routers in `src/routers/datatypeRouters/edges/` and `src/routers/datatypeRouters/nodes/` still use AQL. The pattern established by `variants.ts` and `variants_phenotypes.ts` can be followed for each:
+
+1. Replace AQL with parameterized ClickHouse SQL — see [design-decisions/01-parameterized-queries.md](design-decisions/01-parameterized-queries.md)
+2. Use `chQuery()` with `query_params` for all user input
+3. Use two-step enrichment for verbose mode JOINs against large tables — see [design-decisions/02-two-step-enrichment.md](design-decisions/02-two-step-enrichment.md)
+4. Use three-query pagination for endpoints that merge results from multiple sources — see [design-decisions/03-three-query-pagination.md](design-decisions/03-three-query-pagination.md)
+5. Use lean projections + two-step ID resolution for high-cardinality string lookups on large tables — see [design-decisions/06-lean-projections.md](design-decisions/06-lean-projections.md)
+6. Use materialized view lookup tables for array column lookups (e.g. `rsid`) — see [design-decisions/07-rsid-materialized-view.md](design-decisions/07-rsid-materialized-view.md)
+7. Push regions down as subqueries on the `variants` table for edge endpoints — see [design-decisions/08-region-queries.md](design-decisions/08-region-queries.md)
+
+## Important: always use optimized variant lookups
+
+Any endpoint that resolves variant identifiers (`spdi`, `hgvs`, `ca_id`, `rsid`) to variant IDs **must** use the optimized lookup paths: lean projections for `spdi`/`hgvs`/`ca_id` and the `rsid_to_variant` materialized view for `rsid`. Never filter the `variants` table directly with `WHERE spdi = ...`, `WHERE hgvs = ...`, or `WHERE has(rsid, ...)` in a query that selects full rows, or in a subquery that the lean projection cannot serve.
+
+This applies to both direct variant queries (`/variants`) and any edge endpoint that accepts variant identifiers as input (e.g. `/variants/phenotypes`, `/variants/genes`, etc.). The `variantIDSearch()` and `findVariantIDByRSID()` functions in `variants.ts` already use the optimized paths and should be reused by all edge routers that resolve variant identifiers. Bypassing these functions with direct `has(rsid, ...)` or unaliased `WHERE spdi = ...` queries against the 1.2B-row `variants` table will result in 60s+ timeouts.
diff --git a/clickhouserewrite/data-loading.md b/clickhouserewrite/data-loading.md
@@ -0,0 +1,8 @@
+# Data loading tooling
+
+| File | Purpose |
+|---|---|
+| `data/db/generate_import.py` | Python script that generates ClickHouse `INSERT INTO ... SELECT ... FROM s3(...)` YAML statements from a `.sql` schema file. Handles PK→`_key`, FK→`_from`/`_to` transforms, backtick-quoting for special column names, and uses actual SQL types for S3 schema fields. |
+| `data/db/schema/clickhouse_import.yaml` | Collection of `INSERT` statements for all tables, aligned to the generated schemas. |
+
+See [`collections.md`](collections.md) for the resulting table inventory.
diff --git a/clickhouserewrite/design-decisions/01-parameterized-queries.md b/clickhouserewrite/design-decisions/01-parameterized-queries.md
@@ -0,0 +1,18 @@
+# Parameterized queries for SQL injection prevention
+
+The `variants.ts` prototype used an `esc()` helper with string interpolation to build SQL. For `variants_phenotypes.ts`, we switched to ClickHouse's native parameterized queries using the `{name:Type}` syntax with `query_params`:
+
+```typescript
+async function chQuery<T = any>(sql: string, params?: QueryParams): Promise<T[]> {
+  const resultSet = await db.query({
+    query: sql,
+    query_params: params,
+    format: 'JSONEachRow'
+  })
+  return await resultSet.json()
+}
+```
+
+All user-supplied values (phenotype IDs, method, class, label, fileset names) go through `query_params` and are never interpolated into the SQL string. This eliminates SQL injection by design.
+
+The one exception is the `sqlInList()` helper used for arrays of internal IDs (from `variantIDSearch()` or the VP page query). These IDs originate from our own database queries, not from user input, and are single-quote escaped as a safety measure.
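The `sqlInList()` helper mentioned in `01-parameterized-queries.md` is not shown in this diff. A minimal sketch of what such a helper could look like follows. The name `sqlInListSketch` and the exact escaping rules (backslash-escaping backslashes and single quotes, which ClickHouse accepts inside single-quoted literals) are assumptions, not the PR's code.

```typescript
// Hypothetical stand-in for sqlInList(); escaping rules are an assumption.
function sqlInListSketch(ids: string[]): string {
  return ids
    .map((id) => `'${id.replace(/\\/g, '\\\\').replace(/'/g, "\\'")}'`)
    .join(', ')
}

// Internal IDs are inlined; user-supplied values still go through query_params.
const internalIds = ['NC_000001.11:91420:T:C', "odd'id"]
const sql = `SELECT id FROM variants WHERE id IN (${sqlInListSketch(internalIds)}) AND chr = {chr:String}`
const params: Record<string, unknown> = { chr: 'chr1' }
```

The resulting `sql` string and `params` object would then be passed to `chQuery(sql, params)`.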
diff --git a/clickhouserewrite/design-decisions/02-two-step-enrichment.md b/clickhouserewrite/design-decisions/02-two-step-enrichment.md
@@ -0,0 +1,20 @@
+# Two-step enrichment instead of JOINs for verbose mode
+
+The initial implementation joined the `variants` table directly in the query:
+
+```sql
+SELECT ... FROM variants_phenotypes vp
+LEFT JOIN variants v ON v.id = vp.variants_id  -- 1.2B rows!
+WHERE ...
+```
+
+This caused ClickHouse to build a hash table from the entire `variants` table (1.2B rows) in memory, resulting in a 60-second timeout. ClickHouse's hash JOIN loads the right-side table entirely into memory before evaluating the join condition.
+
+The fix is a two-step approach (a lean base query, then batched enrichment by primary key), carried out as follows:
+
+1. Run the base query (no variant/study JOINs) to get the page of results
+2. Collect unique `variants_id` and `studies_id` values from results
+3. Batch-fetch details using primary key lookups: `SELECT ... FROM variants WHERE id IN ('id1', 'id2', ...)`
+4. Merge in TypeScript using `Map<string, any>`
+
+Since the page size is at most 100 rows, the `IN` clause in step 3 contains at most 100 IDs, and the primary key lookup is fast. This brought verbose mode from a 60s timeout to ~600ms for the phenotypes endpoint and ~3.6s for the variants endpoint (the extra time comes from `variantIDSearch`, which must scan the variants table for the initial ID resolution).
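Steps 2 and 4 from `02-two-step-enrichment.md` are plain TypeScript and can be sketched without a database. The row shapes and function names below are assumptions for illustration; `details` stands in for whatever the step 3 batch query returns.

```typescript
// Hypothetical row shapes; the real columns come from the base query.
interface PageRow { id: string; variants_id: string }
interface VariantDetail { id: string; spdi: string }

// Step 2: collect the unique variant IDs referenced by the current page.
function uniqueVariantIds(rows: PageRow[]): string[] {
  return Array.from(new Set(rows.map((row) => row.variants_id)))
}

// Step 4: attach batch-fetched details to each row, keeping page order.
// `details` stands in for the result of step 3:
//   SELECT ... FROM variants WHERE id IN ('id1', 'id2', ...)
function attachVariants(
  rows: PageRow[],
  details: VariantDetail[]
): Array<PageRow & { variant: VariantDetail | null }> {
  const byId = new Map<string, VariantDetail>()
  for (const detail of details) byId.set(detail.id, detail)
  return rows.map((row) => ({ ...row, variant: byId.get(row.variants_id) ?? null }))
}

// Placeholder IDs for illustration only.
const page: PageRow[] = [
  { id: 'vp1', variants_id: 'v1' },
  { id: 'vp2', variants_id: 'v2' },
  { id: 'vp3', variants_id: 'v1' }
]
const details: VariantDetail[] = [
  { id: 'v2', spdi: 'spdi-2' },
  { id: 'v1', spdi: 'spdi-1' }
]
```

Because the merge is keyed by ID, the order in which the batch query returns rows does not matter, and a variant shared by several page rows is fetched only once.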
diff --git a/clickhouserewrite/design-decisions/03-three-query-pagination.md b/clickhouserewrite/design-decisions/03-three-query-pagination.md
@@ -0,0 +1,17 @@
+# Three-query pagination for combined IGVF + GWAS results
+
+When no `source` filter is specified, both IGVF and GWAS records need to be returned with correct pagination. IGVF and GWAS results have different schemas (IGVF has `score`; GWAS has `log10pvalue`, `beta`, study references, etc.), making a `UNION ALL` cumbersome.
+
+The approach:
+
+1. **Page query**: Fetch the VP record IDs for the current page, including their `source`:
+   ```sql
+   SELECT vp.id, vp.source FROM variants_phenotypes vp
+   WHERE ... ORDER BY vp.id LIMIT {lim} OFFSET {off}
+   ```
+2. **Detail queries**: Split the IDs by source and run IGVF and GWAS detail queries in parallel
+3. **Merge**: Reconstruct results in the original page order using a `Map`
+
+This gives correct global pagination across both sources — the VP table is the source of ordering, and each VP record produces either an IGVF or GWAS result.
+
+When `source` is explicitly specified (the common case for API consumers), this simplifies to a single query.
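The split and merge steps from `03-three-query-pagination.md` can be sketched as two pure functions. How a `source` value maps to the IGVF or GWAS path is not stated in this diff, so the sketch takes that decision as a caller-supplied predicate; all names here are illustrative.

```typescript
interface PageEntry { id: string; source: string }

// Step 2 input: split the page's VP IDs into the two detail-query buckets.
function splitBySource(
  page: PageEntry[],
  isIgvf: (source: string) => boolean
): { igvf: string[]; gwas: string[] } {
  const igvf: string[] = []
  const gwas: string[] = []
  for (const entry of page) {
    if (isIgvf(entry.source)) igvf.push(entry.id)
    else gwas.push(entry.id)
  }
  return { igvf, gwas }
}

// Step 3: rebuild the page in its original order from both detail result sets.
function mergeInPageOrder<T extends { id: string }>(page: PageEntry[], igvf: T[], gwas: T[]): T[] {
  const byId = new Map<string, T>()
  for (const detail of igvf) byId.set(detail.id, detail)
  for (const detail of gwas) byId.set(detail.id, detail)
  const out: T[] = []
  for (const entry of page) {
    const detail = byId.get(entry.id)
    if (detail !== undefined) out.push(detail)
  }
  return out
}

// Placeholder data; 'IGVF' and 'GWAS' as source values are assumptions.
type Detail = { id: string; score?: number; log10pvalue?: number }
const vpPage: PageEntry[] = [
  { id: 'vp1', source: 'IGVF' },
  { id: 'vp2', source: 'GWAS' },
  { id: 'vp3', source: 'IGVF' }
]
const buckets = splitBySource(vpPage, (source) => source === 'IGVF')
const igvfDetails: Detail[] = [{ id: 'vp3', score: 0.2 }, { id: 'vp1', score: 0.7 }]
const gwasDetails: Detail[] = [{ id: 'vp2', log10pvalue: 8.1 }]
const ordered = mergeInPageOrder(vpPage, igvfDetails, gwasDetails)
```

The two detail queries can return rows in any order, since the page query alone determines the final ordering.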
diff --git a/clickhouserewrite/design-decisions/04-range-filter-parsing.md b/clickhouserewrite/design-decisions/04-range-filter-parsing.md
@@ -0,0 +1,8 @@
+# `parseRangeFilter` for log10pvalue
+
+The original AQL code used `getFilterStatements()` to parse range expressions. The ClickHouse version has a standalone `parseRangeFilter()` that supports `gte:5`, `lt:10`, `range:5-10`, and bare numbers. It returns a structured object used by `pvalueCondition()` to build parameterized SQL conditions:
+
+```typescript
+// Input: "gte:5" → Output: vps.log10pvalue >= {_pval:Float64} with params { _pval: 5 }
+// Input: "range:5-10" → Output: vps.log10pvalue >= {_pval_lo:Float64} AND ... < {_pval_hi:Float64}
+```
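A possible shape for the parser described in `04-range-filter-parsing.md` is sketched below. It is not the PR's `parseRangeFilter()`: the return type, the treatment of a bare number as an equality test, and the full text of the range condition are assumptions made so that the example is complete.

```typescript
type RangeFilter =
  | { kind: 'gte' | 'lt' | 'eq'; value: number }
  | { kind: 'range'; lo: number; hi: number }

const NUM = '-?\\d+(?:\\.\\d+)?'

// Accepts "gte:5", "lt:10", "range:5-10", or a bare number.
function parseRangeFilterSketch(input: string): RangeFilter {
  const range = new RegExp(`^range:(${NUM})-(${NUM})$`).exec(input)
  if (range) return { kind: 'range', lo: Number(range[1]), hi: Number(range[2]) }
  const op = new RegExp(`^(gte|lt):(${NUM})$`).exec(input)
  if (op) return { kind: op[1] as 'gte' | 'lt', value: Number(op[2]) }
  // Assumption: a bare number means an exact match.
  if (new RegExp(`^${NUM}$`).test(input)) return { kind: 'eq', value: Number(input) }
  throw new Error(`Unsupported range expression: ${input}`)
}

// Turns the structured filter into a parameterized condition plus query_params.
function pvalueConditionSketch(filter: RangeFilter): { sql: string; params: Record<string, number> } {
  const col = 'vps.log10pvalue'
  if (filter.kind === 'range') {
    return {
      sql: `${col} >= {_pval_lo:Float64} AND ${col} < {_pval_hi:Float64}`,
      params: { _pval_lo: filter.lo, _pval_hi: filter.hi }
    }
  }
  const operator = filter.kind === 'gte' ? '>=' : filter.kind === 'lt' ? '<' : '='
  return { sql: `${col} ${operator} {_pval:Float64}`, params: { _pval: filter.value } }
}

const gteCondition = pvalueConditionSketch(parseRangeFilterSketch('gte:5'))
const rangeCondition = pvalueConditionSketch(parseRangeFilterSketch('range:5-10'))
```

Only the numeric value ever reaches `query_params`; the operator comes from a fixed set chosen by the parser, so nothing from the request is spliced into the SQL text.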
diff --git a/clickhouserewrite/design-decisions/05-id-prefix-preservation.md b/clickhouserewrite/design-decisions/05-id-prefix-preservation.md
@@ -0,0 +1,8 @@
+# ID prefix preservation for API compatibility
+
+In ArangoDB, `record._from` returns values like `variants/NC_000001.11:91420:T:C`. In ClickHouse, the FK column `variants_id` stores just `NC_000001.11:91420:T:C`. API consumers may depend on the collection prefix, so non-verbose mode prepends it:
+
+```typescript
+variant: `variants/${row.variants_id}`
+study: `studies/${row.studies_id}`
+```
diff --git a/clickhouserewrite/design-decisions/06-lean-projections.md b/clickhouserewrite/design-decisions/06-lean-projections.md
@@ -0,0 +1,22 @@
+# Lean projections and the two-step query strategy
+
+High-cardinality string lookups (`spdi`, `hgvs`, `ca_id`) on the 1.2 billion-row `variants` table are inherently expensive because the primary key order (`id`) does not help these filters, and the `annotations` JSON column is large (~2KB+ per row). Early approaches used bloom_filter data-skipping indexes, which reduced query times from ~55s to ~7s, and then "fat" projections (containing all columns, re-sorted by the lookup column), which achieved ~12s cold cache / ~120ms warm cache. The fat projections were slow on cold cache because each granule included the bulky `annotations` column, meaning a single lookup read ~16MB from disk.
+
+**Lean projections** solve this by storing only the lookup column and the primary key:
+
+```sql
+ALTER TABLE variants ADD PROJECTION proj_spdi_lean (SELECT id, spdi ORDER BY spdi);
+ALTER TABLE variants ADD PROJECTION proj_ca_id_lean (SELECT id, ca_id ORDER BY ca_id);
+ALTER TABLE variants ADD PROJECTION proj_hgvs_lean (SELECT id, hgvs ORDER BY hgvs);
+```
+
+A lean projection granule is ~200-300KB compressed (vs ~16MB for a fat projection), making cold-cache reads near-instant (~15ms).
+
+**Two-step query strategy** in `variants.ts`:
+
+1. **Step 1 — Resolve IDs**: `SELECT id FROM variants WHERE spdi = {v:String}` uses the lean projection. Only reads the two-column projection data (~300KB). Returns the matching primary key(s) in ~15ms.
+2. **Step 2 — Fetch full rows**: `SELECT ... FROM variants v WHERE v.id = {id:String}` uses the primary key. Reads only the granule(s) containing the matching row, including `annotations`.
+
+This is implemented in `resolveViaLeanProjection()` which intercepts `spdi`, `ca_id`, and `hgvs` filters before the main query, resolves them to primary key IDs, and rewrites the WHERE clause to use `v.id = ...` or `v.id IN (...)`.
+
+For latency numbers, see [`testing.md`](../testing.md).
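The WHERE-clause rewrite described in `06-lean-projections.md` can be sketched as a small pure function. This is not the PR's `resolveViaLeanProjection()`: the function name, the `_idN` parameter naming, and the constant-false predicate for an empty result are assumptions.

```typescript
// Step 1 query text as quoted in the design note.
const RESOLVE_BY_SPDI = 'SELECT id FROM variants WHERE spdi = {v:String}'

// Build the step 2 predicate from the primary keys that step 1 returned.
function idPredicateSketch(ids: string[]): { sql: string; params: Record<string, string> } {
  const params: Record<string, string> = {}
  ids.forEach((id, i) => {
    params[`_id${i}`] = id
  })
  // Assumption: no resolved IDs means the main query must match nothing.
  if (ids.length === 0) return { sql: '1 = 0', params }
  if (ids.length === 1) return { sql: 'v.id = {_id0:String}', params }
  const placeholders = ids.map((_, i) => `{_id${i}:String}`)
  return { sql: `v.id IN (${placeholders.join(', ')})`, params }
}

// Placeholder IDs standing in for the rows returned by RESOLVE_BY_SPDI.
const single = idPredicateSketch(['v1'])
const multiple = idPredicateSketch(['v1', 'v2'])
```

Keeping the resolved IDs in `query_params` is one option; the conventions elsewhere in this PR also allow inlining internal IDs with `sqlInList()`.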
diff --git a/clickhouserewrite/design-decisions/07-rsid-materialized-view.md b/clickhouserewrite/design-decisions/07-rsid-materialized-view.md
@@ -0,0 +1,20 @@
+# Materialized view lookup table for `rsid`
+
+The `rsid` column is `Array(String)` — a variant can have multiple rsids, and an rsid can map to multiple variants. Projections cannot unnest arrays, so a **materialized view** provides an auto-updating lookup table:
+
+```sql
+CREATE TABLE rsid_to_variant (
+    rsid String,
+    variant_id String
+) ENGINE = MergeTree() ORDER BY rsid;
+
+CREATE MATERIALIZED VIEW mv_rsid_to_variant TO rsid_to_variant AS
+SELECT r AS rsid, id AS variant_id
+FROM variants ARRAY JOIN rsid AS r;
+```
+
+The `ARRAY JOIN` unpacks each rsid array element into a separate row. The materialized view auto-fires on every `INSERT INTO variants`, so the lookup table stays in sync without manual maintenance. Existing data is backfilled once with `INSERT INTO rsid_to_variant SELECT r, id FROM variants ARRAY JOIN rsid AS r`.
+
+The lookup table is sorted by `rsid`, so queries are a binary search. The two-step strategy applies: `resolveViaLeanProjection()` also handles rsid by querying `rsid_to_variant` first, then fetching full variant rows by primary key.
+
+**Performance**: rsid lookups went from 60s+ timeout to ~350-400ms cold cache, ~150ms warm cache.
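For readers unfamiliar with `ARRAY JOIN`, the rows that `mv_rsid_to_variant` writes (see `07-rsid-materialized-view.md`) correspond roughly to the in-memory transformation below. It is an illustration of the unnesting only, with made-up IDs and hypothetical type names, and says nothing about how ClickHouse executes it.

```typescript
interface VariantRow { id: string; rsid: string[] }
interface RsidPair { rsid: string; variant_id: string }

// One output row per array element. Variants with an empty rsid array
// produce no rows, as with a plain (non-LEFT) ARRAY JOIN.
function unnestRsids(variants: VariantRow[]): RsidPair[] {
  const pairs: RsidPair[] = []
  for (const variant of variants) {
    for (const rsid of variant.rsid) {
      pairs.push({ rsid, variant_id: variant.id })
    }
  }
  // Mirror ORDER BY rsid on the lookup table.
  return pairs.sort((a, b) => (a.rsid < b.rsid ? -1 : a.rsid > b.rsid ? 1 : 0))
}

// Placeholder IDs for illustration only.
const lookup = unnestRsids([
  { id: 'v1', rsid: ['rs2', 'rs1'] },
  { id: 'v2', rsid: ['rs1'] },
  { id: 'v3', rsid: [] }
])
const hits = lookup.filter((pair) => pair.rsid === 'rs1').map((pair) => pair.variant_id)
```

The example covers both directions of the many-to-many relationship: `v1` carries two rsids, and `rs1` resolves to two variants.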
diff --git a/clickhouserewrite/design-decisions/08-region-queries.md b/clickhouserewrite/design-decisions/08-region-queries.md
@@ -0,0 +1,16 @@
+# Region queries — no additional indexing needed, pushed down as subquery for edges
+
+Region queries filter by `chr` and `pos`. The `variants` table primary key is `id` (SPDI format, e.g. `NC_000015.10:32709532:T:A`), which starts with the chromosome reference sequence. This means the MergeTree sort order naturally clusters rows by chromosome, and ClickHouse can skip most granules using the primary key index alone.
+
+For **edge endpoints** that accept `region` as input (e.g. `/variants/phenotypes?region=...`), the variant IDs are NOT materialized in TypeScript. Even a 1kb region can yield thousands of multi-allelic variant IDs, and serializing them via `sqlInList` blows past ClickHouse's `max_query_size` (262144 bytes default). Instead, the region is pushed down as a subquery on the variants table:
+
+```sql
+WHERE vp.variants_id IN (
+  SELECT id FROM variants WHERE chr = {_vp_chr:String}
+    AND pos >= {_vp_pos_start:Float64} AND pos < {_vp_pos_end:Float64}
+)
+```
+
+ClickHouse evaluates the inner SELECT efficiently using primary-key granule pruning (variants is sorted by SPDI `id`, which starts with the chromosome RefSeq accession). This pattern should be reused by every edge router that accepts `region` — never round-trip a region's variant IDs through `sqlInList`.
+
+For latency numbers, see [`testing.md`](../testing.md).
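The `max_query_size` argument in `08-region-queries.md` can be checked with back-of-the-envelope arithmetic. The sketch below assumes ASCII IDs inlined as `'id', 'id', ...` and ignores the rest of the query text, so the real ceiling is somewhat lower; the function names are illustrative.

```typescript
// ClickHouse default quoted in the design note.
const MAX_QUERY_SIZE = 262144

// Bytes needed to inline `count` IDs of `idLength` ASCII characters:
// two quotes per ID plus a ", " separator between neighbours.
function inListBytes(count: number, idLength: number): number {
  if (count === 0) return 0
  return count * (idLength + 2) + (count - 1) * 2
}

// Largest count whose inlined list still fits in the byte budget.
function maxInlineIds(idLength: number, budget: number = MAX_QUERY_SIZE): number {
  return Math.floor((budget + 2) / (idLength + 4))
}

const exampleIdLength = 'NC_000015.10:32709532:T:A'.length
const ceiling = maxInlineIds(exampleIdLength)
```

For the 25-character example ID this gives a ceiling of 9,039 inlined IDs, and longer indel alleles lower it further. The subquery form keeps the query text at a fixed size regardless of how many variants fall in the region.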