30 changes: 30 additions & 0 deletions clickhouserewrite/README.md
@@ -0,0 +1,30 @@
# ClickHouse rewrite

This folder tracks the prototype migration of the IGVF Catalog API from ArangoDB to ClickHouse. The goal is to evaluate ClickHouse as a backend for the catalog's node and edge queries, focusing on human variants data.

## At a glance

- **ClickHouse server**: EC2 instance at `35.85.61.200:8123` (HTTP interface)
- **Data source**: S3 bucket `s3://igvf-catalog-parsed-collections/` containing JSONL exports from ArangoDB
- **Node.js driver**: `@clickhouse/client`
- **Schema source of truth**: `data/db/generated_schemas/*.sql` — one `CREATE TABLE` file per collection

For the full status of every API endpoint (port state, backing router file, OpenAPI excerpt) see [`endpoints/README.md`](endpoints/README.md). Re-run [`scripts/generate_endpoint_docs.py`](../scripts/generate_endpoint_docs.py) to refresh it after porting.

## Navigation

- [`infrastructure.md`](infrastructure.md) — `src/database.ts`, `src/env.ts`, `config/development.json`
- [`data-loading.md`](data-loading.md) — `data/db/generate_import.py` + `clickhouse_import.yaml`
- [`collections.md`](collections.md) — ClickHouse table inventory, projections, materialized views
- [`design-decisions/`](design-decisions/) — patterns and rationale (parameterized queries, two-step enrichment, lean projections, region pushdown, …)
- [`routers/`](routers/) — per-router architecture notes
- [`endpoints/`](endpoints/) — one file per OpenAPI endpoint, with status and spec excerpt
- [`testing.md`](testing.md) — endpoint test results and latency observations
- [`limitations.md`](limitations.md) — known limitations
- [`conventions.md`](conventions.md) — porting conventions for new routers

## Read this if you are…

- **…porting a new router**: read [`conventions.md`](conventions.md) and skim the relevant pages under [`design-decisions/`](design-decisions/). Cross-check the target endpoint's stub under [`endpoints/`](endpoints/).
- **…debugging a specific endpoint**: open its file under [`endpoints/`](endpoints/) — the OpenAPI excerpt and status line tell you most of what you need.
- **…doing capacity / latency work**: start with [`testing.md`](testing.md) and the per-design-decision latency tables.
15 changes: 15 additions & 0 deletions clickhouserewrite/collections.md
@@ -0,0 +1,15 @@
# Collections loaded into ClickHouse

| Table | Row count (approx.) or load status | Needed by | Notes |
|---|---|---|---|
| `variants` | ~1.2 billion | `/variants`, `/variants/phenotypes` (verbose), `/phenotypes/variants` (verbose) | Human variants only (FAVOR + IGVF). Primary key is `id` (SPDI-like identifier). Has lean projections on `spdi`, `ca_id`, `hgvs` for fast two-step lookups. See [design-decisions/06-lean-projections.md](design-decisions/06-lean-projections.md). |
| `rsid_to_variant` | — | `/variants?rsid=...` | Auto-updating lookup table (materialized view). Unnests `Array(String)` rsid column into `(rsid, variant_id)` pairs sorted by `rsid`. See [design-decisions/07-rsid-materialized-view.md](design-decisions/07-rsid-materialized-view.md). |
| `variants_phenotypes` | Loaded | `/phenotypes/variants`, `/variants/phenotypes` | Edge table linking variants to ontology terms. FK columns: `variants_id`, `ontology_terms_id`. |
| `ontology_terms` | Loaded | `/phenotypes/variants` (phenotype name resolution) | Joined to resolve phenotype names from IDs. |
| `studies` | Loaded | `/phenotypes/variants` (verbose GWAS), `/variants/phenotypes` (verbose GWAS) | GWAS study metadata, joined in verbose mode. |
| `variants_phenotypes_studies` | Loaded | `/phenotypes/variants` (GWAS path), `/variants/phenotypes` (GWAS path) | Hyperedge table connecting variant-phenotype pairs to studies. Contains GWAS statistics (`log10pvalue`, `beta`, `p_val`, etc.). |
| `motifs` | Loaded | Not yet used by ported endpoints | Loaded as part of the import tooling validation. |
| `coding_variants` | ~1.56B rows | `/genes/coding-variants/scores`, `/genes/coding-variants/all-scores` | Protein-level coding variant records. `id` = `{gene_name}_{transcript}_{hgvsp}_{hgvsc}` — gene name is the id prefix, enabling implicit PK clustering per gene. |
| `coding_variants_phenotypes` | ~1.1B rows | `/genes/coding-variants/scores`, `/genes/coding-variants/all-scores` | Edge table from coding variants to ontology terms (phenotypes). `id` = `{coding_variants_id}_{ontology_term}_{fileset}`. The `variants` column stores the linked genomic variant FK for assay types where the phenotype is tied to a specific nucleotide change (SGE). Has `proj_by_cv_id` lean projection `(SELECT coding_variants_id, id ORDER BY coding_variants_id)` for efficient two-step lookup by `coding_variants_id`. See [design-decisions/gcv-02-cvp-two-step.md](design-decisions/gcv-02-cvp-two-step.md). |
| `variants_coding_variants` | ~1.56B rows | `/genes/coding-variants/scores` | Edge table from genomic variants to coding variants. `id` = `{variants_id}_{coding_variants_id}`. Has `proj_by_cv_id` lean projection `(SELECT coding_variants_id, variants_id ORDER BY coding_variants_id)` which fully satisfies the Step D query by primary key. See [design-decisions/gcv-04-vcv-projection.md](design-decisions/gcv-04-vcv-projection.md). |
| `variants_variants` | ~12B rows (post-symmetrization; ~6B unique LD pairs) | `/variants/variant-ld` | TopLD linkage-disequilibrium pairs across AFR/EAS/EUR/SAS ancestries. Each LD pair stored twice (once per direction) so that `WHERE variants_1_id = ?` is a single-column equality on the PK prefix. `ORDER BY (variants_1_id, ancestry, variants_2_id)`, `r2`/`d_prime` are Float32, `id` column dropped. Largest table in the catalog. See [design-decisions/09-symmetrize-edge-tables.md](design-decisions/09-symmetrize-edge-tables.md). |
17 changes: 17 additions & 0 deletions clickhouserewrite/conventions.md
@@ -0,0 +1,17 @@
# Porting conventions

All other routers in `src/routers/datatypeRouters/edges/` and `src/routers/datatypeRouters/nodes/` still use AQL. The pattern established by `variants.ts` and `variants_phenotypes.ts` can be followed for each:

1. Replace AQL with parameterized ClickHouse SQL — see [design-decisions/01-parameterized-queries.md](design-decisions/01-parameterized-queries.md)
2. Use `chQuery()` with `query_params` for all user input
3. Use two-step enrichment for verbose mode JOINs against large tables — see [design-decisions/02-two-step-enrichment.md](design-decisions/02-two-step-enrichment.md)
4. Use three-query pagination for endpoints that merge results from multiple sources — see [design-decisions/03-three-query-pagination.md](design-decisions/03-three-query-pagination.md)
5. Use lean projections + two-step ID resolution for high-cardinality string lookups on large tables — see [design-decisions/06-lean-projections.md](design-decisions/06-lean-projections.md)
6. Use materialized view lookup tables for array column lookups (e.g. `rsid`) — see [design-decisions/07-rsid-materialized-view.md](design-decisions/07-rsid-materialized-view.md)
7. Push regions down as subqueries on the `variants` table for edge endpoints — see [design-decisions/08-region-queries.md](design-decisions/08-region-queries.md)

## Important: always use optimized variant lookups

Any endpoint that resolves variant identifiers (`spdi`, `hgvs`, `ca_id`, `rsid`) to variant IDs **must** use the optimized lookup paths: lean projections for `spdi`/`hgvs`/`ca_id`, and the `rsid_to_variant` materialized view for `rsid`. Never filter the `variants` table with `WHERE spdi = ...`, `WHERE hgvs = ...`, or `WHERE has(rsid, ...)` in a query that selects full rows, or in a subquery that is not served by a lean projection.

This applies to both direct variant queries (`/variants`) and any edge endpoint that accepts variant identifiers as input (e.g. `/variants/phenotypes`, `/variants/genes`, etc.). The `variantIDSearch()` and `findVariantIDByRSID()` functions in `variants.ts` already use the optimized paths and should be reused by all edge routers that resolve variant identifiers. Bypassing these functions with direct `has(rsid, ...)` or unaliased `WHERE spdi = ...` queries against the 1.2B-row `variants` table will result in 60s+ timeouts.
8 changes: 8 additions & 0 deletions clickhouserewrite/data-loading.md
@@ -0,0 +1,8 @@
# Data loading tooling

| File | Purpose |
|---|---|
| `data/db/generate_import.py` | Python script that generates ClickHouse `INSERT INTO ... SELECT ... FROM s3(...)` YAML statements from a `.sql` schema file. Handles PK→`_key`, FK→`_from`/`_to` transforms, backtick-quoting for special column names, and uses actual SQL types for S3 schema fields. |
| `data/db/schema/clickhouse_import.yaml` | Collection of `INSERT` statements for all tables, aligned to the generated schemas. |

See [`collections.md`](collections.md) for the resulting table inventory.
18 changes: 18 additions & 0 deletions clickhouserewrite/design-decisions/01-parameterized-queries.md
@@ -0,0 +1,18 @@
# Parameterized queries for SQL injection prevention

The `variants.ts` prototype used an `esc()` helper with string interpolation to build SQL. For `variants_phenotypes.ts`, we switched to ClickHouse's native parameterized queries using the `{name:Type}` syntax with `query_params`:

```typescript
async function chQuery<T = any>(sql: string, params?: QueryParams): Promise<T[]> {
  const resultSet = await db.query({
    query: sql,
    query_params: params,
    format: 'JSONEachRow'
  })
  return await resultSet.json()
}
```

All user-supplied values (phenotype IDs, method, class, label, fileset names) go through `query_params` and are never interpolated into the SQL string. This eliminates SQL injection by design.

The one exception is the `sqlInList()` helper used for arrays of internal IDs (from `variantIDSearch()` or the VP page query). These IDs originate from our own database queries, not from user input, and are single-quote escaped as a safety measure.
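
As an illustration of the escaping described above, here is a minimal sketch of what a helper like `sqlInList()` could look like. The name `sqlInListSketch` and its exact behavior (including rejecting an empty list) are assumptions for this example; the real helper in the router may differ.

```typescript
// Hypothetical sketch of a helper like `sqlInList()`; not the actual implementation.
// Renders internal IDs as a quoted SQL list for use in `WHERE id IN (...)`.
function sqlInListSketch(ids: string[]): string {
  if (ids.length === 0) {
    // Assumption: callers short-circuit on an empty page instead of sending `IN ()`.
    throw new Error('sqlInListSketch: empty ID list')
  }
  const quoted = ids.map((id) => {
    // Escape backslashes first, then single quotes, so the backslashes
    // added for quotes are not doubled afterwards.
    const escaped = id.replace(/\\/g, '\\\\').replace(/'/g, "\\'")
    return `'${escaped}'`
  })
  return `(${quoted.join(', ')})`
}

const inClause = sqlInListSketch(['NC_000001.11:91420:T:C', "odd'id"])
console.log(`SELECT id FROM variants WHERE id IN ${inClause}`)
```

Because the IDs come from earlier database results, the escaping is a second line of defense rather than the primary one; anything a user typed should still go through `query_params`.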
20 changes: 20 additions & 0 deletions clickhouserewrite/design-decisions/02-two-step-enrichment.md
@@ -0,0 +1,20 @@
# Two-step enrichment instead of JOINs for verbose mode

The initial implementation joined the `variants` table directly in the query:

```sql
SELECT ... FROM variants_phenotypes vp
LEFT JOIN variants v ON v.id = vp.variants_id -- 1.2B rows!
WHERE ...
```

This caused ClickHouse to build a hash table from the entire `variants` table (1.2B rows) in memory, resulting in a 60-second timeout. ClickHouse's hash JOIN loads the right-side table entirely into memory before evaluating the join condition.

The fix splits the work into two phases, a base page query followed by batched enrichment, carried out as four steps:

1. Run the base query (no variant/study JOINs) to get the page of results
2. Collect unique `variants_id` and `studies_id` values from results
3. Batch-fetch details using primary key lookups: `SELECT ... FROM variants WHERE id IN ('id1', 'id2', ...)`
4. Merge in TypeScript using `Map<string, any>`

Since the page size is at most 100 rows, the `IN` clause in step 3 contains at most 100 IDs, and the primary key lookup is fast. This brought verbose mode from a 60-second timeout down to ~600ms for the phenotypes endpoint and ~3.6s for the variants endpoint (the extra time comes from `variantIDSearch`, which must scan the variants table for the initial ID resolution).
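
The collect-and-merge steps (2 and 4) can be sketched as plain TypeScript. This is a simplified illustration, not the router's code: the `PageRow` and `VariantDetail` shapes and the function names are assumptions, and only the variant side is shown (studies would follow the same pattern).

```typescript
// Hypothetical row shapes for illustration only.
interface PageRow { id: string; variants_id: string }
interface VariantDetail { id: string; [key: string]: unknown }

// Step 2: collect the distinct foreign keys that appear on the current page.
function uniqueVariantIds(rows: PageRow[]): string[] {
  return Array.from(new Set(rows.map((r) => r.variants_id)))
}

// Step 4: attach the batch-fetched details to each page row, keeping page order.
// Rows whose variant was not returned by the batch fetch get `variant: null`.
function mergeVariants(rows: PageRow[], details: VariantDetail[]) {
  const byId = new Map<string, VariantDetail>()
  for (const d of details) byId.set(d.id, d)
  return rows.map((r) => ({ ...r, variant: byId.get(r.variants_id) ?? null }))
}

const page: PageRow[] = [
  { id: 'vp1', variants_id: 'NC_000001.11:91420:T:C' },
  { id: 'vp2', variants_id: 'NC_000001.11:91420:T:C' }
]
console.log(uniqueVariantIds(page).length) // one ID to fetch for two page rows
```

Deduplicating before the batch fetch keeps the `IN` list no larger than the page, and the `Map` makes the merge linear in the page size.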
17 changes: 17 additions & 0 deletions clickhouserewrite/design-decisions/03-three-query-pagination.md
@@ -0,0 +1,17 @@
# Three-query pagination for combined IGVF + GWAS results

When no `source` filter is specified, both IGVF and GWAS records need to be returned with correct pagination. IGVF and GWAS results have different schemas (IGVF has `score`; GWAS has `log10pvalue`, `beta`, study references, etc.), making a `UNION ALL` cumbersome.

The approach:

1. **Page query**: Fetch the VP record IDs for the current page, including their `source`:
   ```sql
   SELECT vp.id, vp.source FROM variants_phenotypes vp
   WHERE ... ORDER BY vp.id LIMIT {lim} OFFSET {off}
   ```
2. **Detail queries**: Split the IDs by source and run the IGVF and GWAS detail queries in parallel
3. **Merge**: Reconstruct results in the original page order using a `Map`

This gives correct global pagination across both sources — the VP table is the source of ordering, and each VP record produces either an IGVF or GWAS result.

When `source` is explicitly specified (the common case for API consumers), this simplifies to a single query.
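
The split and merge steps can be sketched as follows. This is an illustrative sketch only: it assumes the `source` column holds the literal string `'IGVF'` for IGVF records and treats everything else as GWAS, and the type and function names are made up for the example.

```typescript
interface PageEntry { id: string; source: string }
interface Detail { id: string }

// Step 2 (first half): partition the page's IDs by source.
// Assumption: `source === 'IGVF'` marks IGVF records; all others go to the GWAS query.
function splitBySource(page: PageEntry[]): { igvf: string[]; gwas: string[] } {
  const igvf: string[] = []
  const gwas: string[] = []
  for (const entry of page) {
    if (entry.source === 'IGVF') {
      igvf.push(entry.id)
    } else {
      gwas.push(entry.id)
    }
  }
  return { igvf, gwas }
}

// Step 3: rebuild the page in its original order from the detail batches.
function mergeInPageOrder<T extends Detail>(page: PageEntry[], batches: T[][]): T[] {
  const byId = new Map<string, T>()
  for (const batch of batches) {
    for (const detail of batch) byId.set(detail.id, detail)
  }
  const ordered: T[] = []
  for (const entry of page) {
    const detail = byId.get(entry.id)
    if (detail !== undefined) ordered.push(detail)
  }
  return ordered
}
```

In a router, the two detail queries would typically be awaited together (for example with `Promise.all`) and their results passed as `batches`.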
8 changes: 8 additions & 0 deletions clickhouserewrite/design-decisions/04-range-filter-parsing.md
@@ -0,0 +1,8 @@
# `parseRangeFilter` for log10pvalue

The original AQL code used `getFilterStatements()` to parse range expressions. The ClickHouse version has a standalone `parseRangeFilter()` that supports `gte:5`, `lt:10`, `range:5-10`, and bare numbers. It returns a structured object used by `pvalueCondition()` to build parameterized SQL conditions:

```typescript
// Input: "gte:5" → Output: vps.log10pvalue >= {_pval:Float64} with params { _pval: 5 }
// Input: "range:5-10" → Output: vps.log10pvalue >= {_pval_lo:Float64} AND ... < {_pval_hi:Float64}
```
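
A rough sketch of how such a parser and condition builder might fit together is shown below. It is not the actual `parseRangeFilter()` / `pvalueCondition()` code: the `Sketch` names, the `RangeFilter` shape, treating a bare number as equality, and repeating `vps.log10pvalue` on the upper bound are all assumptions made for the example.

```typescript
type RangeFilter =
  | { kind: 'cmp'; op: '>=' | '<' | '='; value: number }
  | { kind: 'range'; lo: number; hi: number }

// Parses "gte:5", "lt:10", "range:5-10", or a bare number.
// Assumption: a bare number means equality.
function parseRangeFilterSketch(input: string): RangeFilter {
  const range = /^range:(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)$/.exec(input)
  if (range) return { kind: 'range', lo: Number(range[1]), hi: Number(range[2]) }
  const cmp = /^(gte|lt):(-?\d+(?:\.\d+)?)$/.exec(input)
  if (cmp) return { kind: 'cmp', op: cmp[1] === 'gte' ? '>=' : '<', value: Number(cmp[2]) }
  if (/^-?\d+(?:\.\d+)?$/.test(input)) return { kind: 'cmp', op: '=', value: Number(input) }
  throw new Error(`Unsupported range filter: ${input}`)
}

// Turns the parsed filter into a parameterized condition plus its query_params.
function pvalueConditionSketch(f: RangeFilter): { sql: string; params: Record<string, number> } {
  if (f.kind === 'range') {
    return {
      sql: 'vps.log10pvalue >= {_pval_lo:Float64} AND vps.log10pvalue < {_pval_hi:Float64}',
      params: { _pval_lo: f.lo, _pval_hi: f.hi }
    }
  }
  // `f.op` comes from a fixed set of operators, never from user text.
  return { sql: `vps.log10pvalue ${f.op} {_pval:Float64}`, params: { _pval: f.value } }
}

console.log(pvalueConditionSketch(parseRangeFilterSketch('gte:5')).sql)
```

Only the operator, chosen from a closed set, is interpolated into the SQL text; the numeric bounds always travel as parameters.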
@@ -0,0 +1,8 @@
# ID prefix preservation for API compatibility

In ArangoDB, `record._from` returns values like `variants/NC_000001.11:91420:T:C`. In ClickHouse, the FK column `variants_id` stores just `NC_000001.11:91420:T:C`. API consumers may depend on the collection prefix, so non-verbose mode prepends it:

```typescript
variant: `variants/${row.variants_id}`,
study: `studies/${row.studies_id}`
```
22 changes: 22 additions & 0 deletions clickhouserewrite/design-decisions/06-lean-projections.md
@@ -0,0 +1,22 @@
# Lean projections and the two-step query strategy

High-cardinality string lookups (`spdi`, `hgvs`, `ca_id`) on the 1.2 billion-row `variants` table are inherently expensive because the primary key order (`id`) does not help these filters, and the `annotations` JSON column is large (~2KB+ per row). Early approaches used bloom_filter data-skipping indexes, which reduced query times from ~55s to ~7s, and then "fat" projections (containing all columns, re-sorted by the lookup column), which achieved ~12s cold cache / ~120ms warm cache. The fat projections were slow on cold cache because each granule included the bulky `annotations` column, meaning a single lookup read ~16MB from disk.

**Lean projections** solve this by storing only the lookup column and the primary key:

```sql
ALTER TABLE variants ADD PROJECTION proj_spdi_lean (SELECT id, spdi ORDER BY spdi);
ALTER TABLE variants ADD PROJECTION proj_ca_id_lean (SELECT id, ca_id ORDER BY ca_id);
ALTER TABLE variants ADD PROJECTION proj_hgvs_lean (SELECT id, hgvs ORDER BY hgvs);
```

A lean projection granule is ~200-300KB compressed (vs ~16MB for a fat projection), making cold-cache reads near-instant (~15ms).

**Two-step query strategy** in `variants.ts`:

1. **Step 1 — Resolve IDs**: `SELECT id FROM variants WHERE spdi = {v:String}` uses the lean projection. Only reads the two-column projection data (~300KB). Returns the matching primary key(s) in ~15ms.
2. **Step 2 — Fetch full rows**: `SELECT ... FROM variants v WHERE v.id = {id:String}` uses the primary key. Reads only the specific row's data part, including annotations.

This is implemented in `resolveViaLeanProjection()` which intercepts `spdi`, `ca_id`, and `hgvs` filters before the main query, resolves them to primary key IDs, and rewrites the WHERE clause to use `v.id = ...` or `v.id IN (...)`.
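
To make the two steps concrete, here is a minimal sketch of how the pair of queries could be planned. It is not the code of `resolveViaLeanProjection()`: the function names, the `PlannedQuery` shape, the `SELECT *` column list, and the use of an `Array(String)` parameter on the right-hand side of `IN` are assumptions for illustration.

```typescript
interface PlannedQuery { query: string; query_params: Record<string, unknown> }

// Column names cannot be passed as value parameters, so they are checked
// against a fixed allow-list before being placed in the SQL text.
const LEAN_LOOKUP_FIELDS = ['spdi', 'ca_id', 'hgvs']

// Step 1: resolve the lookup value to primary key(s); served by the lean projection.
function planResolve(field: string, value: string): PlannedQuery {
  if (LEAN_LOOKUP_FIELDS.indexOf(field) === -1) {
    throw new Error(`No lean projection for field: ${field}`)
  }
  return {
    query: `SELECT id FROM variants WHERE ${field} = {v:String}`,
    query_params: { v: value }
  }
}

// Step 2: fetch full rows by primary key. Returns null when step 1 found nothing,
// so the caller can return an empty result without querying.
function planFetch(ids: string[]): PlannedQuery | null {
  if (ids.length === 0) return null
  if (ids.length === 1) {
    return { query: 'SELECT * FROM variants v WHERE v.id = {id:String}', query_params: { id: ids[0] } }
  }
  // Assumption: the server accepts an Array(String) parameter as the IN set.
  return { query: 'SELECT * FROM variants v WHERE v.id IN {ids:Array(String)}', query_params: { ids } }
}
```

Each planned query could then be handed to a helper such as `chQuery(plan.query, plan.query_params)`.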

For latency numbers, see [`testing.md`](../testing.md).
20 changes: 20 additions & 0 deletions clickhouserewrite/design-decisions/07-rsid-materialized-view.md
@@ -0,0 +1,20 @@
# Materialized view lookup table for `rsid`

The `rsid` column is `Array(String)` — a variant can have multiple rsids, and an rsid can map to multiple variants. Projections cannot unnest arrays, so a **materialized view** provides an auto-updating lookup table:

```sql
CREATE TABLE rsid_to_variant (
rsid String,
variant_id String
) ENGINE = MergeTree() ORDER BY rsid;

CREATE MATERIALIZED VIEW mv_rsid_to_variant TO rsid_to_variant AS
SELECT r AS rsid, id AS variant_id
FROM variants ARRAY JOIN rsid AS r;
```

The `ARRAY JOIN` unpacks each rsid array element into a separate row. The materialized view auto-fires on every `INSERT INTO variants`, so the lookup table stays in sync without manual maintenance. Existing data is backfilled once with `INSERT INTO rsid_to_variant SELECT r, id FROM variants ARRAY JOIN rsid AS r`.

The lookup table is sorted by `rsid`, so a lookup is a binary search on the sort key rather than a scan. The two-step strategy applies: `resolveViaLeanProjection()` also handles rsid by querying `rsid_to_variant` first, then fetching full variant rows by primary key.

**Performance**: rsid lookups went from 60s+ timeout to ~350-400ms cold cache, ~150ms warm cache.
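
As an in-memory analogy for what the materialized view stores, the sketch below unnests an `rsid` array column into `(rsid, variant_id)` pairs sorted by `rsid`. It only mirrors the shape of the data for illustration; the types and function name are invented here, and the real work happens inside ClickHouse.

```typescript
interface VariantRow { id: string; rsid: string[] }
interface RsidPair { rsid: string; variant_id: string }

// Mirrors `ARRAY JOIN rsid AS r`: one output row per array element.
// Variants with an empty rsid array produce no rows, as with a plain ARRAY JOIN.
function unnestRsids(variants: VariantRow[]): RsidPair[] {
  const pairs: RsidPair[] = []
  for (const v of variants) {
    for (const r of v.rsid) pairs.push({ rsid: r, variant_id: v.id })
  }
  // Mirrors `ORDER BY rsid` on the lookup table.
  return pairs.sort((a, b) => (a.rsid < b.rsid ? -1 : a.rsid > b.rsid ? 1 : 0))
}

const pairs = unnestRsids([
  { id: 'v1', rsid: ['rs2', 'rs1'] },
  { id: 'v2', rsid: ['rs1'] },
  { id: 'v3', rsid: [] }
])
console.log(pairs.length) // 3: v3 contributes nothing
```

The example shows both directions of the many-to-many relationship: `v1` carries two rsids, and `rs1` maps to two variants.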
16 changes: 16 additions & 0 deletions clickhouserewrite/design-decisions/08-region-queries.md
@@ -0,0 +1,16 @@
# Region queries — no additional indexing needed, pushed down as subquery for edges

Region queries filter by `chr` and `pos`. The `variants` table primary key is `id` (SPDI format, e.g. `NC_000015.10:32709532:T:A`), which starts with the chromosome reference sequence. This means the MergeTree sort order naturally clusters rows by chromosome, and ClickHouse can skip most granules using the primary key index alone.

For **edge endpoints** that accept `region` as input (e.g. `/variants/phenotypes?region=...`), the variant IDs are NOT materialized in TypeScript. Even a 1kb region can yield thousands of multi-allelic variant IDs, and serializing them via `sqlInList` blows past ClickHouse's `max_query_size` (262144 bytes default). Instead, the region is pushed down as a subquery on the variants table:

```sql
WHERE vp.variants_id IN (
SELECT id FROM variants WHERE chr = {_vp_chr:String}
AND pos >= {_vp_pos_start:Float64} AND pos < {_vp_pos_end:Float64}
)
```

ClickHouse evaluates the inner SELECT efficiently using primary-key granule pruning (variants is sorted by SPDI `id`, which starts with the chromosome RefSeq accession). This pattern should be reused by every edge router that accepts `region` — never round-trip a region's variant IDs through `sqlInList`.
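
A sketch of how an edge router might turn a `region` parameter into this subquery is shown below. The `chr:start-end` input format, the function name, and the validation rules are assumptions for the example; only the parameter names and the SQL shape are taken from the snippet above.

```typescript
interface RegionFilter {
  sql: string
  params: { _vp_chr: string; _vp_pos_start: number; _vp_pos_end: number }
}

// Assumption: regions arrive as "chr<name>:<start>-<end>", e.g. "chr15:32709000-32710000".
function regionSubquerySketch(region: string): RegionFilter {
  const m = /^(chr[0-9A-Za-z]+):(\d+)-(\d+)$/.exec(region)
  if (!m) throw new Error(`Invalid region: ${region}`)
  const start = Number(m[2])
  const end = Number(m[3])
  if (start >= end) throw new Error(`Empty region: ${region}`)
  return {
    // The variant IDs never leave ClickHouse; only three scalars are sent.
    sql:
      'vp.variants_id IN (' +
      'SELECT id FROM variants WHERE chr = {_vp_chr:String} ' +
      'AND pos >= {_vp_pos_start:Float64} AND pos < {_vp_pos_end:Float64})',
    params: { _vp_chr: String(m[1]), _vp_pos_start: start, _vp_pos_end: end }
  }
}

const regionFilter = regionSubquerySketch('chr15:32709000-32710000')
console.log(regionFilter.params._vp_chr)
```

The query text stays the same size regardless of how many variants fall in the region, which is what keeps it under `max_query_size`.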

For latency numbers, see [`testing.md`](../testing.md).