Parallel DuckDB build + deploy (warehouse-duckdb) by fgregg · Pull Request #25 · labordata/warehouse

fgregg · 2026-05-27T23:05:33Z

Ships the same source data as the SQLite build, converted to DuckDB and deployed to a separate Fly app (warehouse-duckdb) with its own machine, volume, and hostname. The SQLite production app (warehouse / labordata.bunkum.us) is never touched by this workflow.

What's here

.github/workflows/refresh-data-duckdb.yml — make → convert each .db to .duckdb via python -m datasette_duckdb.convert → upload to R2 under the duckdb/ prefix → blue/green deploy. Self-bootstrapping: builds its own image and creates the first machine when none exists; reuses pull-from-r2-direct.sh and smoke-test.sh unchanged (the .duckdb files attach under the same names as make's 13 targets). Runs daily at 07:30 (30 min after the SQLite refresh).
Dockerfile.duckdb / scripts/serve-duckdb.sh — data-less image installing datasette + datasette-duckdb from the duckdb-backend branch of the fork; attaches /data/*.duckdb immutable, no --crossdb, no inspect-file.
datasette.duckdb.yml — config without the sqlite-only canned queries.
fly.duckdb.toml — separate app, own machine + volume (no [mounts], matching fly.toml's reasoning).

Prerequisites

The image installs from git+https://github.com/fgregg/datasette@duckdb-backend — that branch is pushed and includes the converter fix to skip FTS virtual + shadow tables (verified on f7.db: 536,851 rows convert cleanly).
Reuses the existing FLY_API_TOKEN and R2_* secrets.

First run

Trigger Refresh data (DuckDB) via workflow_dispatch. The bootstrap run creates the warehouse-duckdb app + IPs, builds + converts + ships, and promotes.

Known gaps (non-blocking; tracked)

No FTS search yet — the converter doesn't build indexes.
The cross-db union_names canned queries aren't ported (sqlite-specific).
No inspect-data.json (datasette inspect is sqlite-shaped).

🤖 Generated with Claude Code

Ships the same source data as the SQLite build, converted to DuckDB and deployed to a separate Fly app (warehouse-duckdb) with its own machine, volume, and hostname. The SQLite production app is never touched.
- refresh-data-duckdb.yml: make -> convert each .db to .duckdb via `python -m datasette_duckdb.convert` -> upload to R2 under duckdb/ -> blue/green deploy. Self-bootstrapping: builds its own image and creates the first machine when none exists; reuses pull-from-r2-direct.sh and smoke-test.sh unchanged (db names match make's 13 targets).
- Dockerfile.duckdb / serve-duckdb.sh: data-less image installing datasette + datasette-duckdb from the duckdb-backend branch; attaches /data/*.duckdb immutable, no --crossdb, no inspect-file.
- datasette.duckdb.yml: config without the sqlite-only canned queries.
- fly.duckdb.toml: separate app, own machine + volume (no [mounts]).
Known gaps (non-blocking): no FTS search yet (converter doesn't build indexes) and the cross-db union_names canned queries aren't ported.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Lets us bootstrap warehouse-duckdb end-to-end before merging to main (workflow_dispatch requires the workflow on the default branch). Push trigger is scoped to duckdb-parallel-build so main / the SQLite deploy are untouched, and push events promote so the run leaves a live app. Revert both TEMP blocks before merging. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

requirements.txt installs the no_limit_csv datasette fork first, so pip treated datasette-duckdb's unpinned `datasette` dependency as already satisfied and never pulled the duckdb-backend branch. Result: `datasette.backends` missing, and convert.py fails on import (the plugin __init__ eagerly imports the backend). Force-reinstall datasette from the branch last, and add an import smoke-check so a bad install fails fast (before the ~12-min make) instead of at the convert step. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
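The import smoke-check described above could look roughly like the sketch below. It is not the check the workflow actually runs; `missing_modules` and `check_install` are hypothetical helper names, and the only module names taken from this page are `datasette.backends` and `datasette_duckdb`.

```python
import importlib


def missing_modules(names):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:  # ModuleNotFoundError is a subclass
            missing.append(name)
    return missing


def check_install(names):
    """Exit with a readable message if any required module is absent."""
    absent = missing_modules(names)
    if absent:
        raise SystemExit(f"bad install, cannot import: {', '.join(absent)}")
    print("imports ok")
```

Calling `check_install(["datasette.backends", "datasette_duckdb"])` right after the pip step would surface a wrong datasette build before the long `make` starts.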

The image installs datasette + datasette-duckdb from git+https URLs, but python:3.12-slim has no git, so the remote Docker build failed with "Cannot find command 'git'". Add git to the apt install. (The SQLite image avoids this by using archive-zip URLs.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Create staging machine failed fast on the bootstrap run: the image was pushed seconds earlier and Fly's registry 404s the manifest for a bit after push (same propagation lag deploy.yml retries around for machine update). Our single-shot machine run had no retry, and OUT=$(...2>&1) under set -e exited before echoing the error, hiding it. Now: tee the output (visible) and retry up to 15×15s until a Machine ID appears. A 404 pull failure creates no machine, so retrying is safe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
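A minimal sketch of the retry pattern described above, written in Python rather than the workflow's shell. `create` is a hypothetical callable standing in for the machine-creation command, and the "Machine ID: <id>" output format is an assumption, not something confirmed on this page.

```python
import re
import time


def create_with_retry(create, attempts=15, delay=15.0, sleep=time.sleep):
    """Call `create()` until its output contains a machine ID.

    `create` is a hypothetical callable that runs the creation command
    and returns its combined stdout/stderr as text.
    """
    # Assumption: the ID appears as "Machine ID: <id>" in the output.
    pattern = re.compile(r"Machine ID:\s*(\S+)")
    for attempt in range(1, attempts + 1):
        output = create()
        print(output)  # always echo, so a failure stays visible in the CI log
        match = pattern.search(output)
        if match:
            return match.group(1)
        if attempt < attempts:
            sleep(delay)
    raise RuntimeError(f"no machine ID after {attempts} attempts")
```

Echoing the output on every attempt mirrors the `tee` fix: the error text is printed before anything can abort the step.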

Root cause of the smoke-test failure: serve-duckdb.sh attached the files with `-i`, which opens them with datasette's default SQLite backend. It then tried to read each .duckdb file as SQLite and crashed datasette on startup (sqlite3.DatabaseError: file is not a database), crash-looping the machine. Earlier steps passed because hallpass/SSH runs independently of datasette, and the empty-/data first boot had nothing to misread. The datasette-duckdb backend mounts databases through plugin config (plugins.datasette-duckdb.databases), not -i. So discover /data/*.duckdb at boot, merge them into a runtime copy of datasette.duckdb.yml, and serve that. Validated locally: datasette serves nlrb via the duckdb backend and `select count(*) from docket` returns 2,053,059, no startup error. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
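The discover-and-merge step can be sketched as follows. This is an illustration, not the contents of serve-duckdb.sh: `build_runtime_config` is a hypothetical helper, the config is handled as a plain dict instead of being read from datasette.duckdb.yml, and the `{name: path}` shape under `plugins.datasette-duckdb.databases` is an assumption about what the plugin accepts.

```python
import glob
import json
import os


def build_runtime_config(base_config, data_dir):
    """Return a copy of `base_config` with every *.duckdb file in `data_dir` mounted."""
    config = json.loads(json.dumps(base_config))  # deep copy of plain JSON-like data
    plugin = config.setdefault("plugins", {}).setdefault("datasette-duckdb", {})
    databases = plugin.setdefault("databases", {})
    for path in sorted(glob.glob(os.path.join(data_dir, "*.duckdb"))):
        name = os.path.splitext(os.path.basename(path))[0]
        # Assumption: the plugin takes a {database name: file path} mapping.
        databases[name] = path
    return config
```

Working on a copy keeps the checked-in config untouched, so the runtime file can be regenerated on every boot from whatever is on the volume.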

The smoke test timed out because the first /-/databases.json took ~60s. Diagnosed on a debug machine (logs read directly): not memory (1.7GB free) and not DuckDB (raw introspection of 13 dbs is instant) — it's datasette's one-time cold schema scan, and `--setting trace_debug 1` (datasette-pretty-traces capturing a Python stack per query) inflates it ~40x. Measured: cold /-/databases.json 59.5s with trace_debug vs 1.54s without. The SQLite site hides this behind --inspect-file; DuckDB has no inspect file so it pays the full scan. trace_debug is just a debugging aid, so drop it here. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Smoke still fails with real data (dies ~13s, no output) where synthetic 13-db tests on a debug machine passed — so it's a real-data startup issue I need the actual staging machine's logs for. Add a failure-time step that dumps flyctl logs + memory/process state into CI output, and stop tearing down staging on failure so I can SSH in. Both marked TEMP; revert once green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Real-data smoke crash root cause (exit_code=3, not OOM): the workflow's upload step (`aws s3 sync --include "*.duckdb"`) and populate step (`ASSETS=$(ls *.duckdb)`) both matched the repo's Dockerfile.duckdb, so the Dockerfile got shipped to R2 and pulled onto the volume. Then serve-duckdb.sh's glob('/data/*.duckdb') picked it up, the plugin tried to open the Dockerfile text as a DuckDB file, and datasette crashed on startup — crash-loop, no logs after first boot. Renaming to Dockerfile-duckdb (+ matching fly.duckdb.toml) removes the class of bug at the root, no defensive filters needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
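The name collision is easy to reproduce. The sketch below uses Python's `fnmatch` as a stand-in for the shell and `aws s3 sync` globbing, which is an approximation of their matching rules rather than the same implementation.

```python
from fnmatch import fnmatch

PATTERN = "*.duckdb"

# File names from the commit message above, plus one real data file.
names = ["nlrb.duckdb", "Dockerfile.duckdb", "Dockerfile-duckdb", "fly.duckdb.toml"]

# Anything ending in ".duckdb" matches, whether or not it is a database.
matched = [name for name in names if fnmatch(name, PATTERN)]
print(matched)
```

With the old name the Dockerfile lands in the match list next to the data file; after the rename to Dockerfile-duckdb it no longer does, which is why no extra filter is needed.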

Belongs with the previous rename commit. It was dropped from that commit because my `git add` failed on the pathspec for the (now-renamed) Dockerfile.duckdb, which left fly.duckdb.toml unstaged when the commit was made. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Diagnosed by reading the staging machine directly: datasette starts fine, but its cold /-/databases.json over the real ~4GB of duckdb files on shared-cpu-1x takes 245s (measured). The smoke test's 30s per-request timeout was tripping on the in-progress scan. The SQLite track avoids this with --inspect-file; we don't have a DuckDB equivalent yet, so the runtime scan happens once per process. Insert a warmup step between Wait-for-SSH and Smoke that patiently hits /-/databases.json (600s timeout). After it completes the catalog is cached and smoke is instant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previous warmup hit "Connection refused" 0.5s in: it tried databases.json while datasette was still attaching the 13 dbs (Wait-for-SSH only checks Fly's hallpass daemon, not datasette). Mirror smoke-test.sh's pattern — phase 1 polls versions.json until uvicorn is bound (up to 180s), phase 2 drives the cold scan (up to 600s). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
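The two-phase warmup can be sketched like this. It is a Python approximation of the idea, not the workflow's script: `http_ok`, `poll`, and `wait_until_warm` are hypothetical names, and the 5s per-request timeout and 2s poll interval in phase 1 are assumed values (only the 180s and 600s budgets come from the commit message).

```python
import time
import urllib.request


def http_ok(url, timeout):
    """Return True if `url` answers with HTTP 200 within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeouts, and HTTP errors
        return False


def poll(check, budget, interval=2.0, clock=time.monotonic, sleep=time.sleep):
    """Call `check()` until it returns True or `budget` seconds elapse."""
    deadline = clock() + budget
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def wait_until_warm(base_url, check=http_ok, **poll_kwargs):
    # Phase 1: wait for the server to bind, using a cheap endpoint.
    if not poll(lambda: check(f"{base_url}/-/versions.json", 5), 180, **poll_kwargs):
        return "not bound"
    # Phase 2: drive the cold catalog scan with one patient request.
    if not check(f"{base_url}/-/databases.json", 600):
        return "scan timed out"
    return "warm"
```

Separating the phases keeps "connection refused" during startup from being confused with a slow but progressing catalog scan.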

Bootstrap is green: https://warehouse-duckdb.fly.dev is live, serving real warehouse data via the DuckDB backend (verified versions.json, databases.json after ~245s cold scan, nlrb query returning 2,070,197 docket rows).
Restore:
- on:push removed (back to schedule + workflow_dispatch only)
- PROMOTE no longer set on push
- Tear-down on failure restored (`(env.PROMOTE != 'true' || failure())`)
Kept: the on-failure diagnostics step (drops machine logs + memory / process state into CI output before teardown). Without it the staging machine self-destructs and the next breakage is invisible, which is exactly the gap that made this bootstrap so iterative.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mirror the SQLite track's deploy.yml + refresh-data.yml structure: image lifecycle in deploy-duckdb.yml (on push to main), data lifecycle in refresh-data-duckdb.yml (on schedule). Code/serve-script iterations drop from ~22 min to ~5 min because refresh no longer re-builds the image on every run; data refreshes can also skip the image rebuild and just reuse the current machine's image.
deploy-duckdb.yml (new), directly modeled on deploy.yml:
build (flyctl deploy --build-only --push, pin digest)
-> roll onto role=current via flyctl machine update (retry for ~5 min against registry-propagation 404s)
-> verify running /etc/build-sha matches GITHUB_SHA
-> warm up datasette (machine update restarts -> cold cache)
refresh-data-duckdb.yml, slimmed:
- Dropped the in-workflow `Build image` step.
- Dropped `Ensure app + IPs exist` (bootstrap is now manual one-time, like the SQLite app; documented in the header).
- `Discover current machine + volume + image` now also captures `.config.image`; staging machine boots from THAT, so the workflow never builds.
- Added `Warm up promoted machine` step between Promote and Destroy, since `flyctl machine update --port` restarts the machine and drops the warm cache (same warmup script as the pre-smoke one).
- Simplified post-bootstrap guards (no more empty-OLD branch).
Bootstrap remains a one-time manual step: create the app + IPs + first machine by hand (the warehouse-duckdb app already exists with a role=current machine, so this is moot going forward).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Re-enable on-commit firing scoped to duckdb-parallel-build so we can iterate either path before merging:
- deploy-duckdb.yml: push to branch fires the ~5min image roll
- refresh-data-duckdb.yml: push to branch fires the full ~22min refresh (PROMOTE=true on push so the run leaves a live URL)
Concurrency group warehouse-duckdb-deploy serializes them, so a push that touches both paths runs deploy first (5min), then refresh (using the just-rolled image). Revert all three TEMP blocks before merging.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Install datasette from duckdb-deploy (the new merge branch = duckdb-backend ∪ no_limit_csv), which carries both the new `datasette inspect -c` flag (backend-neutral inspect for plugin-mounted dbs) and no_limit_csv's uncapped count(*).
Refresh:
- After convert, build inspect-data.json by mounting the local .duckdb files via plugin config and running `datasette inspect -c`. Counts are true totals (verified locally: docket 2,053,059 etc.).
- Upload + populate inspect-data.json alongside the .duckdb files.
- Drop both warmup steps (pre-smoke + post-promote): first /-/databases.json drops from ~245s to ~20ms with the inspect file (measured locally).
Deploy: drop the post-roll warmup step for the same reason.
serve-duckdb.sh: pass --inspect-file when /data/inspect-data.json exists, matching serve.sh on the SQLite track.
Removed scripts/warmup-datasette.sh (no longer used).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Refresh hit smoke at 121s but datasette took 121s to bind :8080 over 13 attached duckdb dbs (plugin discovery + 13 startup-hook add_database + 13 check_databases on shared-cpu-1x). smoke-test.sh polls only 120s internally, so this was a tight miss. The fix isn't to bump the shared smoke script; it's an explicit wait-for-ready in three spots:
- before smoke (refresh), so smoke runs against a bound datasette
- after promote ports (refresh), so we don't cordon the old machine while the new one is still starting (would leak 502s to visitors)
- after rootfs verify (deploy), so the workflow doesn't end while datasette is still starting (verify uses hallpass, not datasette)
Plus bump fly.duckdb.toml health-check grace_period 120s -> 240s so Fly doesn't kill the machine mid-boot.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pulls in fgregg/datasette@duckdb-deploy a2c98f1d (and its parent c788edea on backend-abstraction): use the dialect-escaped table name in the inspect-data cache check. With this, /nlrb/allegation should show the real count from inspect-data.json (727,373) instead of >10,000 estimate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds the table.py + table.html companion to the cache check fix — SQL aliases count(*) to 'count' so the JS can read the column key generically (DuckDB returns 'count_star()', SQLite 'count(*)'). Without this the click-for-true-count link silently fails on DuckDB.
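The SQLite half of the column-name difference can be checked with the standard library alone. The sketch below uses an in-memory table named `docket` purely for illustration; the DuckDB name `count_star()` is taken from the commit message and is not exercised here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("create table docket (id integer)")
conn.executemany("insert into docket values (?)", [(1,), (2,), (3,)])

unaliased = conn.execute("select count(*) from docket").fetchone()
aliased = conn.execute("select count(*) as count from docket").fetchone()

print(unaliased.keys())  # SQLite names the column after the expression text
print(aliased.keys())    # the alias gives a key that does not depend on the backend
```

Reading the aliased key means the client code no longer has to know which backend produced the row.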

Docker layer caching meant pip install of git+...@duckdb-deploy reused the cached layer across deploys (the URL string is identical even when the branch tip moves) -- last two deploys silently rolled out the SAME stale datasette code. /etc/build-sha matched but the datasette inside was the previous one. Resolve the current commit SHA of duckdb-deploy in the deploy workflow and pass as --build-arg DATASETTE_REF=<sha>. The Dockerfile declares ARG DATASETTE_REF=duckdb-deploy before the pip RUN and uses ${DATASETTE_REF} in the URL, so the layer key now changes whenever datasette has new commits.
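Why the branch name produced a stale layer can be shown with a toy model. The sketch below is a deliberate simplification and not Docker's actual cache algorithm: `layer_key` is a hypothetical function that hashes the parent key, the instruction text, and the build args in scope. The two SHAs are the duckdb-deploy commits quoted elsewhere on this page.

```python
import hashlib


def layer_key(parent_key, instruction, build_args):
    """Toy cache key: hash of parent key, instruction text, and build args."""
    args = ",".join(f"{k}={v}" for k, v in sorted(build_args.items()))
    return hashlib.sha256(f"{parent_key}|{instruction}|{args}".encode()).hexdigest()[:12]


PIP = "RUN pip install git+https://github.com/fgregg/datasette@${DATASETTE_REF}"

# Branch name as the ref: the inputs are identical on every build,
# so the key stays the same even after the branch tip moves.
first_build = layer_key("base", PIP, {"DATASETTE_REF": "duckdb-deploy"})
later_build = layer_key("base", PIP, {"DATASETTE_REF": "duckdb-deploy"})

# Commit SHA as the ref: a new commit means a new input and a new key.
pinned_old = layer_key("base", PIP, {"DATASETTE_REF": "a2c98f1d"})
pinned_new = layer_key("base", PIP, {"DATASETTE_REF": "cb5bdc65"})

print(first_build == later_build, pinned_old == pinned_new)
```

In this model the only way to invalidate the pip layer is to change one of its inputs, which is what resolving the SHA in the workflow achieves.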

Warehouse overrides datasette's table.html via --template-dir, so the upstream fix on duckdb-deploy doesn't reach the rendered page — the warehouse copy still had data['rows'][0]['count(*)'] which is wrong on DuckDB (returns null since the column is named 'count_star()'). Match the upstream fix here. (The SQLite track is unaffected because SQLite returns 'count(*)' as the column name; the upstream change aliases to 'count' which is what this JS now reads.)

new_rc_cases: datetime(created_at,'utc')||'Z' -> strftime(created_at,'%Y-%m-%dT%H:%M:%SZ') -- created_at is TIMESTAMP in the converted .duckdb file (the SQLite source declared it that way), so strftime takes it directly.
new_lm20_filings: same datetime translation, but receiveDate's underlying SQLite type was TEXT, so the converter leaves it as VARCHAR. Use strftime(try_cast(receiveDate as timestamp), ...) to be type-agnostic. Once #9 (column type inference) lands, the try_cast becomes a no-op and we can drop it.
Also wrap in a CTE because DuckDB follows the SQL standard and doesn't allow referencing SELECT aliases (atom_title / atom_content_html) in WHERE; the outer SELECT against the CTE makes that legal.
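The CTE wrapping can be sketched with the standard-library sqlite3 module, which also accepts this form. The table `filings`, the column `filer`, and the sample rows are hypothetical; only `atom_title` and `receiveDate` are names from the commit message, and the real canned queries are not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table filings (filer text, receiveDate text)")
conn.executemany(
    "insert into filings values (?, ?)",
    [("Acme", "2026-05-01"), (None, "2026-05-02")],
)

# The alias is defined inside the CTE, so the outer WHERE refers to a real
# column of `feed` rather than to an alias of the same SELECT.
rows = conn.execute(
    """
    with feed as (
        select 'New filing: ' || filer as atom_title, receiveDate
        from filings
    )
    select atom_title, receiveDate
    from feed
    where atom_title is not null
    order by receiveDate
    """
).fetchall()
print(rows)
```

Because the filter moves to the outer query, the same statement shape works whether or not the engine tolerates aliases in WHERE.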

Picks up fgregg/datasette@duckdb-deploy cb5bdc65: DateFacet now suggests date facets on DuckDB (was silently never suggested due to the SQLite-only glob probe) and facet_results uses try_cast so it doesn't hard-error on dirty VARCHAR date columns.

Drop --setting allow_facet off (inherited from serve.sh). The SQLite site disables faceting because it's too expensive on 10GB+ SQLite tables; DuckDB is columnar/vectorized and faceting is cheap (group-by over 11M rows ~11ms in profiling), so enable it and rely on facet_time_limit_ms 500 as the per-facet cost guard. This exposes the DateFacet dialect fix (datasette duckdb-deploy cb5bdc65) — date columns now suggest + compute date facets. A feature the SQLite deployment can't afford.

Picks up datasette duckdb-deploy 1cdfd3f6 — table schema / CREATE TABLE display now works for DuckDB-backed tables via duckdb_tables().sql.

Every push fired both deploy and a full refresh; the refresh was immediately cancelled (code changes don't need a data rebuild), producing a cancelled run per push — pure CI noise. Drop the push trigger from refresh; deploy-duckdb.yml still fires on push for fast iteration. Dispatch refresh manually when data actually changes.

This reverts commit 772f306.

Parity with serve.sh (SQLite). The datasette-duckdb backend now supports cross-database queries: at startup the plugin re-backs _memory with DuckDB and ATTACHes every mounted .duckdb (READ_ONLY), so /_memory can join across the databases -- e.g. union_names_crosswalk against nlrb/f7/lm*. DuckDB has no SQLITE_LIMIT_ATTACHED cap, so all ~14 databases are cross-queryable (the SQLite site is limited to the first 10). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

fgregg and others added 30 commits May 27, 2026 19:04

Redeploy: DuckDB get_table_definition (#3)

cd9bc7c

Picks up datasette duckdb-deploy 1cdfd3f6 — table schema / CREATE TABLE display now works for DuckDB-backed tables via duckdb_tables().sql.

Redeploy: rowid capability split, keyset pagination on keyless tables (#13)

7c3678f

datasette duckdb-deploy: DuckDB keyless immutable tables now keyset-paginate (was offset) and mint no rowid row-page permalinks.

Redeploy: restore rowid row-page links on keyless tables (parity w/ SQLite, #12)

119e1ca

Redeploy: drop DuckDB rowid orderability special-casing (parity, #13)

3e2b107

Redeploy: remove unused rowid capability split (just supports_rowid=True)

a91b5e9

fgregg and others added 5 commits May 28, 2026 22:00

Revert "ci: refresh only on schedule/dispatch, not every push"

93299f0

This reverts commit 772f306.

Redeploy: #12 — no rowid column/links on immutable keyless tables

016357b

Merge branch 'main' into duckdb-parallel-build

b8a6c02

Merge branch 'main' into duckdb-parallel-build

e1f4d3e

fgregg wants to merge 35 commits into main from duckdb-parallel-build
