
Parallel DuckDB build + deploy (warehouse-duckdb)#25

Open
fgregg wants to merge 35 commits into main from duckdb-parallel-build

@fgregg fgregg commented May 27, 2026

Ships the same source data as the SQLite build, converted to DuckDB and deployed to a separate Fly app (warehouse-duckdb) with its own machine, volume, and hostname. The SQLite production app (warehouse / labordata.bunkum.us) is never touched by this workflow.

What's here

  • .github/workflows/refresh-data-duckdb.yml: make → convert each .db to .duckdb via python -m datasette_duckdb.convert → upload to R2 under the duckdb/ prefix → blue/green deploy. Self-bootstrapping: builds its own image and creates the first machine when none exists; reuses pull-from-r2-direct.sh and smoke-test.sh unchanged (the .duckdb files attach under the same names as make's 13 targets). Runs daily at 07:30 (30 min after the SQLite refresh).
  • Dockerfile.duckdb / scripts/serve-duckdb.sh — data-less image installing datasette + datasette-duckdb from the duckdb-backend branch of the fork; attaches /data/*.duckdb immutable, no --crossdb, no inspect-file.
  • datasette.duckdb.yml — config without the sqlite-only canned queries.
  • fly.duckdb.toml — separate app, own machine + volume (no [mounts], matching fly.toml's reasoning).
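
A minimal sketch of the naming rule the refresh workflow relies on: each `.db` keeps its stem when converted, so the `.duckdb` file attaches under the same database name and lands under the `duckdb/` prefix in R2. The function name `plan_conversion` is hypothetical, and the sketch only plans names; it does not call the converter or upload anything.

```python
from pathlib import PurePosixPath

R2_PREFIX = "duckdb/"  # prefix named in the PR description


def plan_conversion(db_files):
    """Map each SQLite file to its DuckDB file name and R2 object key.

    Hypothetical helper: it only derives names, it converts nothing.
    """
    plan = []
    for name in sorted(db_files):
        src = PurePosixPath(name)
        if src.suffix != ".db":
            continue  # skip anything that is not a SQLite build output
        dst = src.with_suffix(".duckdb")  # same stem, so same attach name
        plan.append((src.name, dst.name, R2_PREFIX + dst.name))
    return plan
```

For example, `plan_conversion(["nlrb.db", "f7.db"])` pairs `f7.db` with `f7.duckdb` and the key `duckdb/f7.duckdb`.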

Prerequisites

  • The image installs from git+https://github.com/fgregg/datasette@duckdb-backend — that branch is pushed and includes the converter fix to skip FTS virtual + shadow tables (verified on f7.db: 536,851 rows convert cleanly).
  • Reuses the existing FLY_API_TOKEN and R2_* secrets.

First run

Trigger Refresh data (DuckDB) via workflow_dispatch. The bootstrap run creates the warehouse-duckdb app + IPs, builds + converts + ships, and promotes.

Known gaps (non-blocking; tracked)

  • No FTS search yet — the converter doesn't build indexes.
  • The cross-db union_names canned queries aren't ported (sqlite-specific).
  • No inspect-data.json (datasette inspect is sqlite-shaped).

🤖 Generated with Claude Code

fgregg and others added 30 commits May 27, 2026 19:04
Ships the same source data as the SQLite build, converted to DuckDB and
deployed to a separate Fly app (warehouse-duckdb) with its own machine,
volume, and hostname. The SQLite production app is never touched.

- refresh-data-duckdb.yml: make -> convert each .db to .duckdb via
  `python -m datasette_duckdb.convert` -> upload to R2 under duckdb/ ->
  blue/green deploy. Self-bootstrapping: builds its own image and creates
  the first machine when none exists; reuses pull-from-r2-direct.sh and
  smoke-test.sh unchanged (db names match make's 13 targets).
- Dockerfile.duckdb / serve-duckdb.sh: data-less image installing datasette
  + datasette-duckdb from the duckdb-backend branch; attaches /data/*.duckdb
  immutable, no --crossdb, no inspect-file.
- datasette.duckdb.yml: config without the sqlite-only canned queries.
- fly.duckdb.toml: separate app, own machine + volume (no [mounts]).

Known gaps (non-blocking): no FTS search yet (converter doesn't build
indexes) and the cross-db union_names canned queries aren't ported.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lets us bootstrap warehouse-duckdb end-to-end before merging to main
(workflow_dispatch requires the workflow on the default branch). Push
trigger is scoped to duckdb-parallel-build so main / the SQLite deploy
are untouched, and push events promote so the run leaves a live app.

Revert both TEMP blocks before merging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
requirements.txt installs the no_limit_csv datasette fork first, so pip
treated datasette-duckdb's unpinned `datasette` dependency as already
satisfied and never pulled the duckdb-backend branch. Result:
`datasette.backends` missing, and convert.py fails on import (the
plugin __init__ eagerly imports the backend).

Force-reinstall datasette from the branch last, and add an import
smoke-check so a bad install fails fast (before the ~12-min make)
instead of at the convert step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
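
The import smoke-check described above could look roughly like the sketch below. It is an illustration, not the check the workflow actually runs: the helper name `missing_modules` is hypothetical, and the only module name taken from the commit is `datasette.backends`.

```python
import importlib.util


def missing_modules(names):
    """Return the dotted module names that cannot be located."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except ModuleNotFoundError:
            # Raised when a parent package of a dotted name is absent.
            found = False
        if not found:
            missing.append(name)
    return missing
```

A CI step could call `missing_modules(["datasette.backends"])` right after the install and exit non-zero if the list is non-empty, so a stale datasette is caught before the long `make`.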
The image installs datasette + datasette-duckdb from git+https URLs, but
python:3.12-slim has no git, so the remote Docker build failed with
"Cannot find command 'git'". Add git to the apt install. (The SQLite
image avoids this by using archive-zip URLs.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Create staging machine failed fast on the bootstrap run: the image was
pushed seconds earlier and Fly's registry 404s the manifest for a bit
after push (same propagation lag deploy.yml retries around for machine
update). Our single-shot machine run had no retry, and OUT=$(...2>&1)
under set -e exited before echoing the error, hiding it.

Now: tee the output (visible) and retry up to 15×15s until a Machine ID
appears. A 404 pull failure creates no machine, so retrying is safe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
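
The retry described above, sketched in Python rather than the workflow's shell. Assumptions: `run` stands in for whatever invokes the machine-creation command and returns its combined output, and the `Machine ID:` pattern is a guess at the output format, not something the commit quotes.

```python
import re
import time

MACHINE_ID = re.compile(r"Machine ID:\s*(\S+)")  # assumed output format


def create_with_retry(run, attempts=15, delay=15.0, sleep=time.sleep):
    """Call run() until its output contains a machine ID, or give up."""
    for attempt in range(1, attempts + 1):
        output = run()
        print(output)  # keep every attempt visible in the CI log
        match = MACHINE_ID.search(output)
        if match:
            return match.group(1)
        if attempt < attempts:
            sleep(delay)  # wait out the registry propagation lag
    raise RuntimeError(f"no machine ID after {attempts} attempts")
```

Printing each attempt before inspecting it addresses the second half of the bug: the error is shown even when the attempt fails.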
Root cause of the smoke-test failure: serve-duckdb.sh attached the files
with `-i`, which opens them with datasette's default SQLite backend. It
then tried to read each .duckdb file as SQLite and crashed datasette on
startup (sqlite3.DatabaseError: file is not a database), crash-looping
the machine. Earlier steps passed because hallpass/SSH runs independently
of datasette, and the empty-/data first boot had nothing to misread.

The datasette-duckdb backend mounts databases through plugin config
(plugins.datasette-duckdb.databases), not -i. So discover /data/*.duckdb
at boot, merge them into a runtime copy of datasette.duckdb.yml, and serve
that. Validated locally: datasette serves nlrb via the duckdb backend and
`select count(*) from docket` returns 2,053,059, no startup error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
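
A rough sketch of the boot-time merge described above, working on an already-parsed config dict. The key path `plugins.datasette-duckdb.databases` comes from the commit; the shape of its value (database name mapped to file path) and the helper name `with_duckdb_databases` are assumptions, and reading or writing the YAML file is left out.

```python
import copy
import glob
import os


def with_duckdb_databases(base_config, data_dir="/data"):
    """Return a runtime copy of the config with discovered .duckdb files added."""
    config = copy.deepcopy(base_config)  # the checked-in config stays untouched
    plugin = config.setdefault("plugins", {}).setdefault("datasette-duckdb", {})
    databases = plugin.setdefault("databases", {})
    for path in sorted(glob.glob(os.path.join(data_dir, "*.duckdb"))):
        name = os.path.splitext(os.path.basename(path))[0]
        databases[name] = path  # assumed shape: database name -> file path
    return config
```

Working on a copy means a file that disappears from `/data` between boots also disappears from the served config.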
The smoke test timed out because the first /-/databases.json took ~60s.
Diagnosed on a debug machine (logs read directly): not memory (1.7GB
free) and not DuckDB (raw introspection of 13 dbs is instant) — it's
datasette's one-time cold schema scan, and `--setting trace_debug 1`
(datasette-pretty-traces capturing a Python stack per query) inflates it
~40x. Measured: cold /-/databases.json 59.5s with trace_debug vs 1.54s
without. The SQLite site hides this behind --inspect-file; DuckDB has no
inspect file so it pays the full scan. trace_debug is just a debugging
aid, so drop it here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Smoke still fails with real data (dies ~13s, no output) where synthetic
13-db tests on a debug machine passed — so it's a real-data startup issue
I need the actual staging machine's logs for. Add a failure-time step that
dumps flyctl logs + memory/process state into CI output, and stop tearing
down staging on failure so I can SSH in. Both marked TEMP; revert once green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real-data smoke crash root cause (exit_code=3, not OOM): the workflow's
upload step (`aws s3 sync --include "*.duckdb"`) and populate step
(`ASSETS=$(ls *.duckdb)`) both matched the repo's Dockerfile.duckdb, so
the Dockerfile got shipped to R2 and pulled onto the volume. Then
serve-duckdb.sh's glob('/data/*.duckdb') picked it up, the plugin tried
to open the Dockerfile text as a DuckDB file, and datasette crashed on
startup — crash-loop, no logs after first boot.

Renaming to Dockerfile-duckdb (+ matching fly.duckdb.toml) removes the
class of bug at the root, no defensive filters needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
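
The collision is easy to reproduce with Python's `fnmatch`, which applies the same shell-style wildcard rules as the globs in the commit. This sketch only demonstrates the pattern match; the helper name `matching` is hypothetical.

```python
from fnmatch import fnmatchcase


def matching(names, pattern="*.duckdb"):
    """Return the names a shell-style pattern would pick up."""
    return [name for name in names if fnmatchcase(name, pattern)]


names = ["nlrb.duckdb", "f7.duckdb", "Dockerfile.duckdb", "Dockerfile-duckdb"]
picked = matching(names)
# picked contains Dockerfile.duckdb but not Dockerfile-duckdb
```

With the old name the Dockerfile is indistinguishable from a data file by suffix alone; with the hyphenated name no `*.duckdb` glob can match it.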
Belongs with the previous rename commit; got dropped from it because
the failed pathspec on the (now-renamed) Dockerfile.duckdb in my git add
unstaged fly.duckdb.toml between add and commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Diagnosed by reading the staging machine directly: datasette starts
fine, but its cold /-/databases.json over the real ~4GB of duckdb files
on shared-cpu-1x takes 245s (measured). The smoke test's 30s per-request
timeout was tripping on the in-progress scan. The SQLite track avoids
this with --inspect-file; we don't have a DuckDB equivalent yet, so the
runtime scan happens once per process.

Insert a warmup step between Wait-for-SSH and Smoke that patiently hits
/-/databases.json (600s timeout). After it completes the catalog is
cached and smoke is instant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous warmup hit "Connection refused" 0.5s in: it tried databases.json
while datasette was still attaching the 13 dbs (Wait-for-SSH only checks
Fly's hallpass daemon, not datasette). Mirror smoke-test.sh's pattern —
phase 1 polls versions.json until uvicorn is bound (up to 180s), phase 2
drives the cold scan (up to 600s).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
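
The two-phase warmup can be sketched as below. This is an illustration in Python, not the shell script from the branch: the 180s and 600s budgets and the two endpoints come from the commits, while the helper names, the 5s per-probe timeout, and the 2s polling interval are assumptions.

```python
import http.client
import time
import urllib.request


def http_ok(url, timeout):
    """True if url answers 200 within timeout seconds per socket operation."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        return False  # refused, timed out, or a non-2xx answer


def wait_until(probe, budget, interval=2.0, clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns True or budget seconds have passed."""
    deadline = clock() + budget
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def warm_up(base_url):
    # Phase 1: cheap polls until the server is bound and answering.
    if not wait_until(lambda: http_ok(base_url + "/-/versions.json", 5), budget=180):
        return False
    # Phase 2: one patient request that drives the cold schema scan.
    return http_ok(base_url + "/-/databases.json", 600)
```

Splitting the phases keeps "connection refused" during startup from being confused with a slow but progressing scan.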
Bootstrap is green: https://warehouse-duckdb.fly.dev is live, serving
real warehouse data via the DuckDB backend (verified versions.json,
databases.json after ~245s cold scan, nlrb query returning 2,070,197
docket rows). Restore:

- on:push removed (back to schedule + workflow_dispatch only)
- PROMOTE no longer set on push
- Tear-down on failure restored (`(env.PROMOTE != 'true' || failure())`)

Kept: the on-failure diagnostics step (drops machine logs + memory /
process state into CI output before teardown). Without it the staging
machine self-destructs and the next breakage is invisible — exactly the
gap that made this bootstrap so iterative.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror the SQLite track's deploy.yml + refresh-data.yml structure: image
lifecycle in deploy-duckdb.yml (on push to main), data lifecycle in
refresh-data-duckdb.yml (on schedule). Code/serve-script iterations drop
from ~22 min to ~5 min because refresh no longer re-builds the image on
every run; data refreshes can also skip the image rebuild and just reuse
the current machine's image.

deploy-duckdb.yml (new) — directly modeled on deploy.yml:
  build (flyctl deploy --build-only --push, pin digest)
  -> roll onto role=current via flyctl machine update (retry for ~5 min
     against registry-propagation 404s)
  -> verify running /etc/build-sha matches GITHUB_SHA
  -> warm up datasette (machine update restarts -> cold cache)

refresh-data-duckdb.yml — slimmed:
  - Dropped the in-workflow `Build image` step.
  - Dropped `Ensure app + IPs exist` (bootstrap is now manual one-time,
    like the SQLite app — documented in the header).
  - `Discover current machine + volume + image` now also captures
    `.config.image`; staging machine boots from THAT, so the workflow
    never builds.
  - Added `Warm up promoted machine` step between Promote and Destroy,
    since `flyctl machine update --port` restarts the machine and
    drops the warm cache (same warmup script as the pre-smoke one).
  - Simplified post-bootstrap guards (no more empty-OLD branch).

Bootstrap remains a one-time manual step: create the app + IPs + first
machine by hand (the warehouse-duckdb app already exists with a
role=current machine, so this is moot going forward).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-enable on-commit firing scoped to duckdb-parallel-build so we can
iterate either path before merging:
- deploy-duckdb.yml: push to branch fires the ~5min image roll
- refresh-data-duckdb.yml: push to branch fires the full ~22min refresh
  (PROMOTE=true on push so the run leaves a live URL)

Concurrency group warehouse-duckdb-deploy serializes them, so a push that
touches both paths runs deploy first (5min), then refresh (using the
just-rolled image).

Revert all three TEMP blocks before merging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Install datasette from duckdb-deploy (the new merge branch =
duckdb-backend ∪ no_limit_csv), which carries both the new
`datasette inspect -c` flag (backend-neutral inspect for plugin-mounted
dbs) and no_limit_csv's uncapped count(*).

Refresh:
- After convert, build inspect-data.json by mounting the local
  .duckdb files via plugin config and running `datasette inspect -c`.
  Counts are true totals (verified locally: docket 2,053,059 etc.).
- Upload + populate inspect-data.json alongside the .duckdb files.
- Drop both warmup steps (pre-smoke + post-promote) — first
  /-/databases.json drops from ~245s to ~20ms with the inspect file
  (measured locally).

Deploy: drop the post-roll warmup step for the same reason.

serve-duckdb.sh: pass --inspect-file when /data/inspect-data.json
exists, matching serve.sh on the SQLite track.

Removed scripts/warmup-datasette.sh (no longer used).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
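
The conditional `--inspect-file` behaviour can be sketched as follows. This is Python standing in for the shell script: `--inspect-file` and the `/data/inspect-data.json` path come from the commit, while the `--config` flag and the helper name `serve_args` are assumptions about how the rest of the command line is assembled.

```python
import os


def serve_args(config_path, data_dir="/data"):
    """Build a datasette command line, adding --inspect-file only if present."""
    args = ["datasette", "serve", "--config", config_path]  # --config is assumed
    inspect_file = os.path.join(data_dir, "inspect-data.json")
    if os.path.exists(inspect_file):
        args += ["--inspect-file", inspect_file]
    return args
```

Making the flag conditional lets a machine with an empty volume still boot, at the cost of paying the cold scan on that boot.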
Refresh hit smoke at 121s but datasette took 121s to bind :8080 over
13 attached duckdb dbs (plugin discovery + 13 startup-hook add_database
+ 13 check_databases on shared-cpu-1x). smoke-test.sh polls only 120s
internally — tight miss.

The fix isn't to bump the shared smoke script — it's explicit
wait-for-ready in three spots:
- before smoke (refresh) — so smoke runs against a bound datasette
- after promote ports (refresh) — so we don't cordon the old machine
  while the new one is still starting (would leak 502s to visitors)
- after rootfs verify (deploy) — so the workflow doesn't end while
  datasette is still starting (verify uses hallpass, not datasette)

Plus bump fly.duckdb.toml health-check grace_period 120s -> 240s so
Fly doesn't kill the machine mid-boot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pulls in fgregg/datasette@duckdb-deploy a2c98f1d (and its parent
c788edea on backend-abstraction): use the dialect-escaped table name
in the inspect-data cache check. With this, /nlrb/allegation should
show the real count from inspect-data.json (727,373) instead of the
>10,000 estimate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the table.py + table.html companion to the cache check fix —
SQL aliases count(*) to 'count' so the JS can read the column key
generically (DuckDB returns 'count_star()', SQLite 'count(*)').
Without this the click-for-true-count link silently fails on DuckDB.

Docker layer caching meant pip install of git+...@duckdb-deploy reused
the cached layer across deploys (the URL string is identical even when
the branch tip moves) -- last two deploys silently rolled out the SAME
stale datasette code. /etc/build-sha matched but the datasette inside
was the previous one.

Resolve the current commit SHA of duckdb-deploy in the deploy workflow
and pass as --build-arg DATASETTE_REF=<sha>. The Dockerfile declares
ARG DATASETTE_REF=duckdb-deploy before the pip RUN and uses
${DATASETTE_REF} in the URL, so the layer key now changes whenever
datasette has new commits.
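
A sketch of the ref-pinning step, assuming the SHA is looked up with `git ls-remote` (the commit does not say which command the workflow uses). The helper names are hypothetical; the URL form matches the one quoted in the PR description.

```python
def resolve_ref(ls_remote_output, branch):
    """Pick the commit SHA for refs/heads/<branch> from `git ls-remote` output."""
    wanted = f"refs/heads/{branch}"
    for line in ls_remote_output.splitlines():
        sha, _, ref = line.partition("\t")  # each line is "<sha>\t<ref>"
        if ref.strip() == wanted:
            return sha
    raise LookupError(f"{wanted} not found")


def pip_url(ref):
    """Install URL whose text changes whenever the pinned ref changes."""
    return f"git+https://github.com/fgregg/datasette@{ref}"
```

Because the SHA is part of the URL text, the `RUN pip install` instruction differs between builds whenever the branch tip moves, so the cached layer is no longer reused.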
Warehouse overrides datasette's table.html via --template-dir, so the
upstream fix on duckdb-deploy doesn't reach the rendered page — the
warehouse copy still had data['rows'][0]['count(*)'] which is wrong
on DuckDB (returns null since the column is named 'count_star()').

Match the upstream fix here. (The SQLite track is unaffected because
SQLite returns 'count(*)' as the column name; the upstream change
aliases to 'count' which is what this JS now reads.)
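
The SQLite half of this is easy to check with the standard library. The sketch below shows the result column name with and without the alias; the table and its rows are made up for illustration, and the DuckDB name `count_star()` is taken from the commit rather than demonstrated here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table docket (id integer)")  # illustrative table
conn.executemany("insert into docket values (?)", [(1,), (2,), (3,)])


def column_name(sql):
    """Name of the first result column, as a JSON row dict would key it."""
    return conn.execute(sql).description[0][0]


bare = column_name("select count(*) from docket")
aliased = column_name("select count(*) as count from docket")
```

Here `bare` is the backend-specific name and `aliased` is the stable key, which is why reading `count` works on both backends.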
new_rc_cases:
  datetime(created_at,'utc')||'Z' -> strftime(created_at,'%Y-%m-%dT%H:%M:%SZ')
  -- created_at is TIMESTAMP in the converted .duckdb file (sqlite source
  declared it that way) so strftime takes it directly.

new_lm20_filings:
  Same datetime translation, but receiveDate's underlying SQLite type was
  TEXT so the converter leaves it as VARCHAR. Use
    strftime(try_cast(receiveDate as timestamp), ...)
  to be type-agnostic. Once #9 (column type inference) lands the try_cast
  becomes a no-op and we can drop it.

  Also wrap in a CTE because DuckDB follows the SQL standard and doesn't
  allow referencing SELECT aliases (atom_title / atom_content_html) in
  WHERE; the outer SELECT against the CTE makes that legal.
Picks up fgregg/datasette@duckdb-deploy cb5bdc65: DateFacet now
suggests date facets on DuckDB (was silently never suggested due to
the SQLite-only glob probe) and facet_results uses try_cast so it
doesn't hard-error on dirty VARCHAR date columns.

Drop --setting allow_facet off (inherited from serve.sh). The SQLite
site disables faceting because it's too expensive on 10GB+ SQLite
tables; DuckDB is columnar/vectorized and faceting is cheap (group-by
over 11M rows ~11ms in profiling), so enable it and rely on
facet_time_limit_ms 500 as the per-facet cost guard.

This exposes the DateFacet dialect fix (datasette duckdb-deploy
cb5bdc65) — date columns now suggest + compute date facets. A feature
the SQLite deployment can't afford.

Picks up datasette duckdb-deploy 1cdfd3f6 — table schema / CREATE TABLE
display now works for DuckDB-backed tables via duckdb_tables().sql.
#13)

datasette duckdb-deploy: DuckDB keyless immutable tables now use keyset
pagination (previously offset) and no longer mint rowid-based row-page
permalinks.

Every push fired both deploy and a full refresh; the refresh was
immediately cancelled (code changes don't need a data rebuild),
producing a cancelled run per push — pure CI noise. Drop the push
trigger from refresh; deploy-duckdb.yml still fires on push for fast
iteration. Dispatch refresh manually when data actually changes.
fgregg and others added 5 commits May 28, 2026 22:00
Parity with serve.sh (SQLite). The datasette-duckdb backend now supports
cross-database queries: at startup the plugin re-backs _memory with DuckDB and
ATTACHes every mounted .duckdb (READ_ONLY), so /_memory can join across the
databases -- e.g. union_names_crosswalk against nlrb/f7/lm*. DuckDB has no
SQLITE_LIMIT_ATTACHED cap, so all ~14 databases are cross-queryable (the SQLite
site is limited to the first 10).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
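
A sketch of the statements such a startup hook might issue. It uses DuckDB's `ATTACH ... (READ_ONLY)` syntax; the helper name and the rule of using the file stem as the alias are assumptions, and the plugin's actual implementation is not shown in this PR.

```python
import os


def attach_statements(paths):
    """Build one read-only ATTACH statement per .duckdb file."""
    statements = []
    for path in sorted(paths):
        alias = os.path.splitext(os.path.basename(path))[0]
        quoted_path = path.replace("'", "''")    # escape for a SQL string literal
        quoted_alias = alias.replace('"', '""')  # escape for a quoted identifier
        statements.append(f"ATTACH '{quoted_path}' AS \"{quoted_alias}\" (READ_ONLY)")
    return statements
```

Each statement would be executed once on the connection backing `_memory`, after which a query can reference tables as `nlrb.docket`, `f7.<table>`, and so on.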