Parallel DuckDB build + deploy (warehouse-duckdb) #25
Open
fgregg wants to merge 35 commits into
Ships the same source data as the SQLite build, converted to DuckDB and deployed to a separate Fly app (warehouse-duckdb) with its own machine, volume, and hostname. The SQLite production app is never touched.
- refresh-data-duckdb.yml: make -> convert each .db to .duckdb via `python -m datasette_duckdb.convert` -> upload to R2 under duckdb/ -> blue/green deploy. Self-bootstrapping: builds its own image and creates the first machine when none exists; reuses pull-from-r2-direct.sh and smoke-test.sh unchanged (db names match make's 13 targets).
- Dockerfile.duckdb / serve-duckdb.sh: data-less image installing datasette + datasette-duckdb from the duckdb-backend branch; attaches /data/*.duckdb immutable, no --crossdb, no inspect-file.
- datasette.duckdb.yml: config without the sqlite-only canned queries.
- fly.duckdb.toml: separate app, own machine + volume (no [mounts]).
Known gaps (non-blocking): no FTS search yet (converter doesn't build indexes) and the cross-db union_names canned queries aren't ported.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lets us bootstrap warehouse-duckdb end-to-end before merging to main (workflow_dispatch requires the workflow on the default branch). Push trigger is scoped to duckdb-parallel-build so main / the SQLite deploy are untouched, and push events promote so the run leaves a live app. Revert both TEMP blocks before merging. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
requirements.txt installs the no_limit_csv datasette fork first, so pip treated datasette-duckdb's unpinned `datasette` dependency as already satisfied and never pulled the duckdb-backend branch. Result: `datasette.backends` missing, and convert.py fails on import (the plugin __init__ eagerly imports the backend). Force-reinstall datasette from the branch last, and add an import smoke-check so a bad install fails fast (before the ~12-min make) instead of at the convert step. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
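The import smoke-check described above could look roughly like the following. This is a minimal sketch, not the check the workflow actually runs: the helper name `missing_modules` is hypothetical, and the only module name taken from the commit message is `datasette.backends`.

```python
import importlib


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # ModuleNotFoundError is a subclass of ImportError, so a missing
            # parent package (e.g. no `datasette` at all) lands here too.
            missing.append(name)
    return missing
```

In CI this would be called with something like `missing_modules(["datasette.backends"])` right after `pip install`, exiting non-zero when the list is non-empty so the job stops before the long `make` step.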
The image installs datasette + datasette-duckdb from git+https URLs, but python:3.12-slim has no git, so the remote Docker build failed with "Cannot find command 'git'". Add git to the apt install. (The SQLite image avoids this by using archive-zip URLs.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Create staging machine failed fast on the bootstrap run: the image was pushed seconds earlier and Fly's registry 404s the manifest for a bit after push (same propagation lag deploy.yml retries around for machine update). Our single-shot machine run had no retry, and OUT=$(...2>&1) under set -e exited before echoing the error, hiding it. Now: tee the output (visible) and retry up to 15×15s until a Machine ID appears. A 404 pull failure creates no machine, so retrying is safe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
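The retry behaviour described above (print every attempt, retry up to 15 times at 15s intervals until a machine ID shows up) can be sketched as below. The `Machine ID:` output pattern and the `run` callable are assumptions made for illustration; the real step is a shell loop around the Fly CLI.

```python
import re
import time

# Assumed shape of the machine ID line in the command output (hypothetical).
MACHINE_ID = re.compile(r"Machine ID:\s*(\w+)")


def create_with_retry(run, attempts=15, delay=15.0, sleep=time.sleep):
    """Call run() until its output contains a machine ID, then return the ID.

    run() must return the combined stdout/stderr text of one attempt.
    """
    for attempt in range(1, attempts + 1):
        output = run()
        print(output)  # keep every attempt visible in the CI log
        match = MACHINE_ID.search(output)
        if match:
            return match.group(1)
        if attempt < attempts:
            sleep(delay)
    raise RuntimeError(f"no machine ID after {attempts} attempts")
```

Retrying is only safe because, as the commit message notes, a failed manifest pull creates no machine; a command that could partially succeed would need an idempotency check first.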
Root cause of the smoke-test failure: serve-duckdb.sh attached the files with `-i`, which opens them with datasette's default SQLite backend. It then tried to read each .duckdb file as SQLite and crashed datasette on startup (sqlite3.DatabaseError: file is not a database), crash-looping the machine. Earlier steps passed because hallpass/SSH runs independently of datasette, and the empty-/data first boot had nothing to misread. The datasette-duckdb backend mounts databases through plugin config (plugins.datasette-duckdb.databases), not -i. So discover /data/*.duckdb at boot, merge them into a runtime copy of datasette.duckdb.yml, and serve that. Validated locally: datasette serves nlrb via the duckdb backend and `select count(*) from docket` returns 2,053,059, no startup error. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
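As a rough illustration of the boot-time merge, the sketch below discovers `*.duckdb` files and adds them under `plugins.datasette-duckdb.databases` in a copy of the base config. The entry shape (database name mapped to file path) is an assumption; the plugin's real schema may differ, and `runtime_config` is a hypothetical helper, not the code in serve-duckdb.sh.

```python
import copy
import glob
import os


def runtime_config(base_config, data_dir="/data"):
    """Return a copy of base_config with every *.duckdb file in data_dir mounted."""
    config = copy.deepcopy(base_config)
    plugin = config.setdefault("plugins", {}).setdefault("datasette-duckdb", {})
    databases = plugin.setdefault("databases", {})
    for path in sorted(glob.glob(os.path.join(data_dir, "*.duckdb"))):
        name = os.path.splitext(os.path.basename(path))[0]
        # Assumed entry shape: database name -> file path.
        databases[name] = path
    return config
```

Working on a deep copy keeps the checked-in config untouched, so the runtime file can be regenerated on every boot.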
The smoke test timed out because the first /-/databases.json took ~60s. Diagnosed on a debug machine (logs read directly): not memory (1.7GB free) and not DuckDB (raw introspection of 13 dbs is instant) — it's datasette's one-time cold schema scan, and `--setting trace_debug 1` (datasette-pretty-traces capturing a Python stack per query) inflates it ~40x. Measured: cold /-/databases.json 59.5s with trace_debug vs 1.54s without. The SQLite site hides this behind --inspect-file; DuckDB has no inspect file so it pays the full scan. trace_debug is just a debugging aid, so drop it here. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Smoke still fails with real data (dies ~13s, no output) where synthetic 13-db tests on a debug machine passed — so it's a real-data startup issue I need the actual staging machine's logs for. Add a failure-time step that dumps flyctl logs + memory/process state into CI output, and stop tearing down staging on failure so I can SSH in. Both marked TEMP; revert once green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real-data smoke crash root cause (exit_code=3, not OOM): the workflow's
upload step (`aws s3 sync --include "*.duckdb"`) and populate step
(`ASSETS=$(ls *.duckdb)`) both matched the repo's Dockerfile.duckdb, so
the Dockerfile got shipped to R2 and pulled onto the volume. Then
serve-duckdb.sh's glob('/data/*.duckdb') picked it up, the plugin tried
to open the Dockerfile text as a DuckDB file, and datasette crashed on
startup — crash-loop, no logs after first boot.
Renaming to Dockerfile-duckdb (+ matching fly.duckdb.toml) removes the
class of bug at the root, no defensive filters needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
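The collision is easy to reproduce with Python's `fnmatch`, used here as a stand-in for the shell and `aws s3 sync --include` patterns (their matching rules are similar for this case but not identical, so treat this as an illustration only):

```python
from fnmatch import fnmatch

PATTERN = "*.duckdb"

names = ["nlrb.duckdb", "Dockerfile.duckdb", "Dockerfile-duckdb", "fly.duckdb.toml"]
matched = [name for name in names if fnmatch(name, PATTERN)]
print(matched)  # ['nlrb.duckdb', 'Dockerfile.duckdb']
```

The old Dockerfile name matches the data pattern; the renamed `Dockerfile-duckdb` does not, which is why the rename needs no extra filter.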
Belongs with the previous rename commit. It was dropped from that commit because the failed pathspec for the (now-renamed) Dockerfile.duckdb in my `git add` left fly.duckdb.toml unstaged between add and commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Diagnosed by reading the staging machine directly: datasette starts fine, but its cold /-/databases.json over the real ~4GB of duckdb files on shared-cpu-1x takes 245s (measured). The smoke test's 30s per-request timeout was tripping on the in-progress scan. The SQLite track avoids this with --inspect-file; we don't have a DuckDB equivalent yet, so the runtime scan happens once per process. Insert a warmup step between Wait-for-SSH and Smoke that patiently hits /-/databases.json (600s timeout). After it completes the catalog is cached and smoke is instant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous warmup hit "Connection refused" 0.5s in: it tried databases.json while datasette was still attaching the 13 dbs (Wait-for-SSH only checks Fly's hallpass daemon, not datasette). Mirror smoke-test.sh's pattern — phase 1 polls versions.json until uvicorn is bound (up to 180s), phase 2 drives the cold scan (up to 600s). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
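The two-phase warmup can be sketched as follows. The `fetch(path, timeout)` callable is hypothetical (it stands for an HTTP GET returning a status code), and the 10s per-probe timeout and 5s poll interval are assumptions; only the 180s and 600s budgets come from the commit message.

```python
import time


def wait_until(check, timeout, interval=5.0, clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or `timeout` seconds have passed."""
    deadline = clock() + timeout
    while True:
        try:
            if check():
                return True
        except OSError:
            pass  # e.g. connection refused while the server is still starting
        if clock() >= deadline:
            return False
        sleep(interval)


def warm_up(fetch, bind_timeout=180, scan_timeout=600, sleep=time.sleep):
    """Phase 1: wait for the port to answer. Phase 2: drive the cold scan."""
    bound = wait_until(
        lambda: fetch("/-/versions.json", 10) == 200, bind_timeout, sleep=sleep
    )
    if not bound:
        raise TimeoutError("datasette never bound its port")
    # One patient request; the long timeout covers the one-time schema scan.
    if fetch("/-/databases.json", scan_timeout) != 200:
        raise RuntimeError("cold schema scan failed")
```

Splitting the phases keeps the cheap liveness probe short while giving the expensive first `/-/databases.json` request its own, much longer budget.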
Bootstrap is green: https://warehouse-duckdb.fly.dev is live, serving real warehouse data via the DuckDB backend (verified versions.json, databases.json after ~245s cold scan, nlrb query returning 2,070,197 docket rows).
Restore:
- on:push removed (back to schedule + workflow_dispatch only)
- PROMOTE no longer set on push
- Tear-down on failure restored (`(env.PROMOTE != 'true' || failure())`)
Kept: the on-failure diagnostics step (drops machine logs + memory / process state into CI output before teardown). Without it the staging machine self-destructs and the next breakage is invisible, exactly the gap that made this bootstrap so iterative.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror the SQLite track's deploy.yml + refresh-data.yml structure: image
lifecycle in deploy-duckdb.yml (on push to main), data lifecycle in
refresh-data-duckdb.yml (on schedule). Code/serve-script iterations drop
from ~22 min to ~5 min because refresh no longer re-builds the image on
every run; data refreshes can also skip the image rebuild and just reuse
the current machine's image.
deploy-duckdb.yml (new) — directly modeled on deploy.yml:
build (flyctl deploy --build-only --push, pin digest)
-> roll onto role=current via flyctl machine update (retry for ~5 min
against registry-propagation 404s)
-> verify running /etc/build-sha matches GITHUB_SHA
-> warm up datasette (machine update restarts -> cold cache)
refresh-data-duckdb.yml — slimmed:
- Dropped the in-workflow `Build image` step.
- Dropped `Ensure app + IPs exist` (bootstrap is now manual one-time,
like the SQLite app — documented in the header).
- `Discover current machine + volume + image` now also captures
`.config.image`; staging machine boots from THAT, so the workflow
never builds.
- Added `Warm up promoted machine` step between Promote and Destroy,
since `flyctl machine update --port` restarts the machine and
drops the warm cache (same warmup script as the pre-smoke one).
- Simplified post-bootstrap guards (no more empty-OLD branch).
Bootstrap remains a one-time manual step: create the app + IPs + first
machine by hand (the warehouse-duckdb app already exists with a
role=current machine, so this is moot going forward).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-enable on-commit firing scoped to duckdb-parallel-build so we can iterate either path before merging:
- deploy-duckdb.yml: push to branch fires the ~5min image roll
- refresh-data-duckdb.yml: push to branch fires the full ~22min refresh (PROMOTE=true on push so the run leaves a live URL)
Concurrency group warehouse-duckdb-deploy serializes them, so a push that touches both paths runs deploy first (5min), then refresh (using the just-rolled image). Revert all three TEMP blocks before merging.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Install datasette from duckdb-deploy (the new merge branch = duckdb-backend ∪ no_limit_csv), which carries both the new `datasette inspect -c` flag (backend-neutral inspect for plugin-mounted dbs) and no_limit_csv's uncapped count(*).
Refresh:
- After convert, build inspect-data.json by mounting the local .duckdb files via plugin config and running `datasette inspect -c`. Counts are true totals (verified locally: docket 2,053,059 etc.).
- Upload + populate inspect-data.json alongside the .duckdb files.
- Drop both warmup steps (pre-smoke + post-promote): first /-/databases.json drops from ~245s to ~20ms with the inspect file (measured locally).
Deploy: drop the post-roll warmup step for the same reason.
serve-duckdb.sh: pass --inspect-file when /data/inspect-data.json exists, matching serve.sh on the SQLite track.
Removed scripts/warmup-datasette.sh (no longer used).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
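The conditional `--inspect-file` handling in serve-duckdb.sh might be expressed like this in Python. It is a sketch under assumptions: the helper `serve_args` is hypothetical, and the host and port values are guesses (port 8080 appears elsewhere in this PR); only `--inspect-file` and the `/data/inspect-data.json` path come from the commit message.

```python
import os


def serve_args(config_path, data_dir="/data", port=8080):
    """Build a datasette argv, adding --inspect-file only when the file exists."""
    args = [
        "datasette", "serve",
        "-c", config_path,
        "--host", "0.0.0.0",
        "--port", str(port),
    ]
    inspect_file = os.path.join(data_dir, "inspect-data.json")
    if os.path.exists(inspect_file):
        # Precomputed table counts let datasette skip the cold schema scan.
        args += ["--inspect-file", inspect_file]
    return args
```

Making the flag conditional means a machine whose volume has not yet received an inspect file still boots; it just pays the slow first scan.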
Refresh hit smoke at 121s but datasette took 121s to bind :8080 over 13 attached duckdb dbs (plugin discovery + 13 startup-hook add_database + 13 check_databases on shared-cpu-1x). smoke-test.sh polls only 120s internally, a tight miss. The fix isn't to bump the shared smoke script; it's an explicit wait-for-ready in three spots:
- before smoke (refresh): so smoke runs against a bound datasette
- after promote ports (refresh): so we don't cordon the old machine while the new one is still starting (would leak 502s to visitors)
- after rootfs verify (deploy): so the workflow doesn't end while datasette is still starting (verify uses hallpass, not datasette)
Plus bump fly.duckdb.toml health-check grace_period 120s -> 240s so Fly doesn't kill the machine mid-boot.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pulls in fgregg/datasette@duckdb-deploy a2c98f1d (and its parent c788edea on backend-abstraction): use the dialect-escaped table name in the inspect-data cache check. With this, /nlrb/allegation should show the real count from inspect-data.json (727,373) instead of the >10,000 estimate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the table.py + table.html companion to the cache check fix — SQL aliases count(*) to 'count' so the JS can read the column key generically (DuckDB returns 'count_star()', SQLite 'count(*)'). Without this the click-for-true-count link silently fails on DuckDB.
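The column-name difference can be seen on the SQLite side with the standard library alone. The sketch below uses a throwaway in-memory table; the DuckDB name `count_star()` is taken from the commit message and is not exercised here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table docket (id integer)")
conn.executemany("insert into docket values (?)", [(1,), (2,), (3,)])


def column_name(sql):
    """Return the name the engine reports for the first result column."""
    return conn.execute(sql).description[0][0]


unaliased = column_name("select count(*) from docket")
aliased = column_name("select count(*) as count from docket")
print(unaliased, aliased)  # count(*) count
```

With the alias in place, client code can read a single key, `count`, whichever backend produced the row.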
Docker layer caching meant pip install of git+...@duckdb-deploy reused
the cached layer across deploys (the URL string is identical even when
the branch tip moves) -- last two deploys silently rolled out the SAME
stale datasette code. /etc/build-sha matched but the datasette inside
was the previous one.
Resolve the current commit SHA of duckdb-deploy in the deploy workflow
and pass as --build-arg DATASETTE_REF=<sha>. The Dockerfile declares
ARG DATASETTE_REF=duckdb-deploy before the pip RUN and uses
${DATASETTE_REF} in the URL, so the layer key now changes whenever
datasette has new commits.
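One way to resolve the branch tip is to parse `git ls-remote` output, which lists one `<sha><TAB><ref>` pair per line. The sketch below is an assumption about how the workflow could do it, with hypothetical helper names; only the `DATASETTE_REF` build-arg name comes from the commit message.

```python
def resolve_ref(ls_remote_output, branch):
    """Return the commit SHA for refs/heads/<branch> from `git ls-remote` output."""
    wanted = f"refs/heads/{branch}"
    for line in ls_remote_output.splitlines():
        sha, _, ref = line.partition("\t")
        if ref.strip() == wanted:
            return sha
    raise LookupError(f"{wanted} not found")


def build_arg(sha):
    """Format the --build-arg pair so the pip layer's cache key tracks the SHA."""
    return ["--build-arg", f"DATASETTE_REF={sha}"]
```

Because the SHA is part of the `RUN` instruction's inputs, a new commit on the branch changes the layer key and forces a fresh `pip install`, while an unchanged branch still gets the cache hit.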
Warehouse overrides datasette's table.html via --template-dir, so the upstream fix on duckdb-deploy doesn't reach the rendered page — the warehouse copy still had data['rows'][0]['count(*)'] which is wrong on DuckDB (returns null since the column is named 'count_star()'). Match the upstream fix here. (The SQLite track is unaffected because SQLite returns 'count(*)' as the column name; the upstream change aliases to 'count' which is what this JS now reads.)
new_rc_cases:
datetime(created_at,'utc')||'Z' -> strftime(created_at,'%Y-%m-%dT%H:%M:%SZ')
-- created_at is TIMESTAMP in the converted .duckdb file (sqlite source
declared it that way) so strftime takes it directly.
new_lm20_filings:
Same datetime translation, but receiveDate's underlying SQLite type was
TEXT so the converter leaves it as VARCHAR. Use
strftime(try_cast(receiveDate as timestamp), ...)
to be type-agnostic. Once #9 (column type inference) lands the try_cast
becomes a no-op and we can drop it.
Also wrap in a CTE because DuckDB follows the SQL standard and doesn't
allow referencing SELECT aliases (atom_title / atom_content_html) in
WHERE; the outer SELECT against the CTE makes that legal.
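The CTE shape described above, with aliases defined in an inner query and filtered in an outer one, is sketched below. It runs on the standard-library `sqlite3` module purely so the example is self-contained; the table, its rows, and the `atom_updated` alias are hypothetical, and the date-formatting calls from the commit are left out because their argument order differs between engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table filings (id integer, receiveDate text)")
conn.executemany(
    "insert into filings values (?, ?)", [(1, "2024-01-05"), (2, None)]
)

# Aliases are defined inside the CTE; the outer WHERE then filters on them
# as ordinary column names, so no engine has to resolve a SELECT alias.
QUERY = """
with entries as (
    select id,
           'Filing ' || id as atom_title,
           receiveDate as atom_updated
    from filings
)
select atom_title from entries where atom_updated is not null
"""
rows = conn.execute(QUERY).fetchall()
print(rows)  # [('Filing 1',)]
```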
Picks up fgregg/datasette@duckdb-deploy cb5bdc65: DateFacet now suggests date facets on DuckDB (was silently never suggested due to the SQLite-only glob probe) and facet_results uses try_cast so it doesn't hard-error on dirty VARCHAR date columns.
Drop --setting allow_facet off (inherited from serve.sh). The SQLite site disables faceting because it's too expensive on 10GB+ SQLite tables; DuckDB is columnar/vectorized and faceting is cheap (group-by over 11M rows ~11ms in profiling), so enable it and rely on facet_time_limit_ms 500 as the per-facet cost guard. This exposes the DateFacet dialect fix (datasette duckdb-deploy cb5bdc65) — date columns now suggest + compute date facets. A feature the SQLite deployment can't afford.
Picks up datasette duckdb-deploy 1cdfd3f6 — table schema / CREATE TABLE display now works for DuckDB-backed tables via duckdb_tables().sql.
#13) datasette duckdb-deploy: DuckDB keyless immutable tables now keyset-paginate (was offset) and mint no rowid row-page permalinks.
Every push fired both deploy and a full refresh; the refresh was immediately cancelled (code changes don't need a data rebuild), producing a cancelled run per push — pure CI noise. Drop the push trigger from refresh; deploy-duckdb.yml still fires on push for fast iteration. Dispatch refresh manually when data actually changes.
This reverts commit 772f306.
Parity with serve.sh (SQLite). The datasette-duckdb backend now supports cross-database queries: at startup the plugin re-backs _memory with DuckDB and ATTACHes every mounted .duckdb (READ_ONLY), so /_memory can join across the databases -- e.g. union_names_crosswalk against nlrb/f7/lm*. DuckDB has no SQLITE_LIMIT_ATTACHED cap, so all ~14 databases are cross-queryable (the SQLite site is limited to the first 10). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
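To make the startup behaviour concrete, the sketch below generates the kind of read-only `ATTACH` statements the commit message describes, one per mounted file. It only builds SQL strings; how the plugin actually issues them is not shown in this PR, and `attach_statements` is a hypothetical helper.

```python
import os


def attach_statements(paths):
    """Return one read-only DuckDB ATTACH statement per .duckdb file."""
    statements = []
    for path in sorted(paths):
        name = os.path.splitext(os.path.basename(path))[0]
        quoted_path = path.replace("'", "''")   # escape for a SQL string literal
        quoted_name = name.replace('"', '""')   # escape for a quoted identifier
        statements.append(
            f"ATTACH '{quoted_path}' AS \"{quoted_name}\" (READ_ONLY)"
        )
    return statements
```

For example, `attach_statements(["/data/nlrb.duckdb", "/data/f7.duckdb"])` yields two statements, after which a query could reference tables as `nlrb.<table>` and `f7.<table>` in the same connection.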
Ships the same source data as the SQLite build, converted to DuckDB and deployed to a separate Fly app (`warehouse-duckdb`) with its own machine, volume, and hostname. The SQLite production app (`warehouse` / labordata.bunkum.us) is never touched by this workflow.

What's here
- `.github/workflows/refresh-data-duckdb.yml`: `make` → convert each `.db` to `.duckdb` via `python -m datasette_duckdb.convert` → upload to R2 under the `duckdb/` prefix → blue/green deploy. Self-bootstrapping: builds its own image and creates the first machine when none exists; reuses `pull-from-r2-direct.sh` and `smoke-test.sh` unchanged (the `.duckdb` files attach under the same names as `make`'s 13 targets). Runs daily at 07:30 (30 min after the SQLite refresh).
- `Dockerfile.duckdb` / `scripts/serve-duckdb.sh`: data-less image installing datasette + `datasette-duckdb` from the `duckdb-backend` branch of the fork; attaches `/data/*.duckdb` immutable, no `--crossdb`, no inspect-file.
- `datasette.duckdb.yml`: config without the sqlite-only canned queries.
- `fly.duckdb.toml`: separate app, own machine + volume (no `[mounts]`, matching `fly.toml`'s reasoning).

Prerequisites
- `git+https://github.com/fgregg/datasette@duckdb-backend`: that branch is pushed and includes the converter fix to skip FTS virtual + shadow tables (verified on `f7.db`: 536,851 rows convert cleanly).
- `FLY_API_TOKEN` and `R2_*` secrets.

First run
Trigger Refresh data (DuckDB) via `workflow_dispatch`. The bootstrap run creates the `warehouse-duckdb` app + IPs, builds + converts + ships, and promotes.

Known gaps (non-blocking; tracked)
- `union_names` canned queries aren't ported (sqlite-specific).
- `inspect-data.json` (`datasette inspect` is sqlite-shaped).

🤖 Generated with Claude Code