Make fork compatible with DuckDB v1.5.3#21
Merged
Conversation
Speed up CI via two independent levers: 1. MainDistributionPipeline.yml: route the linux_amd64 / linux_arm64 build matrix entries to 16-core Depot managed runners (depot-ubuntu-24.04-16 / depot-ubuntu-24.04-arm-16) via the upstream workflow's runners JSON input. Cuts the build from ~25-30min to ~5-10min, including cold-cache cases. Requires PostHog/ducklake to be added to the Depot GitHub App installation; until then the build jobs will queue-time-out at 10m. 2. Five per-PR test workflows (Catalogs, ConfigTests, DeletionVectors, MinIO, NoInline): bump hendrikmuhs/ccache-action max-size from the 500MB default to 1.5G. DuckDB's full release build produces 1-2 GB of object files, so the default forces aggressive LRU eviction and keeps effective hit rate at 30-50%. 1.5G fits most of the build, pushing warm-cache hit rates into the 80-90% range. 5 caches x 1.5G = 7.5 GB, leaving headroom in the 10 GB GitHub Actions cache budget per repo. Cold-cache cases (DuckDB submodule bumps) will still take ~25-30min but are infrequent. Debug.yml is now nightly-only after #9 and is left alone. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* Route Postgres metadata reads through passthrough * Fix Postgres metadata passthrough regressions * Keep Postgres thread cache setting test-scoped * Fix Postgres test pool setting init order * Update postgres scanner for pool defaults
Rebasing the Postgres metadata passthrough patch (#5) onto DuckDB v1.5.3 replayed an old DuckLakeMetadataManager::Query(string&) body (returning transaction.Query(query)) as a second definition alongside upstream's new implementation, which uses SubstituteCatalogPlaceholders + transaction.ExecuteRaw. Drop the stale duplicate; upstream's version is correct for the v1.5.3 metadata-manager architecture. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The v1.5.3 workflows inherited from upstream use tag/branch refs
(actions/checkout@v4, ccache-action@main, etc.), which the PostHog org
ruleset rejects ("all actions must be pinned to a full-length commit
SHA"). Pin every action to its commit SHA, reusing the SHAs vetted in
#17 on main where applicable. No behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PostgresMetadataManager::ExecuteQuery builds a `CALL postgres_query(...)` passthrough statement and ran it via transaction.Query(). In DuckDB v1.5.3's metadata architecture transaction.Query() delegates back to metadata_manager->Query(), i.e. PostgresMetadataManager::Query(), which calls ExecuteQuery() again and wraps the statement in yet another CALL postgres_query(...) — recursing without bound and hanging the "Test Postgres" suite (SQLite, which has no passthrough, was unaffected). Run the fully-formed passthrough call via transaction.ExecuteRaw() so it goes straight to the metadata connection without re-entering the manager. This is the load-bearing change that #19 made (as RawQuery) and that was lost when #19 was skipped during the v1.5.3 rebase. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jghoman
approved these changes
May 27, 2026
The Postgres metadata passthrough (#5/#11) depends on a read/write/raw routing split that #19 introduced and that was dropped during the v1.5.3 rebase. v1.5.3 had collapsed all metadata access onto transaction.Query, which for a Postgres backend wrongly routed everything through the postgres_query passthrough — breaking DDL (uuid()), the metadata-exists probe, local utility queries (ATTACH, duckdb_secrets), and recursing on the passthrough CALL itself. Restore the split, adapted to v1.5.3's manager-side substitution and ExecuteRaw executor: - Transaction-level Execute/SnapshotQuery/CurrentQuery/RawQuery run RAW on the metadata connection (DuckDB executes against the attached Postgres catalog, handling uuid()/LIST/STRUCT and queries that embed postgres_query CTEs). - Metadata-manager-level Execute/SnapshotQuery/CurrentQuery do the postgres_execute/postgres_query passthrough, used (unqualified) only by the snapshot-pinned file/stats reads that benefit from server-side execution (#5/#11). Simple committed reads (e.g. global stats, inlined file deletions) also use the manager passthrough. - Call sites relabeled by intent: writes -> Execute, snapshot reads -> SnapshotQuery, current-state reads -> CurrentQuery, local utilities -> RawQuery. Notably ReadInlinedFileDeletions and other inlined-deletion reads were running through Execute (postgres_execute returns no rows), which double-applied inlined deletions. - Port the check_metadata_query_intent.py guard and wire it into CI. Verified locally against DuckDB, SQLite and Postgres backends (postgres_query/postgres_execute via Docker): all three suites green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DuckDB v1.5.3 rejects multiple streaming scans, or a streaming scan
alongside an insert/CTAS, in one query. The Postgres metadata reads that
embed postgres_query() run raw, so an un-materialized postgres_query
streams:
- GetLatestSnapshotQuery() is evaluated during write statements, so its
streaming postgres_query scan collided with the insert ("Failed to
query most recent snapshot ... Multiple streaming scans"). Wrap it in a
MATERIALIZED CTE.
- The file-column-stats CTE used NOT MATERIALIZED for single-reference
columns; multiple such columns produced multiple streaming
postgres_query scans. Always mark it MATERIALIZED.
Negligible cost (small metadata results). Only surfaces with a streaming
postgres_scanner build (CI); a materializing local build masked it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The MATERIALIZED-CTE attempt didn't help: an embedded "SELECT * FROM postgres_query(...)" is a streaming scan, and GetSnapshot() runs during write statements, so it tripped DuckDB v1.5.3's "streaming scan + insert in the same query" restriction. Wrapping it in a MATERIALIZED CTE just turned it into "streaming scan + CTAS". Use the metadata-manager passthrough instead, which executes CALL postgres_query(...) as a separate statement returning a materialized result (the same path GetSnapshot(at_clause) already uses and that global stats use successfully in CI): - GetSnapshot(): call the manager-level CurrentQuery() (passthrough) instead of transaction.CurrentQuery() (raw embedded scan). - Drop the Postgres GetLatestSnapshotQuery() override so the plain base SELECT is used; the CALL passthrough wraps it. - Revert the file-column-stats CTE back to its original MATERIALIZED hint. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The PostHog org ruleset rejects any unpinned action ref, enforced transitively into reusable workflows. duckdb/extension-ci-tools uses unpinned internal actions (actions/checkout@v4, ...), so every run startup-failed at "Prepare all required actions". Point the distribution call at the PostHog extension-ci-tools fork, whose main (d3e0ee09) pins every internal action to a SHA and exposes the `runners` input, and pull ci-tools makefiles from the same fork/ref via override_ci_tools_repository. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The depot arm64 runner was unresponsive (job lost at the 10-minute runner timeout). This repo is public, so use the free GitHub-hosted ubuntu-24.04-arm runner for arm64; keep amd64 on the depot runner. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DuckLake metadata reads (e.g. GetCatalogForSnapshot) scan several attached Postgres tables in one statement using DuckDB LIST/STRUCT syntax, so they must run on the metadata connection rather than server-side. DuckDB v1.5.3 rejects multiple streaming postgres_scanner scans (and streaming + insert) in a single query, which these multi-table reads triggered. postgres_scanner only streams a scan when max_threads > 1, where max_threads = pages / pg_pages_per_task. After attaching the Postgres metadata catalog, set pg_pages_per_task to a very large value so max_threads is pinned to 1 and these reads materialize instead of streaming. Metadata results are tiny, so there is no throughput cost. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#5 bumped postgres_scanner forward to dd71d196 (the "Connection pool" rework, which adds the database-connector submodule) on the DuckDB 1.5.2 line. That newer scanner is incompatible with DuckDB v1.5.3's restriction on streaming postgres scans: DuckLake's multi-table metadata reads (GetCatalogForSnapshot, ...) failed with "Multiple streaming scans or streaming scans + CTAS / insert in the same query are not currently supported". Pin postgres_scanner to c0e9256 -- the exact version upstream DuckLake's v1.5-variegata (DuckDB v1.5.3) is tested against -- and drop the ineffective pg_pages_per_task workaround. Revisit the connection-pool scanner bump separately against a v1.5.3-compatible scanner if needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of the remaining Postgres CI failures: CI built an older
postgres_scanner that streams multi-table scans, which DuckDB v1.5.3
rejects ("Multiple streaming scans or streaming scans + CTAS / insert in
the same query are not currently supported") for DuckLake's metadata
reads (GetCatalogForSnapshot scans 4 metadata tables in one query).
6b2b12c is the postgres_scanner the DuckDB v1.5.3 extension release ships;
it materializes those scans instead of streaming. It is 29 commits ahead
of #5's dd71d196 (so it keeps the connection-pool work) and 79 ahead of
c0e9256. Verified: the full Postgres test suite passes locally against
this scanner (428 test cases, 0 failures), alongside the green DuckDB and
SQLite suites.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of the persistent Postgres CI failures: the test job set
ENABLE_POSTGRES_SCANNER=ON and statically built postgres_scanner from
source. That source build streams multi-table postgres scans, which
DuckDB v1.5.3 rejects ("Multiple streaming scans or streaming scans +
CTAS / insert in the same query are not currently supported") for
DuckLake's metadata reads (GetCatalogForSnapshot scans 4 tables at once).
The officially released postgres_scanner materializes those scans.
Stop building postgres_scanner from source for the test job; install the
released extension (INSTALL postgres FROM core) instead, exactly as a real
deployment would load it. The full Postgres suite passes locally with the
prebuilt extension (428 test cases, 0 failures). SQLite/Quack still build
from source as before.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GetCatalogForSnapshot runs six snapshot-pinned reads via the raw
transaction-level SnapshotQuery, which scans the attached metadata
catalog directly. For the quack backend that produces multiple
streaming quack_query scans in one statement, and quack's optimizer
rejects that ("Multiple streaming scans ... not currently supported"),
failing 350/394 tests in the Test Quack step.
Add a SnapshotCatalogQuery seam on the metadata manager. The base
implementation keeps the existing raw behavior (correct for the DuckDB
backend, and for Postgres where postgres_scanner materializes the
multi-table scan). QuackMetadataManager overrides it to route through
CALL quack_query_by_name, which executes the DuckDB SQL server-side in
a real DuckDB and returns a materialized result.
Verified locally on the DuckDB, Postgres, and SQLite backends; quack is
verified via CI (cannot build the quack variant locally). Intent guard
still passes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first CI run after the GetCatalogForSnapshot fix moved the failure forward: passes went 44 -> 113, the table-information error disappeared, and the next raw multi-table reads surfaced the same quack streaming-scan error (500 data-file-list, 220 extended-file-list, 16 column lookup, 16 orphan-cleanup). Generalize the seam introduced for GetCatalogForSnapshot: add a CurrentCatalogQuery counterpart to SnapshotCatalogQuery (both default to the raw transaction read; quack overrides both to its quack_query_by_name passthrough) and route the remaining raw multi-table reads through them: GetFilesForTable, the extended file list, IsColumnCreatedWithTable, GetOrphanFilesForCleanup, and GetTableSizes. Single-table raw reads are left as-is: a single streaming quack scan is allowed by quack's optimizer. The default keeps DuckDB/Postgres/SQLite on the raw path (verified: stats, compaction, alter, remove_orphans, add_files pass on all three backends). Intent guard still passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI no longer builds postgres_scanner from source (the Postgres test job installs the released extension via `INSTALL postgres FROM core`), so the GIT_TAG pin here is only exercised by local/manual builds that set ENABLE_POSTGRES_SCANNER. Document that and why the pin matters, so the file doesn't read as the active CI scanner version. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Rebases the PostHog DuckLake patches onto upstream's
v1.5-variegataline, which pins DuckDB v1.5.3 (up from v1.5.2).What's here
Replays our patches onto the v1.5.3 base:
GetGlobalTableStats(snapshot, table_id)signatureReconciliation notes
DuckLakeMetadataManager::Query(string&)that the rebase replayed against upstream's newSubstituteCatalogPlaceholders/ExecuteRawdesign.ExecuteRaw, moved placeholder substitution into the manager), so Clarify DuckLake metadata query intent #19 needs re-deriving against the new architecture rather than merging — tracked as a separate follow-up.Verification
make GEN=ninja, vcpkg toolchain for theroaringdep).snapshots());PRAGMA versionreports v1.5.3 (Variegata).postgres_metadata_passthrough.test) need a Postgres backend — relying on CI here.🤖 Generated with Claude Code