Skip to content

Make fork compatible with DuckDB v1.5.3#21

Merged
fuziontech merged 24 commits into
v1.5.3-basefrom
posthog/v1.5.3
May 27, 2026
Merged

Make fork compatible with DuckDB v1.5.3#21
fuziontech merged 24 commits into
v1.5.3-basefrom
posthog/v1.5.3

Conversation

@fuziontech

Copy link
Copy Markdown
Member

Rebases the PostHog DuckLake patches onto upstream's v1.5-variegata line, which pins DuckDB v1.5.3 (up from v1.5.2).

What's here

Replays our patches onto the v1.5.3 base:

Reconciliation notes

Verification

  • Builds clean against DuckDB v1.5.3 (make GEN=ninja, vcpkg toolchain for the roaring dep).
  • Basic DuckLake smoke test passes (attach / create / insert / query / snapshots()); PRAGMA version reports v1.5.3 (Variegata).
  • Not yet exercised locally: the Postgres passthrough tests (postgres_metadata_passthrough.test) need a Postgres backend — relying on CI here.

🤖 Generated with Claude Code

bill-ph and others added 8 commits May 26, 2026 14:23
Speed up CI via two independent levers:

1. MainDistributionPipeline.yml: route the linux_amd64 / linux_arm64
   build matrix entries to 16-core Depot managed runners
   (depot-ubuntu-24.04-16 / depot-ubuntu-24.04-arm-16) via the upstream
   workflow's runners JSON input. Cuts the build from ~25-30min to
   ~5-10min, including cold-cache cases.

   Requires PostHog/ducklake to be added to the Depot GitHub App
   installation; until then the build jobs will queue-time-out at 10m.

2. Five per-PR test workflows (Catalogs, ConfigTests, DeletionVectors,
   MinIO, NoInline): bump hendrikmuhs/ccache-action max-size from the
   500MB default to 1.5G. DuckDB's full release build produces 1-2 GB
   of object files, so the default forces aggressive LRU eviction and
   keeps effective hit rate at 30-50%. 1.5G fits most of the build,
   pushing warm-cache hit rates into the 80-90% range.

   5 caches x 1.5G = 7.5 GB, leaving headroom in the 10 GB GitHub
   Actions cache budget per repo. Cold-cache cases (DuckDB submodule
   bumps) will still take ~25-30min but are infrequent.

Debug.yml is now nightly-only after #9 and is left alone.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* Route Postgres metadata reads through passthrough

* Fix Postgres metadata passthrough regressions

* Keep Postgres thread cache setting test-scoped

* Fix Postgres test pool setting init order

* Update postgres scanner for pool defaults
Rebasing the Postgres metadata passthrough patch (#5) onto DuckDB
v1.5.3 replayed an old DuckLakeMetadataManager::Query(string&) body
(returning transaction.Query(query)) as a second definition alongside
upstream's new implementation, which uses SubstituteCatalogPlaceholders
+ transaction.ExecuteRaw. Drop the stale duplicate; upstream's version
is correct for the v1.5.3 metadata-manager architecture.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The v1.5.3 workflows inherited from upstream use tag/branch refs
(actions/checkout@v4, ccache-action@main, etc.), which the PostHog org
ruleset rejects ("all actions must be pinned to a full-length commit
SHA"). Pin every action to its commit SHA, reusing the SHAs vetted in
#17 on main where applicable. No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@fuziontech fuziontech changed the base branch from v1.5-variegata to v1.5.3-base May 26, 2026 21:48
PostgresMetadataManager::ExecuteQuery builds a `CALL postgres_query(...)`
passthrough statement and ran it via transaction.Query(). In DuckDB
v1.5.3's metadata architecture transaction.Query() delegates back to
metadata_manager->Query(), i.e. PostgresMetadataManager::Query(), which
calls ExecuteQuery() again and wraps the statement in yet another
CALL postgres_query(...) — recursing without bound and hanging the
"Test Postgres" suite (SQLite, which has no passthrough, was unaffected).

Run the fully-formed passthrough call via transaction.ExecuteRaw() so it
goes straight to the metadata connection without re-entering the manager.
This is the load-bearing change that #19 made (as RawQuery) and that was
lost when #19 was skipped during the v1.5.3 rebase.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fuziontech and others added 15 commits May 27, 2026 07:29
The Postgres metadata passthrough (#5/#11) depends on a read/write/raw
routing split that #19 introduced and that was dropped during the v1.5.3
rebase. v1.5.3 had collapsed all metadata access onto transaction.Query,
which for a Postgres backend wrongly routed everything through the
postgres_query passthrough — breaking DDL (uuid()), the metadata-exists
probe, local utility queries (ATTACH, duckdb_secrets), and recursing on
the passthrough CALL itself.

Restore the split, adapted to v1.5.3's manager-side substitution and
ExecuteRaw executor:

- Transaction-level Execute/SnapshotQuery/CurrentQuery/RawQuery run RAW on
  the metadata connection (DuckDB executes against the attached Postgres
  catalog, handling uuid()/LIST/STRUCT and queries that embed
  postgres_query CTEs).
- Metadata-manager-level Execute/SnapshotQuery/CurrentQuery do the
  postgres_execute/postgres_query passthrough, used (unqualified) only by
  the snapshot-pinned file/stats reads that benefit from server-side
  execution (#5/#11). Simple committed reads (e.g. global stats, inlined
  file deletions) also use the manager passthrough.
- Call sites relabeled by intent: writes -> Execute, snapshot reads ->
  SnapshotQuery, current-state reads -> CurrentQuery, local utilities ->
  RawQuery. Notably ReadInlinedFileDeletions and other inlined-deletion
  reads were running through Execute (postgres_execute returns no rows),
  which double-applied inlined deletions.
- Port the check_metadata_query_intent.py guard and wire it into CI.

Verified locally against DuckDB, SQLite and Postgres backends
(postgres_query/postgres_execute via Docker): all three suites green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DuckDB v1.5.3 rejects multiple streaming scans, or a streaming scan
alongside an insert/CTAS, in one query. The Postgres metadata reads that
embed postgres_query() run raw, so an un-materialized postgres_query
streams:

- GetLatestSnapshotQuery() is evaluated during write statements, so its
  streaming postgres_query scan collided with the insert ("Failed to
  query most recent snapshot ... Multiple streaming scans"). Wrap it in a
  MATERIALIZED CTE.
- The file-column-stats CTE used NOT MATERIALIZED for single-reference
  columns; multiple such columns produced multiple streaming
  postgres_query scans. Always mark it MATERIALIZED.

Negligible cost (small metadata results). Only surfaces with a streaming
postgres_scanner build (CI); a materializing local build masked it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The MATERIALIZED-CTE attempt didn't help: an embedded
"SELECT * FROM postgres_query(...)" is a streaming scan, and GetSnapshot()
runs during write statements, so it tripped DuckDB v1.5.3's "streaming
scan + insert in the same query" restriction. Wrapping it in a
MATERIALIZED CTE just turned it into "streaming scan + CTAS".

Use the metadata-manager passthrough instead, which executes
CALL postgres_query(...) as a separate statement returning a materialized
result (the same path GetSnapshot(at_clause) already uses and that global
stats use successfully in CI):

- GetSnapshot(): call the manager-level CurrentQuery() (passthrough)
  instead of transaction.CurrentQuery() (raw embedded scan).
- Drop the Postgres GetLatestSnapshotQuery() override so the plain base
  SELECT is used; the CALL passthrough wraps it.
- Revert the file-column-stats CTE back to its original MATERIALIZED hint.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The PostHog org ruleset rejects any unpinned action ref, enforced
transitively into reusable workflows. duckdb/extension-ci-tools uses
unpinned internal actions (actions/checkout@v4, ...), so every run
startup-failed at "Prepare all required actions".

Point the distribution call at the PostHog extension-ci-tools fork,
whose main (d3e0ee09) pins every internal action to a SHA and exposes
the `runners` input, and pull ci-tools makefiles from the same fork/ref
via override_ci_tools_repository.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The depot arm64 runner was unresponsive (job lost at the 10-minute
runner timeout). This repo is public, so use the free GitHub-hosted
ubuntu-24.04-arm runner for arm64; keep amd64 on the depot runner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DuckLake metadata reads (e.g. GetCatalogForSnapshot) scan several attached
Postgres tables in one statement using DuckDB LIST/STRUCT syntax, so they
must run on the metadata connection rather than server-side. DuckDB v1.5.3
rejects multiple streaming postgres_scanner scans (and streaming + insert)
in a single query, which these multi-table reads triggered.

postgres_scanner only streams a scan when max_threads > 1, where
max_threads = pages / pg_pages_per_task. After attaching the Postgres
metadata catalog, set pg_pages_per_task to a very large value so
max_threads is pinned to 1 and these reads materialize instead of
streaming. Metadata results are tiny, so there is no throughput cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#5 bumped postgres_scanner forward to dd71d196 (the "Connection pool"
rework, which adds the database-connector submodule) on the DuckDB 1.5.2
line. That newer scanner is incompatible with DuckDB v1.5.3's restriction
on streaming postgres scans: DuckLake's multi-table metadata reads
(GetCatalogForSnapshot, ...) failed with "Multiple streaming scans or
streaming scans + CTAS / insert in the same query are not currently
supported".

Pin postgres_scanner to c0e9256 -- the exact version upstream DuckLake's
v1.5-variegata (DuckDB v1.5.3) is tested against -- and drop the
ineffective pg_pages_per_task workaround. Revisit the connection-pool
scanner bump separately against a v1.5.3-compatible scanner if needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of the remaining Postgres CI failures: CI built an older
postgres_scanner that streams multi-table scans, which DuckDB v1.5.3
rejects ("Multiple streaming scans or streaming scans + CTAS / insert in
the same query are not currently supported") for DuckLake's metadata
reads (GetCatalogForSnapshot scans 4 metadata tables in one query).

6b2b12c is the postgres_scanner the DuckDB v1.5.3 extension release ships;
it materializes those scans instead of streaming. It is 29 commits ahead
of #5's dd71d196 (so it keeps the connection-pool work) and 79 ahead of
c0e9256. Verified: the full Postgres test suite passes locally against
this scanner (428 test cases, 0 failures), alongside the green DuckDB and
SQLite suites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of the persistent Postgres CI failures: the test job set
ENABLE_POSTGRES_SCANNER=ON and statically built postgres_scanner from
source. That source build streams multi-table postgres scans, which
DuckDB v1.5.3 rejects ("Multiple streaming scans or streaming scans +
CTAS / insert in the same query are not currently supported") for
DuckLake's metadata reads (GetCatalogForSnapshot scans 4 tables at once).
The officially released postgres_scanner materializes those scans.

Stop building postgres_scanner from source for the test job; install the
released extension (INSTALL postgres FROM core) instead, exactly as a real
deployment would load it. The full Postgres suite passes locally with the
prebuilt extension (428 test cases, 0 failures). SQLite/Quack still build
from source as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GetCatalogForSnapshot runs six snapshot-pinned reads via the raw
transaction-level SnapshotQuery, which scans the attached metadata
catalog directly. For the quack backend that produces multiple
streaming quack_query scans in one statement, and quack's optimizer
rejects that ("Multiple streaming scans ... not currently supported"),
failing 350/394 tests in the Test Quack step.

Add a SnapshotCatalogQuery seam on the metadata manager. The base
implementation keeps the existing raw behavior (correct for the DuckDB
backend, and for Postgres where postgres_scanner materializes the
multi-table scan). QuackMetadataManager overrides it to route through
CALL quack_query_by_name, which executes the DuckDB SQL server-side in
a real DuckDB and returns a materialized result.

Verified locally on the DuckDB, Postgres, and SQLite backends; quack is
verified via CI (cannot build the quack variant locally). Intent guard
still passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first CI run after the GetCatalogForSnapshot fix moved the failure
forward: passes went 44 -> 113, the table-information error disappeared,
and the next raw multi-table reads surfaced the same quack streaming-scan
error (500 data-file-list, 220 extended-file-list, 16 column lookup, 16
orphan-cleanup).

Generalize the seam introduced for GetCatalogForSnapshot: add a
CurrentCatalogQuery counterpart to SnapshotCatalogQuery (both default to
the raw transaction read; quack overrides both to its quack_query_by_name
passthrough) and route the remaining raw multi-table reads through them:
GetFilesForTable, the extended file list, IsColumnCreatedWithTable,
GetOrphanFilesForCleanup, and GetTableSizes.

Single-table raw reads are left as-is: a single streaming quack scan is
allowed by quack's optimizer. The default keeps DuckDB/Postgres/SQLite on
the raw path (verified: stats, compaction, alter, remove_orphans,
add_files pass on all three backends). Intent guard still passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI no longer builds postgres_scanner from source (the Postgres test job
installs the released extension via `INSTALL postgres FROM core`), so the
GIT_TAG pin here is only exercised by local/manual builds that set
ENABLE_POSTGRES_SCANNER. Document that and why the pin matters, so the
file doesn't read as the active CI scanner version.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@fuziontech fuziontech merged commit eae3308 into v1.5.3-base May 27, 2026
34 checks passed
@fuziontech fuziontech deleted the posthog/v1.5.3 branch May 27, 2026 21:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants