Skip to content

fix(server): use explicit STS credentials for iceberg secret#562

Merged
bill-ph merged 4 commits into
mainfrom
claude/wizardly-hawking-cbad13
May 15, 2026
Merged

fix(server): use explicit STS credentials for iceberg secret#562
bill-ph merged 4 commits into
mainfrom
claude/wizardly-hawking-cbad13

Conversation

@bill-ph
Copy link
Copy Markdown
Collaborator

@bill-ph bill-ph commented May 14, 2026

Summary

PR #560 wired the iceberg S3 secret to PROVIDER credential_chain. DuckDB's built-in credential chain does not support EKS Pod Identity (the comment on fetchAWSSDKCredentials calls this out explicitly), and on the PostHog multitenant deploy the worker pod has no AWS identity of its own at all — the duckgres-worker SA has no IRSA annotation, no Pod Identity association, no AWS env vars. So `CREATE SECRET` fails at validation time:

Invalid Configuration Error: Secret Validation Failure: during `create` using the following: Credential Chain: 'config'

DuckLake doesn't hit this because the control plane runs `STS:AssumeRole` on the per-tenant IAM role and ships the resulting short-lived credentials inline in the worker activation payload. The DuckLake S3 secret is then built with `PROVIDER config` and those explicit credentials (see `buildConfigSecret`).

The per-tenant IAM role rendered by the duckling Crossplane Composition already grants both `s3:` (data bucket) and `s3tables:` (iceberg table bucket) when `spec.iceberg.enabled=true`. The credentials in `cfg.DuckLake.S3AccessKey/SecretKey/SessionToken` are already authorized for S3 Tables — they just need to be threaded into the iceberg secret too.

What changed

  • `server/iceberg/migration.go` — `BuildIcebergSecretStmt` now requires `keyID`/`secret` (with optional `sessionToken`) and always emits `PROVIDER config` with the explicit credentials inlined. There is no fallback to `PROVIDER credential_chain` — the activator-managed STS-minted credentials are the only supported auth model.
  • `server/server.go` — `AttachIcebergCatalog` accepts the three credential strings and validates them up-front. If `ic.Enabled` but `keyID`/`secret` are empty, returns: `iceberg catalog enabled (table_bucket=...) but no AWS credentials in activation payload — control plane STS broker did not populate DuckLake S3 credentials`. All three call sites (initial connect, shared warm activation, passthrough) pass `cfg.DuckLake.S3AccessKey/SecretKey/SessionToken` — same credentials DuckLake uses on the same per-tenant role.
  • `server/iceberg/migration_test.go` — string-level coverage of the new SQL shape: with and without session token, region default, quote-escaping defense against attacker-controlled credential values.
  • `server/iceberg/integration_test.go` — DuckDB-level regression test (added pre-PR in `8213934` as a regression net for fix(server): use TYPE S3 credential_chain for iceberg s3_tables auth #560's bug class). Updated to drop the obsolete `credential chain fallback` subtest — that subtest exercised an empty-credential SQL form that no longer exists, and its assertion was so permissive it was logging-and-passing instead of testing anything meaningful.

Why no credential_chain fallback?

Earlier iterations of this PR included a fallback: when `keyID` was empty, emit a `PROVIDER credential_chain` secret as before. It was removed because:

  1. The fallback was never validated for iceberg — no test, no caller, no doc justified its existence.
  2. In production it would mask a real config bug (e.g., STS broker failure) behind a silent fallback that then fails opaquely at attach time anyway. Better to fail at the validation boundary with a message naming the affected tenant.
  3. The whole multitenant architecture is "control plane mints credentials, worker has no AWS identity." Falling back to "let DuckDB walk its chain on the worker" contradicts that architecture by design.

`buildCredentialChainSecret` (for DuckLake, in `server.go`) is a separate code path used by the Ansible single-tenant EC2 deploy (`posthog-cloud-infra/ansible/roles/duckgres/templates/duckgres.yaml.j2:36`) and is not affected by this PR. Iceberg never had a live deployment using credential_chain.

Known follow-up (not in this PR)

`RefreshS3Secret` (server.go:1906) rotates the DuckLake secret when STS credentials expire (~1h), but no equivalent yet rotates the iceberg secret. After this PR lands, a long-lived hot-idle worker will start 403'ing iceberg queries after the first STS rotation. Fix is a sibling `RefreshIcebergSecret` invoked from `duckdbservice/activation.go`'s refresh path. Scoped out here for an immediate follow-up PR.

Other follow-ups identified during this review:

  • No E2E coverage exists for iceberg. Five PRs shipped without any test that exercises session-activation-to-S3-Tables-query. Worth adding an `iceberg-test` skill (analog of `ducklake-test`) and an opt-in `TestIcebergE2E` integration test gated on `ICEBERG_E2E_TEST_BUCKET_ARN`.
  • `aws_sdk` S3 provider is dead code in all deployed configurations (zero refs in `charts` or `posthog-cloud-infra`). Safe to delete in a cleanup PR.
  • Slim `/warehouse/status` endpoint doesn't expose iceberg state — orthogonal QoL fix.

Test plan

  • `go test ./server/iceberg/...` — 6 string-comparison tests + 2 DuckDB-execution subtests all pass
  • `go test ./server/... ./controlplane/... ./duckdbservice/...` — full regression sweep, no failures
  • Deploy to dev, reconnect to `test-org-smoke-1778167994`, run `SHOW DATABASES` and confirm `iceberg` appears alongside `ducklake`, then `USE iceberg.main; CREATE TABLE smoke (id INTEGER); INSERT INTO smoke VALUES (1); SELECT * FROM smoke;`
  • Verify on dev worker logs that the emitted SQL is `PROVIDER config, KEY_ID 'ASIA...'`, not anything credential_chain shaped

🤖 Generated with Claude Code

bill-ph and others added 4 commits May 14, 2026 14:28
PR #560 wired the iceberg S3 secret to PROVIDER credential_chain, but
DuckDB's built-in credential chain does not support EKS Pod Identity (the
comment on fetchAWSSDKCredentials, server.go:2060, calls this out
explicitly). On the PostHog multitenant deploy the worker pod has no AWS
identity of its own at all — the duckgres-worker SA has no IRSA
annotation, no Pod Identity association, no AWS env vars. So CREATE
SECRET fails at validation time with:

    Invalid Configuration Error: Secret Validation Failure: during
    `create` using the following: Credential Chain: 'config'

DuckLake doesn't hit this because the control plane runs STS:AssumeRole
on the per-tenant IAM role and ships the resulting short-lived
credentials inline in the worker activation payload. The DuckLake S3
secret is then built with PROVIDER config and those explicit credentials
(see buildConfigSecret at server.go:1979).

The per-tenant IAM role rendered by the duckling Crossplane Composition
already grants both s3:* (on the data bucket) and s3tables:* (on the
iceberg table bucket) when spec.iceberg.enabled=true. The credentials in
cfg.DuckLake.S3AccessKey/SecretKey/SessionToken are therefore already
authorized for S3 Tables — they just need to be threaded into the
iceberg secret too.

Changes:
- server/iceberg/migration.go: BuildIcebergSecretStmt now accepts
  keyID/secret/sessionToken. When keyID is non-empty, emit PROVIDER
  config with the explicit credentials. When empty, fall back to
  PROVIDER credential_chain (preserves the standalone/host-AWS-identity
  path where DuckDB's chain can find env or IMDS credentials).
- server/server.go: AttachIcebergCatalog takes the same three strings
  and forwards them. All three call sites (initial connect, shared warm
  activation, passthrough) pass cfg.DuckLake.S3AccessKey/SecretKey/
  SessionToken — the same credentials DuckLake uses.
- server/iceberg/migration_test.go: cover the new SQL shape (with and
  without session token), the credential_chain fallback, region default,
  and quote-escaping on attacker-controlled credential values.

Known follow-up not addressed here: RefreshS3Secret (server.go:1906)
rotates the DuckLake secret when STS credentials expire (~1h), but the
iceberg secret is not refreshed alongside. Iceberg queries will start
failing with 403 after the first STS rotation on a long-lived hot-idle
worker. Fixing this needs a sibling RefreshIcebergSecret invoked from
duckdbservice/activation.go's refresh path; scoped out of this PR to
keep the diff focused on the immediate "iceberg attach fails on every
activation" bug.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds an in-memory DuckDB integration test that loads the iceberg
extension and exercises the actual CREATE SECRET DDL emitted by
BuildIcebergSecretStmt. No AWS network access needed — DuckDB just
parses and validates the secret.

This is the regression coverage that PR #560 was missing: the bug there
was a CREATE SECRET binder/validation error at the DuckDB layer, not
something the existing string-comparison unit tests could catch. The
new test would have failed on #560's old SQL form and would fail on any
future revert.

Explicit-credential cases (PROVIDER config with inlined KEY_ID/SECRET/
SESSION_TOKEN) are strict — must succeed cleanly because they don't
depend on any host-side AWS state. The credential_chain fallback case
tolerates a "Secret Validation Failure" outcome (which is exactly what
the original #560 bug produced on EKS workers and what reproduces
locally on hosts without AWS identity) while still catching parser/
binder/provider-mismatch errors that would indicate the SQL form
itself regressed.

The test self-skips if INSTALL/LOAD iceberg fails — covers air-gapped
sandbox environments without losing coverage where it can run.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Removes the credential_chain branch introduced in the previous commit on
this PR. The fallback was hedging — it dressed up the broken DuckDB chain
path (the very thing this PR exists to fix) as forward-compat. In the
multitenant deploy the worker pod has no AWS identity, so a silent
fallback would just defer the same failure to attach time and obscure
the real misconfiguration.

Explicit STS credentials minted by the control plane are the only
supported auth model. If iceberg.Enabled is true and the activation
payload lacks credentials, that is a configuration bug (broken STS
broker, missing per-tenant IAM role, race in the activator) and should
surface as a clear error at attach time, not a credential_chain SQL that
will fail in DuckDB with an opaque "Secret Validation Failure".

Changes:
- BuildIcebergSecretStmt: no more empty-keyID branch; always emit
  PROVIDER config.
- AttachIcebergCatalog: validate keyID/secret are non-empty when iceberg
  is enabled; return an error naming the table_bucket so logs identify
  which tenant is affected.
- Tests: drop the credential_chain-fallback case; keep the with/without
  session-token, region default, and quote-escaping coverage.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The integration test added in 8213934 had a "credential chain fallback"
subtest that exercised BuildIcebergSecretStmt with empty credentials and
asserted DuckDB's response wasn't one of a handful of "fatal" parser
errors. That made sense when empty creds produced a `PROVIDER
credential_chain` secret. After this PR's removal of that fallback,
empty creds produce `PROVIDER config, KEY_ID '', SECRET ''` instead —
DuckDB accepts that without raising any of the listed fatal signals
(it either creates a useless empty-cred secret or fails with a
non-matching error string), so the subtest logged-and-passed while no
longer testing anything meaningful.

The strict explicit-credential subtests still cover what matters: the
SQL DuckDB actually sees on the only live code path. Removing the dead
subtest also removes the `strings` import since no other place uses it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@bill-ph bill-ph merged commit 8e436ab into main May 15, 2026
22 checks passed
@bill-ph bill-ph deleted the claude/wizardly-hawking-cbad13 branch May 15, 2026 16:18
bill-ph added a commit that referenced this pull request May 15, 2026
…563)

Follow-up to #562. That PR fixed the initial-attach path so iceberg
works at session activation, but the hot-idle credential refresh path
in duckdbservice/activation.go only rotates the DuckLake S3 secret
(ducklake_s3). The iceberg secret (iceberg_sigv4) shares the same
STS-minted credentials but was never being re-emitted, so iceberg
queries on a long-lived worker would 403 ~1h after activation while
DuckLake stayed fresh.

Changes:

- server/server.go: new RefreshIcebergSecret. Mirrors RefreshS3Secret —
  no-op when iceberg is disabled, rejects empty credentials with the
  same error shape as AttachIcebergCatalog (no silent fallback to a
  credential_chain that we removed in #562), re-emits the iceberg_sigv4
  secret via the existing BuildIcebergSecretStmt, and handles DuckDB's
  "Current transaction is aborted" recovery the same way the S3 refresh
  does.

  Importantly, RefreshIcebergSecret does NOT short-circuit when the
  iceberg catalog is already attached (unlike AttachIcebergCatalog). The
  whole point of refresh is to overwrite the existing secret in place;
  DuckDB resolves the secret at request time so the new credentials
  take effect for the next iceberg query without an explicit reattach.

- duckdbservice/service.go: new refreshIcebergSecret indirection on
  SessionPool, defaulted to server.RefreshIcebergSecret. Same shape as
  refreshS3Secret so tests can inject a stub.

- duckdbservice/activation.go (reuseExistingActivation): two changes.

  1. The needsRefresh predicate now triggers when Iceberg.Enabled is
     true *or* DuckLake.ObjectStore is set. Previously, an iceberg-only
     tenant (DuckLake metadata-only, S3 bucket empty, iceberg enabled)
     would skip refresh entirely because ObjectStore=="" gated it out —
     even though Iceberg.Enabled=true and the credentials WERE in the
     payload (the activator populates DuckLake.S3* from STS regardless
     of whether DuckLake itself uses S3).

  2. The refresh block now invokes both refresh functions, each gated
     on its own enable check. They share the same DuckLake.S3* triple
     because the per-tenant IAM role grants both s3:* and s3tables:*
     and STS returns one credential bundle covering both.

- server/iceberg_refresh_test.go: three new regression tests run in the
  server package (so they can exercise the exported RefreshIcebergSecret
  while the iceberg subpackage stays free of server-package imports).

  * TestRefreshIcebergSecretRotatesCredentials uses an in-memory DuckDB
    with the iceberg extension to verify back-to-back CREATE OR REPLACE
    with different credentials both succeed — the exact behavior STS
    rotation depends on. Skips when the extension can't be installed
    (air-gapped CI), same posture as the existing integration test.
  * TestRefreshIcebergSecretNoOpWhenDisabled confirms the activation
    layer can call this unconditionally without harm when the tenant
    hasn't opted in to iceberg.
  * TestRefreshIcebergSecretRejectsEmptyCredentials locks in the
    "fail loud, no silent fallback" invariant for the refresh path,
    matching what #562 established for the attach path.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant