fix: Apply CockroachDB->Postgres DDL adapter in tenant provisioner #2133
Conversation
Two related fixes so the develop deploy can fully recover after a broken provisioning run:

1. Export `internal/migrations.AdaptCockroachDDLForPostgres` and call it from `services/tenant/provisioner/migration_runner.go` when `DB_DRIVER=postgres`. Previously the adapter was only applied by the CLI `--migrate` path that runs against public schemas; the tenant provisioner applied CockroachDB-native DDL directly to tenant schemas and hit "cannot drop index because constraint requires it" on Postgres for migrations like `control-plane/20260401000002_migrate_version_to_varchar.sql` and `reference-data/20260127000001_fix_platform_saga_unique_constraint.sql`. Tenant provisioning now gets the same Postgres compatibility treatment.

2. Update `.github/workflows/deploy-develop.yml` to `DROP SCHEMA IF EXISTS ... CASCADE` for every `org_*` schema across all service databases before forcing re-provisioning. The previous reset cleared `tenant.status` and `service_schemas` metadata but left physical tables from the broken run in place. Migrations that rename tables (e.g. internal-account/20260225 rename_to_internal_account) leave the tenant schema in a state that later reruns cannot reconcile, because `CREATE TABLE` succeeds, creating a fresh empty table next to the existing renamed one. Dropping and recreating from scratch matches the E2E workflow path, which is known to pass.

Internal aliasing: the provisioner package already has a local type named `migration`, so the `internal/migrations` import is aliased to `internalmigrations` to avoid shadowing.

Refactor: extracted an `applyMigrationInTransaction` helper out of `applyMigrationList` to keep cognitive complexity under the architecture baseline (gocognit limit).
📝 Walkthrough

Deployment workflow now stops the app container, drops physical org_* schemas and resets tenant provisioning state before restarting; the tenant provisioner now applies the CockroachDB->Postgres DDL adapter when provisioning tenant schemas.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant GH as GitHub Actions
    participant Host as Docker Host
    participant Container as meridian-develop (app)
    participant Postgres as postgres-develop
    participant DBs as Service DBs
    participant Provisioner as Provisioning Worker
    GH->>Host: stop Container
    Host->>Container: docker compose stop
    GH->>Postgres: for each SERVICE_DB, query pg_namespace for org_<tenant>
    Postgres->>DBs: execute DROP SCHEMA IF EXISTS org_<tenant> CASCADE
    Postgres->>Postgres: UPDATE tenant.status / tenant_provisioning (clear service_schemas, set pending)
    Host->>Container: docker compose start
    Container->>Provisioner: boot / worker starts
    Provisioner->>Postgres: read tenant.status and tenant_provisioning
    Provisioner->>Provisioner: enqueue/process provisioning based on reset state
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Claude Code Review

Commit: 08dd70d | CI: running (most checks still pending)

Solid bug fix that closes two related gaps in the develop deploy pipeline. The first commit correctly exports AdaptCockroachDDLForPostgres and wires it into the tenant provisioner. The second commit (responding to CodeRabbit) fixes a real correctness issue, applying the adapter per-statement after splitting rather than on the full SQL blob, and eliminates a race condition in the deploy workflow by stopping the app before resetting state. The code is well-structured, the test pinning the split-then-adapt ordering is valuable, and the deploy workflow comments clearly explain the sequencing rationale.

Risk Assessment: All Low. Affects develop deploy only (production uses CockroachDB natively). Safe to revert. No schema migrations. Not hot path. Self-contained to provisioner and deploy workflow.

Findings:
- Note - migration_runner.go:310 - DriverFromEnv() gate relies on the DB_DRIVER env var; if unset, the adapter silently skips. Not a real risk, since the entire system depends on this same env var. Consistent with the CLI path in runner.go:405. Informational only.

Bot Review Notes:
- CodeRabbit thread on migration_runner.go:320 (unresolved): Split ordering concern (Critical) - fully addressed in commit 08dd70d. Code now splits first via splitSQLStatements, then adapts each statement individually. New test TestPostgresAdapter_PerStatementAdaptation_PreservesDOBlocks pins this ordering.
- DriverFromEnv() defaulting concern - still present, but I disagree this is a bug. PostgresProvisioner only runs in environments where DB_DRIVER=postgres is configured. If the env var is missing, the entire system would be wrong, not just this adapter. The pattern is identical to the CLI path in runner.go:405.
- CodeRabbit thread on deploy-develop.yml:314 (resolved by CodeRabbit itself): Stop-before-reset concern addressed in commit 08dd70d. The workflow now does stop, then reset state + drop schemas, then start.
Clean fix. Well-scoped, correctly coupled, minimal blast radius. See summary comment for details. One minor suggestion on migration_runner.go:299 — DriverFromEnv() is called per-migration inside applyMigrationInTransaction; reading it once in applyMigrationList and passing it down would match the pattern in internal/migrations/runner.go.
Actionable comments posted: 2
🧹 Nitpick comments (1)
.github/workflows/deploy-develop.yml (1)
304-305: Avoid a second authoritative service list here.
`SERVICE_DBS` matches `defaultServiceDefs` in `services/tenant/provisioner/provisioner.go:390-411` today, but it now has to be maintained in two places. The next service addition can update the provisioner without updating this cleanup step, leaving stale `org_*` schemas behind in the new database. A small shared script or repo check would make such drift much harder.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/deploy-develop.yml around lines 304 - 305, SERVICE_DBS duplicates the canonical service list in defaultServiceDefs (provisioner.go); replace the hardcoded SERVICE_DBS with a single source of truth. Create a small script or command (e.g., scripts/list_services.sh or a Go helper that prints the space-separated service names) that is used by provisioner.go (or generated from defaultServiceDefs) and invoked by the deploy workflow; then update the workflow to call that script/command instead of defining SERVICE_DBS inline so the cleanup loop uses the same list as defaultServiceDefs and cannot drift.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/deploy-develop.yml:
- Around line 283-304: Ensure the deploy workflow stops the running application
before resetting tenant provisioning state and dropping DB schemas: add a step
to stop/scale down the meridian-develop service (or otherwise halt the app)
prior to the block that sets SERVICE_DBS and flips tenants to
provisioning_pending and clears tenant_provisioning.service_schemas; then only
after tenants are marked pending and schemas dropped, restart or redeploy
meridian-develop. Apply the same sequencing change to the corresponding block
referenced around lines 314-321 so the worker cannot race with the reset.
In `@services/tenant/provisioner/migration_runner.go`:
- Around line 300-313: The migration SQL is being fully adapted via
internalmigrations.AdaptCockroachDDLForPostgres before splitSQLStatements, which
breaks dollar-quoted DO blocks (splitSQLStatements doesn't handle dollar
quotes); move the adaptation to after splitting and apply
AdaptCockroachDDLForPostgres to each individual statement (or alternatively
improve splitSQLStatements to recognize dollar-quoted bodies), and ensure the
adaptation decision in PostgresProvisioner is based on the provisioner's actual
target driver (not internalmigrations.DriverFromEnv() which defaults to
CockroachDB) so the adaptation always runs when provisioning Postgres; update
the loop that iterates statements (the tx.Exec call) to exec the adapted
per-statement SQL and add a regression test that runs a public-schema CHECK
migration containing a DO $...$ block through this code path to verify no broken
fragments are executed.
---
Nitpick comments:
In @.github/workflows/deploy-develop.yml:
- Around line 304-305: SERVICE_DBS duplicates the canonical service list in
defaultServiceDefs (provisioner.go); replace the hardcoded SERVICE_DBS with a
single source of truth. Create a small script or command (e.g.,
scripts/list_services.sh or a Go helper that prints the space-separated service
names) that is used by provisioner.go (or generated from defaultServiceDefs) and
invoked by the deploy workflow; then update the workflow to call that
script/command instead of defining SERVICE_DBS inline so the cleanup loop uses
the same list as defaultServiceDefs and cannot drift.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 9f0d523e-4835-416b-afb9-9ccc39457b0b
📒 Files selected for processing (4)
- .github/workflows/deploy-develop.yml
- internal/migrations/adapt_test.go
- internal/migrations/runner.go
- services/tenant/provisioner/migration_runner.go
Codecov Report: ❌ Patch coverage is
…y sequencing

Two critical issues flagged on the review:

1. Critical: `AdaptCockroachDDLForPostgres` was called on the full processed SQL before `splitSQLStatements`. The adapter wraps public-schema CHECK constraints in `DO $compat$ BEGIN ...; EXCEPTION WHEN duplicate_object THEN NULL; END $compat$;` blocks whose internal semicolons are then split into three invalid fragments by `splitSQLStatements` (which does not understand dollar-quoted bodies). Fix: split first, then apply the adapter to each statement individually, re-adding the trailing `;` the splitter stripped so the adapter's `;`-anchored regex still matches. Add `TestPostgresAdapter_PerStatementAdaptation_PreservesDOBlocks`, which pins this ordering: if the pre-split flow is ever restored, the test proves the DO block survives splitting and fails the build.

2. Major: the deploy workflow reset `tenant.status` while meridian-develop was still running. The provisioning worker polls that column every 10s, so it would race the script: start provisioning against half-cleaned databases and get interrupted by the subsequent restart. Fix: stop meridian-develop before resetting state and dropping schemas, then start it back up afterwards. "restart" is replaced with "stop + work + start" so the worker observes the fully-reset state on boot.
🧹 Nitpick comments (1)
services/tenant/provisioner/migration_runner_test.go (1)
681-690: Tighten the assertion to enforce exactly one DO-wrapped statement.

The test comment says one adapted statement should be wrapped, but the current checks only require at least one. Counting wrappers will make this guard stricter.
♻️ Suggested assertion hardening
```diff
 // Find the DO-wrapped statement.
 var doStmt string
+wrappedCount := 0
 for _, a := range adapted {
 	if strings.Contains(a, "DO $compat$") {
+		wrappedCount++
 		doStmt = a
-		break
 	}
 }
+require.Equal(t, 1, wrappedCount, "expected exactly one DO-wrapped statement")
 require.NotEmpty(t, doStmt, "expected one statement to be wrapped in DO $compat$ block")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@services/tenant/provisioner/migration_runner_test.go` around lines 681 - 690, The test currently only asserts that at least one adapted statement contains "DO $compat$"; tighten it to assert there is exactly one such wrapper by iterating over adapted (same slice variable) to count occurrences of "DO $compat$" (or collect matches) and replace require.NotEmpty(t, doStmt, ...) with require.Equal(t, 1, count, "expected exactly one statement to be wrapped in DO $compat$ block"); update/remove the doStmt variable accordingly so the test fails if zero or multiple DO-wrapped statements are present.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@services/tenant/provisioner/migration_runner_test.go`:
- Around line 681-690: The test currently only asserts that at least one adapted
statement contains "DO $compat$"; tighten it to assert there is exactly one such
wrapper by iterating over adapted (same slice variable) to count occurrences of
"DO $compat$" (or collect matches) and replace require.NotEmpty(t, doStmt, ...)
with require.Equal(t, 1, count, "expected exactly one statement to be wrapped in
DO $compat$ block"); update/remove the doStmt variable accordingly so the test
fails if zero or multiple DO-wrapped statements are present.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 55d42d11-c21e-423b-bf7b-55ec54a27ca7
📒 Files selected for processing (3)
- .github/workflows/deploy-develop.yml
- services/tenant/provisioner/migration_runner.go
- services/tenant/provisioner/migration_runner_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- .github/workflows/deploy-develop.yml
- services/tenant/provisioner/migration_runner.go
Addressed in follow-up commit. CodeRabbit re-reviewed and approved at 11:02.
The demo deploy had the same broken reset as develop had before PRs #2131 and #2133:

- Only reset `tenant_provisioning.state`, not `tenant.status` (the worker polls `tenant.status` via `ListByStatus(StatusProvisioningPending)`)
- Didn't clear the `service_schemas` JSONB (the provisioner short-circuits on "service already provisioned, skipping")
- Didn't drop ghost schemas from previous broken runs
- Didn't stop the app before resetting (race between worker and schema drops)

Port the same fix from deploy-develop.yml:

- Stop meridian before resetting state
- DROP SCHEMA CASCADE for org_* schemas across all service databases
- Reset both tenant.status and tenant_provisioning.service_schemas
- Start meridian after reset
Co-authored-by: Ben Coombs <bjcoombs@users.noreply.github.com>
Summary
- Export `AdaptCockroachDDLForPostgres` from `internal/migrations` and call it from the tenant provisioner's migration runner when `DB_DRIVER=postgres`. The adapter was previously only applied by the CLI `--migrate` path that runs against public schemas; tenant schema provisioning bypassed it and hit "cannot drop index because constraint requires it" on Postgres.
- Update `deploy-develop.yml` to `DROP SCHEMA IF EXISTS org_* CASCADE` across all service databases before forcing re-provisioning. The previous reset cleared metadata but left ghost tables from broken runs, and migrations that rename tables (e.g. internal-account/20260225) could not reconcile the resulting mixed state.

Why two fixes in one PR
Both problems surfaced on the same deploy cycle after PR #2131 merged. With only the DDL adapter fix, the CRDB-specific DROP INDEX CASCADE statements succeed, but tenant schemas still contain ghost state. With only the DROP SCHEMA fix, schemas are clean but CRDB syntax still fails on Postgres. Both are required for the deploy to reach a green state.
Evidence
Post-#2131 merge deploy on c49d82b failed with:
```
meridian_master: control-plane migrations failed: execute migration 20260401000002_migrate_version_to_varchar.sql:
ERROR: cannot drop index uq_manifest_version_version because constraint uq_manifest_version_version on table manifest_version requires it (SQLSTATE 2BP01)
volterra_energy: internal-account migrations failed: execute migration 20260116000001_add_clearing_purpose_column.sql:
ERROR: relation "internal_bank_account" does not exist (SQLSTATE 42P01)
```
The first is the CRDB->Postgres adapter gap. The second is ghost state from a previous broken run that left the schema with the renamed `internal_account` table but no `internal_bank_account`, breaking later re-runs that expect to ADD a column to it.
Changes
- internal/migrations/runner.go + adapt_test.go
- services/tenant/provisioner/migration_runner.go
- .github/workflows/deploy-develop.yml
Test plan