
fix: Apply CockroachDB->Postgres DDL adapter in tenant provisioner #2133

Merged
bjcoombs merged 2 commits into develop from fix-migration-drop-constraint-postgres
Apr 5, 2026

Conversation

Collaborator

@bjcoombs bjcoombs commented Apr 5, 2026

Summary

  • Export AdaptCockroachDDLForPostgres from internal/migrations and call it from the tenant provisioner's migration runner when DB_DRIVER=postgres. The adapter was previously only applied by the CLI --migrate path that runs against public schemas; tenant schema provisioning bypassed it and hit "cannot drop index because constraint requires it" on Postgres.
  • Update deploy-develop.yml to DROP SCHEMA IF EXISTS org_* CASCADE across all service databases before forcing re-provisioning. The previous reset cleared metadata but left ghost tables from broken runs, and migrations that rename tables (e.g. internal-account/20260225) could not reconcile the resulting mixed state.

Why two fixes in one PR

Both problems surfaced on the same deploy cycle after PR #2131 merged. With only the DDL adapter fix, the CRDB-specific DROP INDEX CASCADE statements succeed, but tenant schemas still contain ghost state. With only the DROP SCHEMA fix, schemas are clean but CRDB syntax still fails on Postgres. Both are required for the deploy to reach a green state.

Evidence

Post-#2131 merge deploy on c49d82b failed with:
```
meridian_master: control-plane migrations failed: execute migration 20260401000002_migrate_version_to_varchar.sql:
ERROR: cannot drop index uq_manifest_version_version because constraint uq_manifest_version_version on table manifest_version requires it (SQLSTATE 2BP01)

volterra_energy: internal-account migrations failed: execute migration 20260116000001_add_clearing_purpose_column.sql:
ERROR: relation "internal_bank_account" does not exist (SQLSTATE 42P01)
```
The first is the CRDB->Postgres adapter gap. The second is ghost state from a previous broken run that left the schema with the renamed `internal_account` table but no `internal_bank_account`, breaking later re-runs that expect to ADD a column to it.

Changes

internal/migrations/runner.go + adapt_test.go

  • Rename `adaptCockroachDDLForPostgres` to `AdaptCockroachDDLForPostgres` (exported).
  • Update internal callsite and test file references.

services/tenant/provisioner/migration_runner.go

  • Import `internal/migrations` (aliased as `internalmigrations` because the local `migration` type shadows the package name).
  • Call `AdaptCockroachDDLForPostgres` in `applyMigrationInTransaction` when `DriverFromEnv() == DriverPostgres`.
  • Extract `applyMigrationInTransaction` helper to keep `applyMigrationList` under the gocognit complexity limit.

.github/workflows/deploy-develop.yml

  • Before the tenant_provisioning reset, iterate each service DB and `DROP SCHEMA IF EXISTS org_* CASCADE`. Uses `pg_namespace` with an ESCAPE clause to match `org_%` literally.

Test plan

  • `go build ./...` clean
  • `go test ./internal/migrations/... ./services/tenant/provisioner/...` passes locally (335s)
  • pre-commit (gitleaks, gofumpt, golangci-lint, gocognit) all green
  • CI green on this branch
  • After merge: develop deploy reaches "Tenant provisioned and active" and seed-dev manifest apply succeeds
  • Verify on droplet: `tenant.status` reaches `active` for both `volterra_energy` and `meridian_master`, and tenant schemas contain the full table set

Two related fixes so the develop deploy can fully recover after a
broken provisioning run:

1. Export internal/migrations.AdaptCockroachDDLForPostgres and call
   it from services/tenant/provisioner/migration_runner.go when
   DB_DRIVER=postgres. Previously the adapter was only applied by
   the CLI --migrate path that runs against public schemas; the
   tenant provisioner applied CockroachDB-native DDL directly to
   tenant schemas and hit "cannot drop index because constraint
   requires it" on Postgres for migrations like control-plane/
   20260401000002_migrate_version_to_varchar.sql and reference-data/
   20260127000001_fix_platform_saga_unique_constraint.sql. Tenant
   provisioning now gets the same Postgres compatibility treatment.

2. Update .github/workflows/deploy-develop.yml to DROP SCHEMA IF
   EXISTS ... CASCADE for every org_* schema across all service
   databases before forcing re-provisioning. The previous reset
   cleared tenant.status and service_schemas metadata but left
   physical tables from the broken run in place. Migrations that
   rename tables (e.g. internal-account/20260225 rename_to_internal_
   account) leave the tenant schema in a state that later reruns
   cannot reconcile because CREATE TABLE succeeds creating a fresh
   empty table next to the existing renamed one. Dropping and
   recreating from scratch matches the E2E workflow path which is
   known to pass.

Internal aliasing: the provisioner package already has a local
type named "migration", so the internal/migrations import is
aliased to internalmigrations to avoid shadowing.

Refactor: extracted applyMigrationInTransaction helper out of
applyMigrationList to keep cognitive complexity under the
architecture baseline (gocognit limit).
Contributor

coderabbitai Bot commented Apr 5, 2026

📝 Walkthrough


Deployment workflow now stops the app container, drops physical org_<tenant> schemas across service databases, clears provisioning state rows, then restarts the container. A CockroachDB→Postgres DDL adapter was exported and tenant provisioner refactored to run per-migration transactions with optional per-statement DDL adaptation.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Deployment Workflow**<br>`.github/workflows/deploy-develop.yml` | Stops the `meridian-develop` container, loops over `SERVICE_DBS` to drop `org_<tenant>` schemas (`DROP SCHEMA IF EXISTS ... CASCADE`) via `postgres-develop`, clears `service_schemas` and resets provisioning rows, then starts the container so the provisioning worker observes the reset state. |
| **DDL Adapter Export & Tests**<br>`internal/migrations/runner.go`, `internal/migrations/adapt_test.go` | Renamed and exported `adaptCockroachDDLForPostgres` → `AdaptCockroachDDLForPostgres`; updated call sites and tests to use the exported function. |
| **Tenant Provisioner Refactor & Tests**<br>`services/tenant/provisioner/migration_runner.go`, `services/tenant/provisioner/migration_runner_test.go` | Introduced `applyMigrationInTransaction` to execute each migration in its own transaction, added per-statement Cockroach→Postgres adaptation when running under the Postgres driver, and added tests verifying that per-statement adaptation preserves DO-block wrapping semantics. |

Sequence Diagram(s)

sequenceDiagram
    participant GH as GitHub Actions
    participant Host as Docker Host
    participant Container as meridian-develop (app)
    participant Postgres as postgres-develop
    participant DBs as Service DBs
    participant Provisioner as Provisioning Worker

    GH->>Host: stop Container
    Host->>Container: docker compose stop
    GH->>Postgres: for each SERVICE_DB -> query pg_namespace for org_<tenant>
    Postgres->>DBs: execute DROP SCHEMA IF EXISTS org_<tenant> CASCADE
    Postgres->>Postgres: UPDATE tenant.status / tenant_provisioning (clear service_schemas, set pending)
    Host->>Container: docker compose start
    Container->>Provisioner: boot / worker starts
    Provisioner->>Postgres: read tenant.status and tenant_provisioning
    Provisioner->>Provisioner: enqueue/process provisioning based on reset state

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 45.45% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title directly and clearly describes the main change: exporting and applying the CockroachDB-to-Postgres DDL adapter in the tenant provisioner, which is the core fix for the Postgres migration errors. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, explaining the rationale, evidence of the problem, specific changes made across multiple files, and test results. |



claude Bot commented Apr 5, 2026

Claude Code Review

Commit: 08dd70d | CI: running (most checks still pending)

Solid bug fix that closes two related gaps in the develop deploy pipeline. The first commit correctly exports AdaptCockroachDDLForPostgres and wires it into the tenant provisioner. The second commit (responding to CodeRabbit) fixes a real correctness issue by applying the adapter per statement after splitting, rather than on the full SQL blob, and eliminates a race condition in the deploy workflow by stopping the app before resetting state.

The code is well-structured, the test pinning the split-then-adapt ordering is valuable, and the deploy workflow comments clearly explain the sequencing rationale.

Risk Assessment: All Low. Affects develop deploy only (production uses CockroachDB natively). Safe to revert. No schema migrations. Not hot path. Self-contained to provisioner and deploy workflow.

Findings: Note - migration_runner.go:310 - DriverFromEnv() gate relies on DB_DRIVER env var; if unset, adapter silently skips. Not a real risk since the entire system depends on this same env var. Consistent with CLI path in runner.go:405. Informational only.

Bot Review Notes:

CodeRabbit thread on migration_runner.go:320 (unresolved): Split ordering concern (Critical) - Fully addressed in commit 08dd70d. Code now splits first via splitSQLStatements, then adapts each statement individually. New test TestPostgresAdapter_PerStatementAdaptation_PreservesDOBlocks pins this ordering. DriverFromEnv() defaulting concern - Still present but disagree this is a bug. PostgresProvisioner only runs in environments where DB_DRIVER=postgres is configured. If env var is missing, the entire system would be wrong, not just this adapter. Pattern is identical to CLI path in runner.go:405.

CodeRabbit thread on deploy-develop.yml:314 (resolved by CodeRabbit itself): Stop-before-reset concern addressed in commit 08dd70d. Workflow now does stop, reset state + drop schemas, then start.


@claude claude Bot left a comment


Clean fix. Well-scoped, correctly coupled, minimal blast radius. See summary comment for details. One minor suggestion on migration_runner.go:299 — DriverFromEnv() is called per-migration inside applyMigrationInTransaction; reading it once in applyMigrationList and passing it down would match the pattern in internal/migrations/runner.go.

coderabbitai Bot previously requested changes Apr 5, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
.github/workflows/deploy-develop.yml (1)

304-305: Avoid a second authoritative service list here.

SERVICE_DBS matches defaultServiceDefs in services/tenant/provisioner/provisioner.go:390-411 today, but it now has to be maintained in two places. The next service addition can update the provisioner without updating this cleanup step, leaving stale org_* schemas behind in the new database. A small shared script or repo check would make this much harder to drift.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/deploy-develop.yml around lines 304 - 305, SERVICE_DBS
duplicates the canonical service list in defaultServiceDefs (provisioner.go);
replace the hardcoded SERVICE_DBS with a single source of truth. Create a small
script or command (e.g., scripts/list_services.sh or a Go helper that prints the
space-separated service names) that is used by provisioner.go (or generated from
defaultServiceDefs) and invoked by the deploy workflow; then update the workflow
to call that script/command instead of defining SERVICE_DBS inline so the
cleanup loop uses the same list as defaultServiceDefs and cannot drift.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/deploy-develop.yml:
- Around line 283-304: Ensure the deploy workflow stops the running application
before resetting tenant provisioning state and dropping DB schemas: add a step
to stop/scale down the meridian-develop service (or otherwise halt the app)
prior to the block that sets SERVICE_DBS and flips tenants to
provisioning_pending and clears tenant_provisioning.service_schemas; then only
after tenants are marked pending and schemas dropped, restart or redeploy
meridian-develop. Apply the same sequencing change to the corresponding block
referenced around lines 314-321 so the worker cannot race with the reset.

In `@services/tenant/provisioner/migration_runner.go`:
- Around line 300-313: The migration SQL is being fully adapted via
internalmigrations.AdaptCockroachDDLForPostgres before splitSQLStatements, which
breaks dollar-quoted DO blocks (splitSQLStatements doesn't handle dollar
quotes); move the adaptation to after splitting and apply
AdaptCockroachDDLForPostgres to each individual statement (or alternatively
improve splitSQLStatements to recognize dollar-quoted bodies), and ensure the
adaptation decision in PostgresProvisioner is based on the provisioner's actual
target driver (not internalmigrations.DriverFromEnv() which defaults to
CockroachDB) so the adaptation always runs when provisioning Postgres; update
the loop that iterates statements (the tx.Exec call) to exec the adapted
per-statement SQL and add a regression test that runs a public-schema CHECK
migration containing a DO $...$ block through this code path to verify no broken
fragments are executed.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9f0d523e-4835-416b-afb9-9ccc39457b0b

📥 Commits

Reviewing files that changed from the base of the PR and between 827b5c4 and 6da5cad.

📒 Files selected for processing (4)
  • .github/workflows/deploy-develop.yml
  • internal/migrations/adapt_test.go
  • internal/migrations/runner.go
  • services/tenant/provisioner/migration_runner.go


codecov Bot commented Apr 5, 2026

Codecov Report

❌ Patch coverage is 71.42857% with 4 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| services/tenant/provisioner/migration_runner.go | 71.42% | 2 Missing and 2 partials ⚠️ |


…y sequencing

Two critical issues flagged on the review:

1. Critical: AdaptCockroachDDLForPostgres was called on the full
   processed SQL before splitSQLStatements. The adapter wraps
   public-schema CHECK constraints in DO $compat$ BEGIN ...;
   EXCEPTION WHEN duplicate_object THEN NULL; END $compat$; blocks
   whose internal semicolons are then split into three invalid
   fragments by splitSQLStatements (which does not understand
   dollar-quoted bodies).

   Fix: split first, then apply the adapter to each statement
   individually, re-adding the trailing `;` the splitter stripped
   so the adapter's `;`-anchored regex still matches. Add
   TestPostgresAdapter_PerStatementAdaptation_PreservesDOBlocks,
   which pins this ordering: it asserts that the DO block survives
   as a single statement, so restoring the pre-split flow fails
   the build.

2. Major: the deploy workflow reset tenant.status while meridian-
   develop was still running. The provisioning worker polls that
   column every 10s, so it would race the script: start
   provisioning against half-cleaned databases and get interrupted
   by the subsequent restart.

   Fix: stop meridian-develop before resetting state and dropping
   schemas, then start it back up afterwards. "restart" is
   replaced with "stop + work + start" so the worker observes the
   fully-reset state on boot.
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
services/tenant/provisioner/migration_runner_test.go (1)

681-690: Tighten assertion to enforce exactly one DO-wrapped statement.

The test comment says one adapted statement should be wrapped, but current checks only require at least one. Counting wrappers will make this guard stricter.

♻️ Suggested assertion hardening
 	// Find the DO-wrapped statement.
 	var doStmt string
+	wrappedCount := 0
 	for _, a := range adapted {
 		if strings.Contains(a, "DO $compat$") {
+			wrappedCount++
 			doStmt = a
-			break
 		}
 	}
+	require.Equal(t, 1, wrappedCount, "expected exactly one DO-wrapped statement")
 	require.NotEmpty(t, doStmt, "expected one statement to be wrapped in DO $compat$ block")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/tenant/provisioner/migration_runner_test.go` around lines 681 - 690,
The test currently only asserts that at least one adapted statement contains "DO
$compat$"; tighten it to assert there is exactly one such wrapper by iterating
over adapted (same slice variable) to count occurrences of "DO $compat$" (or
collect matches) and replace require.NotEmpty(t, doStmt, ...) with
require.Equal(t, 1, count, "expected exactly one statement to be wrapped in DO
$compat$ block"); update/remove the doStmt variable accordingly so the test
fails if zero or multiple DO-wrapped statements are present.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 55d42d11-c21e-423b-bf7b-55ec54a27ca7

📥 Commits

Reviewing files that changed from the base of the PR and between 6da5cad and 08dd70d.

📒 Files selected for processing (3)
  • .github/workflows/deploy-develop.yml
  • services/tenant/provisioner/migration_runner.go
  • services/tenant/provisioner/migration_runner_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • .github/workflows/deploy-develop.yml
  • services/tenant/provisioner/migration_runner.go


@claude claude Bot left a comment


See summary comment for full review. Both CodeRabbit concerns addressed in the second commit. No blocking issues found. Clean, well-tested fix.

@bjcoombs bjcoombs dismissed coderabbitai[bot]’s stale review April 5, 2026 11:02

Addressed in follow-up commit. CodeRabbit re-reviewed and approved at 11:02.

@bjcoombs bjcoombs merged commit f818796 into develop Apr 5, 2026
40 checks passed
@bjcoombs bjcoombs deleted the fix-migration-drop-constraint-postgres branch April 5, 2026 12:56
bjcoombs added a commit that referenced this pull request Apr 6, 2026
The demo deploy had the same broken reset as develop had before PRs
#2131 and #2133:
- Only reset tenant_provisioning.state, not tenant.status (worker
  polls tenant.status via ListByStatus(StatusProvisioningPending))
- Didn't clear service_schemas JSONB (provisioner short-circuits on
  "service already provisioned, skipping")
- Didn't drop ghost schemas from previous broken runs
- Didn't stop the app before resetting (race between worker and
  schema drops)

Port the same fix from deploy-develop.yml:
- Stop meridian before resetting state
- DROP SCHEMA CASCADE for org_* schemas across all service databases
- Reset both tenant.status and tenant_provisioning.service_schemas
- Start meridian after reset
Co-authored-by: Ben Coombs <bjcoombs@users.noreply.github.com>