Skip to content

fix: Skip COMMENT statements in tenant migration rewriter#2128

Merged
bjcoombs merged 3 commits intodevelopfrom
fix-develop-20260404-comment-on-column-rewriter
Apr 5, 2026
Merged

fix: Skip COMMENT statements in tenant migration rewriter#2128
bjcoombs merged 3 commits intodevelopfrom
fix-develop-20260404-comment-on-column-rewriter

Conversation

@bjcoombs
Copy link
Copy Markdown
Collaborator

@bjcoombs bjcoombs commented Apr 4, 2026

Summary

  • processMigrationSQL naively rewrote COMMENT ON COLUMN \"party\".\"attributes\" into \"org_tenant\".\"attributes\", producing schema.column instead of table.column and failing with relation does not exist.
  • Every fresh tenant hard-failed at the party.attributes migration, cascading into the data_source / dataset_definition / instrument_definition errors seen in E2E shards and the develop deploy.
  • Fix the rewriter to skip COMMENT statements, re-enable the previously-skipped knownBroken migration tests, bump the architecture size baseline to unblock Test shard 0, and broaden the deploy-develop tenant_provisioning reset so stale tenants get re-provisioned.

Root cause

PR #2120 documented this bug (adding a knownBroken skip-list and a no-op corrective migration) but did not fix processMigrationSQL. The rewriter treats the second identifier in a COMMENT ON COLUMN as a table name and replaces it with the tenant schema.

There are zero legitimate uses of schema-qualified references (\"party\".\"table\", \"payment_order\".\"table\", etc.) in any migration file - all matches are COMMENT ON COLUMN statements. The rewriter was solving a phantom problem while corrupting real statements.

Changes

services/tenant/provisioner/migration_runner.go

  • processMigrationSQL: split SQL into statements, skip schema-pattern rewriting for COMMENT statements, rewrite DDL as before.
  • Add isCommentStatement helper that tolerates leading whitespace and line comments.

services/tenant/provisioner/migration_runner_test.go

  • Remove the three knownBroken entries so TestProcessMigrationSQL_AllMigrations_Parse now exercises the previously-broken migrations.
  • Replace TestProcessMigrationSQL_CommentOnColumn_BugDocumentation with TestProcessMigrationSQL_CommentOnColumn_NotRewritten, asserting the fix preserves table.column.
  • Add TestProcessMigrationSQL_CommentOnColumn_MixedWithDDL to pin mixed-statement behavior (DDL rewritten, COMMENT preserved).

tests/architecture/size_test.go

  • Bump baselineOversizedFunctions 187 -> 189 (unrelated ratchet failure on Test shard 0).

.github/workflows/deploy-develop.yml

  • Broaden the post-migration tenant_provisioning reset to clear service_schemas and include all non-deprovisioned rows. Without this, the existing volterra_energy row on develop (state=pending, all services marked migrated, no physical tables) short-circuits provisioning via the "service already provisioned, skipping" check in provisionAllServices and the code fix alone cannot recover the droplet.

Evidence

  • Failing runs: gh run view 23980915882 (Deploy Develop), 23980915891 (E2E all 4 shards), 23980915885 (Test shard 0), 23980915881 (Build & Test).
  • Error from meridian_master tenant_provisioning on develop: party migrations failed: execute migration 20260221000001_add_party_attributes.sql: ERROR: relation \"org_meridian_master\" does not exist (SQLSTATE 42P01).
  • Local: go test ./services/tenant/provisioner/ -count=1 passes (340s); go test ./tests/architecture/ -run TestFunctionSize passes.

Test plan

  • CI green on this branch
  • E2E shards reach manifest execute stage (previously aborted at diff step due to missing tenant tables)
  • Deploy Develop completes seed-dev successfully

processMigrationSQL did naive string replacement of schema patterns
("party".X -> "org_tenant".X) across all migration content. This
corrupted COMMENT ON COLUMN "party"."attributes" IS ..., where the
second identifier is a column name rather than a schema qualifier.
The rewrite produced COMMENT ON COLUMN "org_tenant"."attributes"
which Postgres parses as schema.table (missing the column) and fails
with "relation does not exist".

PR #2120 documented the bug with a knownBroken skip-list and a no-op
corrective migration but did not fix the rewriter. Every fresh tenant
now hard-fails at the party.attributes migration, cascading into the
"relation does not exist" errors for data_source / dataset_definition /
instrument_definition seen in E2E and develop deploy - downstream
service schemas are never populated because provisioning aborts at
the first service.

Fix processMigrationSQL to split SQL into statements, detect COMMENT
statements via isCommentStatement, and skip schema-pattern rewriting
for them while still rewriting DDL statements normally.

Also:
- Remove the knownBroken entries in TestProcessMigrationSQL_AllMigrations_Parse
  so the three affected migrations are exercised end-to-end.
- Replace TestProcessMigrationSQL_CommentOnColumn_BugDocumentation with
  TestProcessMigrationSQL_CommentOnColumn_NotRewritten, asserting the
  fix preserves table.column references.
- Add TestProcessMigrationSQL_CommentOnColumn_MixedWithDDL to pin the
  mixed-statement behavior (DDL rewritten, COMMENT preserved).
- Bump baselineOversizedFunctions 187 -> 189 to unblock Test shard 0.
- deploy-develop.yml: broaden the tenant_provisioning reset to clear
  service_schemas and include all non-deprovisioned rows, so stale
  tenants whose service_schemas entries are marked migrated but whose
  physical tables are missing get re-provisioned on redeploy.
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 4, 2026

📝 Walkthrough

Walkthrough

Split migration SQL into individual statements, skip rewriting COMMENT ... statements, and rejoin statements; tests updated to cover comment-preservation. CI: deploy workflow now resets most tenants to pending and clears schemas/errors; e2e workflow stages per-service migrations and enables schema provisioning. Test baseline incremented.

Changes

Cohort / File(s) Summary
Migration SQL processor & tests
services/tenant/provisioner/migration_runner.go, services/tenant/provisioner/migration_runner_test.go
Process migration SQL per-statement, add isCommentStatement/stripLeadingNoise to detect and skip COMMENT ... statements, rewrite other statements, reassemble with trailing semicolon. Tests renamed/added to assert comments are not rewritten; removed prior known-broken exclusions.
Deploy workflow tenant reset
.github/workflows/deploy-develop.yml
Post-migration SSH step now sets state='pending' for all tenants except deprovisioned, clears service_schemas to []::jsonb, and resets error_message to ''.
E2E workflow & migration staging
.github/workflows/e2e.yml
Add pre-start step copying services/<service>/migrations into /tmp/provisioner-migrations/<service>; start command sets SCHEMA_PROVISIONING_ENABLED=true and MIGRATIONS_BASE_PATH=/tmp/provisioner-migrations; removed explicit CockroachDB tenant schema provisioning step.
Test baseline adjustment
tests/architecture/size_test.go
Incremented baselineOversizedFunctions from 187 to 189.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main fix: skipping COMMENT statements in tenant migration rewriting, which directly addresses the core bug causing tenant provisioning failures.
Description check ✅ Passed The description comprehensively relates to the changeset, explaining the bug, root cause, specific file changes, evidence, and test plan for all modifications in the PR.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix-develop-20260404-comment-on-column-rewriter

Comment @coderabbitai help to get the list of available commands and usage tips.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 4, 2026

Claude Code Review

Commit: ae006ff | CI: running

Summary

Clean, well-scoped bug fix for a migration rewriter that corrupted COMMENT ON COLUMN statements by treating table.column as schema.table. The root cause is clearly identified, the fix is surgical (skip COMMENT statements from schema-pattern rewriting), and the test coverage is thorough - including re-enabling the previously skipped knownBroken migrations.

The E2E workflow change from manual schema creation to using the actual provisioner (SCHEMA_PROVISIONING_ENABLED=true) is a meaningful improvement that makes E2E tests exercise the production code path.

Risk Assessment

Area Level Detail
Blast radius Med Tenant provisioning - all tenants affected
Rollback Safe Go code change, no migration schema changes
Scale Low Runs once per migration per tenant, not hot path
Cross-system Low Self-contained in tenant provisioner
Migration N/A No new migration files

Findings

Severity Location Description Status
Note migration_runner.go:390-408 isCommentStatement blanket-skips all COMMENT variants (ON TABLE, ON COLUMN, etc.), but only COMMENT ON COLUMN has the table.column ambiguity. Verified: all COMMENT ON TABLE in the codebase use unquoted single identifiers, so blanket skip is safe. search_path provides fallback. Acknowledged
Note migration_runner.go:366-399 Double-split: processMigrationSQL splits, processes, rejoins. applyMigrationList splits again. Round-trip verified idempotent. Acknowledged
Note deploy-develop.yml:289 Broadened reset is more aggressive but well-justified for recovery from broken state. Acknowledged

Test Coverage

  • TestProcessMigrationSQL_CommentOnColumn_NotRewritten directly validates the fix
  • TestProcessMigrationSQL_CommentOnColumn_MixedWithDDL pins mixed-statement behavior
  • TestIsCommentStatement_SkipsLeadingComments covers comment detection edge cases
  • TestProcessMigrationSQL_AllMigrations_Parse exercises all real migrations with knownBroken emptied

Bot Review Notes

No unresolved bot threads. Previous inline comments resolved.

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See summary. 1 inline suggestion.

Comment thread services/tenant/provisioner/migration_runner.go
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@services/tenant/provisioner/migration_runner.go`:
- Around line 413-421: The isCommentStatement function fails to detect COMMENT
statements preceded by block comments; update isCommentStatement to skip leading
block comments (/* ... */) as well as line comments and blank lines before
checking the first non-comment token, e.g., in isCommentStatement iterate
through stmt lines (or scan tokens) to strip/skip both lines starting with "--"
and block comment spans that may start and end on the same line or across lines,
then perform strings.HasPrefix(strings.ToUpper(trimmed), "COMMENT ") on the
first non-comment chunk so statements like "/* preface */ COMMENT ON ..." are
correctly recognized.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6041708f-8214-43ee-af56-237156ed25df

📥 Commits

Reviewing files that changed from the base of the PR and between c478052 and 13dec02.

📒 Files selected for processing (4)
  • .github/workflows/deploy-develop.yml
  • services/tenant/provisioner/migration_runner.go
  • services/tenant/provisioner/migration_runner_test.go
  • tests/architecture/size_test.go

Comment thread services/tenant/provisioner/migration_runner.go
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 4, 2026

Codecov Report

❌ Patch coverage is 90.24390% with 4 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
services/tenant/provisioner/migration_runner.go 90.24% 2 Missing and 2 partials ⚠️

📢 Thoughts on this report? Let us know!

The E2E workflow created empty org_<tenant> schemas manually via
CREATE SCHEMA IF NOT EXISTS and relied on the public-schema search_path
fallback to resolve unqualified table references. PR #2125 removed
public from search_path, so those empty tenant schemas can no longer
satisfy any query and the manifest diff step fails with "relation
data_source / dataset_definition / instrument_definition does not exist".

Replace the manual schema workaround with real provisioner invocation:
- Stage per-service migration files under /tmp/provisioner-migrations
  so MIGRATIONS_BASE_PATH points at a flat layout matching the Docker
  image structure (/migrations/<service>/*.sql).
- Start meridian with SCHEMA_PROVISIONING_ENABLED=true and
  MIGRATIONS_BASE_PATH=/tmp/provisioner-migrations so seed-dev's
  InitiateTenant triggers real provisioning that creates schemas AND
  applies migrations to them.
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct hotfix for a real production blocker. The statement-level split + COMMENT skip approach is clean and well-tested. See summary comment for full analysis. No blocking concerns.

Address CodeRabbit review feedback: the initial isCommentStatement
implementation only skipped leading whitespace and line comments.
A statement of the form "/* audit */ COMMENT ON COLUMN ..." was not
recognized as a COMMENT and could still be corrupted by the schema
rewriter.

Extract stripLeadingNoise helper that iteratively strips whitespace,
-- line comments, and /* ... */ block comments, then check the
remaining text for the COMMENT keyword. Unterminated block comments
return false so no rewriting is attempted on malformed SQL.

Add TestIsCommentStatement_SkipsLeadingComments covering plain,
leading whitespace, line comment, block comment, mixed leading
comments, non-COMMENT DDL, mid-statement block comment, and
unterminated block comment cases.
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No blocking issues found. See summary comment for full analysis. Well-executed fix with strong test coverage.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
services/tenant/provisioner/migration_runner.go (1)

385-407: Consider removing the split→join→split round-trip.

processMigrationSQL now parses statements and serializes them back, then applyMigrationList parses again at Line 278. Returning rewritten statements directly would reduce parser drift risk and avoid duplicate work.

♻️ Refactor sketch
-processedSQL := p.processMigrationSQL(mig.Content, schemaName)
-statements := splitSQLStatements(processedSQL)
+statements := p.processMigrationStatements(mig.Content, schemaName)
-func (p *PostgresProvisioner) processMigrationSQL(sql, schemaName string) string {
+func (p *PostgresProvisioner) processMigrationStatements(sql, schemaName string) []string {
   ...
-  if len(rewritten) == 0 {
-    return ""
-  }
-  return strings.Join(rewritten, ";\n") + ";"
+  return rewritten
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/tenant/provisioner/migration_runner.go` around lines 385 - 407, The
current processMigrationSQL does splitSQLStatements -> rewrite -> strings.Join
and returns a single SQL string which forces applyMigrationList to call
splitSQLStatements again; instead change processMigrationSQL to return the
rewritten statement slice directly (e.g. []string) to avoid the split→join→split
round-trip: keep the existing statement parsing loop and
isCommentStatement/schemaPatterns rewriting logic inside processMigrationSQL,
change its signature to return []string (or add a new helper like
rewriteMigrationStatements that returns []string), update callers such as
applyMigrationList to accept and use the returned []string directly (removing
the extra splitSQLStatements call and the trailing semicolon join), and preserve
existing behavior for comment statements and quoting rules by reusing
schemaPatterns and the same replacement logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@services/tenant/provisioner/migration_runner.go`:
- Around line 385-407: The current processMigrationSQL does splitSQLStatements
-> rewrite -> strings.Join and returns a single SQL string which forces
applyMigrationList to call splitSQLStatements again; instead change
processMigrationSQL to return the rewritten statement slice directly (e.g.
[]string) to avoid the split→join→split round-trip: keep the existing statement
parsing loop and isCommentStatement/schemaPatterns rewriting logic inside
processMigrationSQL, change its signature to return []string (or add a new
helper like rewriteMigrationStatements that returns []string), update callers
such as applyMigrationList to accept and use the returned []string directly
(removing the extra splitSQLStatements call and the trailing semicolon join),
and preserve existing behavior for comment statements and quoting rules by
reusing schemaPatterns and the same replacement logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ca9d2538-87b1-42bc-a264-a1efc1a365a6

📥 Commits

Reviewing files that changed from the base of the PR and between 8887437 and ae006ff.

📒 Files selected for processing (2)
  • services/tenant/provisioner/migration_runner.go
  • services/tenant/provisioner/migration_runner_test.go

@bjcoombs bjcoombs merged commit 7964e77 into develop Apr 5, 2026
40 checks passed
@bjcoombs bjcoombs deleted the fix-develop-20260404-comment-on-column-rewriter branch April 5, 2026 08:01
bjcoombs added a commit that referenced this pull request Apr 5, 2026
PR #2128 added a tenant_provisioning reset to force re-provisioning
on every deploy, but the reset targeted the wrong table. The
provisioning worker polls for pending tenants via
ListByStatus(StatusProvisioningPending) which reads tenant.status,
not tenant_provisioning.state. Resetting only tenant_provisioning
left tenant.status='active', so the worker never claimed the
tenant, provisioning never re-ran, and seed-dev's manifest apply
still failed against the empty tenant schemas.

Also: seed-dev's waitForTenantReady calls GetTenantProvisioningStatus
whose OverallStatus is derived from tenant.status. With tenant.status
stuck at 'active', seed-dev returned immediately from its wait loop
and fired the manifest apply before the worker had a chance to act.

Reset both tables in the same psql call:
- tenant.status = 'provisioning_pending' (triggers worker pickup and
  makes seed-dev wait for real completion).
- tenant_provisioning.service_schemas = '[]' (forces provisionAllServices
  to re-apply migrations instead of short-circuiting on "already
  provisioned" entries).
bjcoombs added a commit that referenced this pull request Apr 5, 2026
…2131)

PR #2128 added a tenant_provisioning reset to force re-provisioning
on every deploy, but the reset targeted the wrong table. The
provisioning worker polls for pending tenants via
ListByStatus(StatusProvisioningPending) which reads tenant.status,
not tenant_provisioning.state. Resetting only tenant_provisioning
left tenant.status='active', so the worker never claimed the
tenant, provisioning never re-ran, and seed-dev's manifest apply
still failed against the empty tenant schemas.

Also: seed-dev's waitForTenantReady calls GetTenantProvisioningStatus
whose OverallStatus is derived from tenant.status. With tenant.status
stuck at 'active', seed-dev returned immediately from its wait loop
and fired the manifest apply before the worker had a chance to act.

Reset both tables in the same psql call:
- tenant.status = 'provisioning_pending' (triggers worker pickup and
  makes seed-dev wait for real completion).
- tenant_provisioning.service_schemas = '[]' (forces provisionAllServices
  to re-apply migrations instead of short-circuiting on "already
  provisioned" entries).

Co-authored-by: Ben Coombs <bjcoombs@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant