
fix: Reset tenant.status in deploy-develop so worker re-provisions #2131

Merged
bjcoombs merged 1 commit into develop from fix-develop-deploy-tenant-status-reset
Apr 5, 2026

Conversation


@bjcoombs bjcoombs commented Apr 5, 2026

Summary

  • PR #2128 (fix: Skip COMMENT statements in tenant migration rewriter) reset the wrong table. The workflow updated tenant_provisioning, but the provisioning worker polls tenant.status via ListByStatus(StatusProvisioningPending). The worker never picked up existing tenants, provisioning never re-ran, and the manifest apply still failed.
  • Also, seed-dev's waitForTenantReady calls GetTenantProvisioningStatus, whose OverallStatus is derived from tenant.status. With tenant.status stale at 'active', seed-dev returned immediately and fired the manifest apply before the worker could act.
  • Reset both tables in the same psql call.
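To make the polling mismatch concrete, here is a minimal Go sketch. The Tenant struct, the listByStatus helper, and the status constant are illustrative stand-ins for the real repository code, not the actual implementation:

```go
package main

import "fmt"

// Tenant models only the fields relevant to the worker's poll query.
type Tenant struct {
	ID     string
	Status string
}

const StatusProvisioningPending = "provisioning_pending"

// listByStatus mimics the worker's ListByStatus query: it selects tenants
// by tenant.status and never looks at tenant_provisioning.state.
func listByStatus(tenants []Tenant, status string) []Tenant {
	var out []Tenant
	for _, t := range tenants {
		if t.Status == status {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	// After PR #2128's reset: tenant_provisioning.state = 'pending',
	// but tenant.status is still the stale 'active'.
	tenants := []Tenant{
		{ID: "volterra_energy", Status: "active"},
		{ID: "meridian_master", Status: "active"},
	}
	fmt.Println(len(listByStatus(tenants, StatusProvisioningPending))) // 0: worker sees nothing

	// This PR's fix: reset tenant.status as well.
	for i := range tenants {
		tenants[i].Status = StatusProvisioningPending
	}
	fmt.Println(len(listByStatus(tenants, StatusProvisioningPending))) // 2: worker picks both up
}
```

This is why resetting tenant_provisioning alone could never trigger re-provisioning: the query the worker runs simply never matches.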

Evidence

On develop right now (post-PR #2128 merge + deploy):

tenant_id        | state
volterra_energy  | pending      <- tenant_provisioning
meridian_master  | pending      <- tenant_provisioning

id               | status
volterra_energy  | active       <- tenant (still)
meridian_master  | active       <- tenant (still)

Deploy Develop CI log, 2026-04-05 08:06:

UPDATE 2                                                                  <- tenant_provisioning reset worked
Creating tenant 'volterra_energy' ...
Dev tenant already exists (idempotent).
Waiting for tenant provisioning ...
Tenant provisioned and active.                                            <- lie, tenant.status was just stale
Step [diff]: STEP_RESULT_STATUS_FAILED - relation "data_source" does not exist

Code references

  • services/tenant/worker/provisioning_worker.go:292 - w.repo.ListByStatus(ctx, domain.StatusProvisioningPending, ...) - the worker only picks up tenants whose tenant.status = 'provisioning_pending'.
  • services/tenant/service/grpc_provisioning_endpoints.go:151 - OverallStatus: s.toProtoStatus(tenant.Status) - seed-dev's wait loop reads this.
  • services/tenant/provisioner/postgres_provisioner.go:262 - the "service already provisioned, skipping" short-circuit still needs service_schemas = '[]' to bypass (kept from PR #2128, fix: Skip COMMENT statements in tenant migration rewriter).
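For reviewers unfamiliar with that third reference, a hypothetical sketch of the short-circuit follows. The function name and the shape of serviceSchemas are assumptions for illustration, not the real provisioner API:

```go
package main

import "fmt"

// needsProvisioning sketches the "service already provisioned, skipping"
// check: a service already recorded in tenant_provisioning.service_schemas
// is skipped, so its migrations are never re-applied.
func needsProvisioning(serviceSchemas []string, service string) bool {
	for _, s := range serviceSchemas {
		if s == service {
			return false // "service already provisioned, skipping"
		}
	}
	return true
}

func main() {
	// Before the reset: a ghost entry from a broken run blocks re-provisioning.
	fmt.Println(needsProvisioning([]string{"data", "auth"}, "data")) // false

	// After service_schemas = '[]'::jsonb: every service is re-applied.
	fmt.Println(needsProvisioning(nil, "data")) // true
}
```

Hence the workflow must clear service_schemas in addition to flipping tenant.status; either reset alone leaves the pipeline stuck.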

Test plan

  • Merge and trigger develop deploy
  • Watch droplet: tenant.status should flip through provisioning_pending → provisioning → active before seed-dev runs
  • Deploy Develop CI goes green
  • Manually verify org_volterra_energy schemas contain actual tables afterwards

PR #2128 added a tenant_provisioning reset to force re-provisioning
on every deploy, but the reset targeted the wrong table. The
provisioning worker polls for pending tenants via
ListByStatus(StatusProvisioningPending) which reads tenant.status,
not tenant_provisioning.state. Resetting only tenant_provisioning
left tenant.status='active', so the worker never claimed the
tenant, provisioning never re-ran, and seed-dev's manifest apply
still failed against the empty tenant schemas.

Also: seed-dev's waitForTenantReady calls GetTenantProvisioningStatus
whose OverallStatus is derived from tenant.status. With tenant.status
stuck at 'active', seed-dev returned immediately from its wait loop
and fired the manifest apply before the worker had a chance to act.

Reset both tables in the same psql call:
- tenant.status = 'provisioning_pending' (triggers worker pickup and
  makes seed-dev wait for real completion).
- tenant_provisioning.service_schemas = '[]' (forces provisionAllServices
  to re-apply migrations instead of short-circuiting on "already
  provisioned" entries).
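The seed-dev race described above can be sketched as follows. waitForTenantReady here is an illustrative stand-in; the real helper polls GetTenantProvisioningStatus over gRPC, whose OverallStatus is derived directly from tenant.Status:

```go
package main

import (
	"fmt"
	"time"
)

// waitForTenantReady polls until the reported status is "active" or the
// poll budget is exhausted. Because the status it sees is derived from
// tenant.Status, a stale 'active' satisfies it on the very first poll.
func waitForTenantReady(poll func() string, maxPolls int, interval time.Duration) bool {
	for i := 0; i < maxPolls; i++ {
		if poll() == "active" {
			return true
		}
		time.Sleep(interval)
	}
	return false
}

func main() {
	calls := 0
	// Stale tenant.status = 'active': the loop "succeeds" immediately,
	// before the worker has provisioned anything.
	stale := func() string { calls++; return "active" }
	fmt.Println(waitForTenantReady(stale, 10, 0), calls) // true 1
}
```

With tenant.status reset to 'provisioning_pending', the same loop keeps polling until the worker has genuinely driven the tenant to 'active', which is what makes the manifest apply safe to fire.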

coderabbitai Bot commented Apr 5, 2026

📝 Walkthrough

The workflow's post-migration tenant reset logic now updates two database tables instead of one. A new UPDATE statement on the tenant table is added before the existing tenant_provisioning update, expanding the SQL command from single to multi-statement execution.

Changes

Post-migration Database Reset (.github/workflows/deploy-develop.yml)
Updated tenant reset SQL logic to perform sequential updates on both the tenant and tenant_provisioning tables. The tenant table is updated to set status to 'provisioning_pending' and refresh the updated_at timestamp, followed by the existing tenant_provisioning reset that sets state to 'pending' and clears service_schemas.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Title check: Passed. The title accurately summarizes the main change: resetting tenant.status in deploy-develop to allow worker re-provisioning, the primary objective of this PR.
  • Description check: Passed. The description is directly related to the changeset, providing detailed context about the issue, evidence, code references, and test plan for the database update changes.
  • Docstring Coverage: Passed. No functions found in the changed files to evaluate; docstring coverage check skipped.


claude Bot left a comment

Clean, well-diagnosed fix. The root cause analysis is thorough: PR #2128 only reset tenant_provisioning.state but the provisioning worker polls tenant.status via ListByStatus(StatusProvisioningPending), so tenants with stale status='active' were never picked up for re-provisioning. The seed-dev wait loop also reads tenant.Status via GetTenantProvisioningStatus, so it returned immediately before the worker could act.

Verified:

  • 'provisioning_pending' matches domain.StatusProvisioningPending exactly (tenant.go:18)
  • 'deprovisioned' exclusion is correct — matches the terminal state constant (tenant.go:28)
  • The worker's processPendingTenants at provisioning_worker.go:292 confirms it queries by domain.StatusProvisioningPending
  • grpc_provisioning_endpoints.go:151 confirms OverallStatus is derived from tenant.Status, validating the seed-dev race diagnosis
  • Both UPDATEs run in a single psql -c call — they execute in the same implicit transaction, so the reset is atomic
  • updated_at = NOW() on the tenant table is a good addition for auditability
  • Comments are excellent — they explain why both tables must be reset in tandem and reference the specific source files

No concerns. This is a targeted CI workflow fix with zero application code changes and clear blast radius (develop deploy only).


claude Bot commented Apr 5, 2026

Claude Code Review

Commit: a10ff90 | CI: running (security checks passing, CodeRabbit pending)

Summary

Correct fix for the deploy-develop provisioning failure. PR #2128 only reset tenant_provisioning.state but missed tenant.status, which is the field the provisioning worker actually polls. This caused a two-part failure: (1) worker never picked up tenants for re-provisioning, and (2) seed-dev's wait loop saw stale status='active' and proceeded before provisioning could run.

The fix adds a second UPDATE to reset tenant.status to 'provisioning_pending' alongside the existing tenant_provisioning reset, both in a single atomic psql call. Comments thoroughly document why both tables must be reset in tandem.

Risk Assessment

  • Blast radius: Low (develop deploy workflow only, no prod impact)
  • Rollback: Safe (revert restores previous, broken behavior)
  • Scale: N/A (CI workflow, not runtime code)
  • Cross-system: Low (no application code changes)
  • Migration: N/A (no migrations)

Findings

No issues found. All status string values verified against domain constants.

Bot Review Notes

No unresolved bot review threads at time of review. CodeRabbit review still pending.

coderabbitai Bot left a comment

🧹 Nitpick comments (1)
.github/workflows/deploy-develop.yml (1)

303-306: Fix correctly resets both tables to trigger re-provisioning.

The SQL properly:

  • Sets tenant.status = 'provisioning_pending' matching the StatusProvisioningPending constant the worker polls.
  • Clears service_schemas = '[]'::jsonb to bypass the "already provisioned" short-circuit.
  • Resets state = 'pending' and clears error_message.

Minor inconsistency: tenant updates updated_at = NOW(), but tenant_provisioning does not, despite also having an updated_at column. Consider adding it for consistency in audit/debugging.

🔧 Optional: Add updated_at to tenant_provisioning for consistency
             docker exec postgres-develop psql -U meridian -d meridian_platform -c "
               UPDATE tenant SET status = 'provisioning_pending', updated_at = NOW() WHERE status != 'deprovisioned';
-              UPDATE tenant_provisioning SET state = 'pending', service_schemas = '[]'::jsonb, error_message = '' WHERE state != 'deprovisioned';
+              UPDATE tenant_provisioning SET state = 'pending', service_schemas = '[]'::jsonb, error_message = '', updated_at = NOW() WHERE state != 'deprovisioned';
             "
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/deploy-develop.yml around lines 303 - 306, The
tenant_provisioning UPDATE should also set updated_at so both tables reflect the
same audit timestamp; update the SQL in the workflow where you run the two
UPDATEs (the docker exec psql block) to include "updated_at = NOW()" in the
tenant_provisioning SET clause (alongside state = 'pending', service_schemas =
'[]'::jsonb, error_message = '') so it matches the tenant table behavior and the
worker polling against StatusProvisioningPending.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e6c5e60e-afb6-44cb-b33e-58537352b2cd

📥 Commits

Reviewing files that changed from the base of the PR and between 7964e77 and a10ff90.

📒 Files selected for processing (1)
  • .github/workflows/deploy-develop.yml

@bjcoombs bjcoombs merged commit c49d82b into develop Apr 5, 2026
21 checks passed
@bjcoombs bjcoombs deleted the fix-develop-deploy-tenant-status-reset branch April 5, 2026 08:43
bjcoombs added a commit that referenced this pull request Apr 6, 2026
The demo deploy had the same broken reset as develop had before PRs
#2131 and #2133:
- Only reset tenant_provisioning.state, not tenant.status (worker
  polls tenant.status via ListByStatus(StatusProvisioningPending))
- Didn't clear service_schemas JSONB (provisioner short-circuits on
  "service already provisioned, skipping")
- Didn't drop ghost schemas from previous broken runs
- Didn't stop the app before resetting (race between worker and
  schema drops)

Port the same fix from deploy-develop.yml:
- Stop meridian before resetting state
- DROP SCHEMA CASCADE for org_* schemas across all service databases
- Reset both tenant.status and tenant_provisioning.service_schemas
- Start meridian after reset
