fix(ci): cloud-deploy-backend migrations fail fast when secret missing #7853
The `migrate-db` job used to silently skip when `DATABASE_URL` / `RAILWAY_DATABASE_URL` / `NEON_DATABASE_URL` weren't set on the environment. Result: 46+ migrations went unapplied against Neon prod from 2026-04-22 onward. The last applied migration was #72 (2026-04-22) while develop was already at #118. Worker prod was running code that expected newer tables (`ai_billing_records`, `sensitive_requests`, `payment_requests`, `voice_imprint_clusters`, etc.) which didn't exist.

Cascade:
- `[AiBillingRecordsRepository] insert failed` on every chat completion
- `[Chat Completions] audit record failed`
- `POST /api/stripe/webhook → 500` (sukimyfun's $20 top-up dropped)

A deploy that ships code requiring a newer schema must not succeed if the schema can't be migrated to match.

Replace the silent-skip step with an explicit `exit 1` + GH error annotation pointing at the exact secret to set. Drops the `if: env.MIGRATION_DATABASE_URL != ''` guards on the Discord notifications, since they no longer have to defend against the "silent no-op" path.
Root cause incident (2026-05-20)
User `sukimyfun` reported being billed $20 by Stripe but not credited. Investigation surfaced a deeper issue: the `migrate-db` job in `cloud-deploy-backend.yml` has been silently skipping every migration since 2026-04-22.
Trail:
Worker observability showed `POST /api/stripe/webhook → 500` with `[CloudApi] Unhandled error`, plus repeated `[AiBillingRecordsRepository] insert failed` and `[Chat Completions] audit record failed` on every chat completion.
Querying Neon prod: table `ai_billing_records` did not exist.
`__drizzle_migrations` showed last applied migration = #72 (`2026-04-22`), while the develop branch is at #118, leaving 46+ migrations missing in prod.
Inspecting recent `cloud-deploy-backend.yml` CI runs showed the `migrate-db` job succeeding, but the step output read:
```
MIGRATION_DATABASE_URL:
Step "Skip migrations when database secret is unavailable" ran (no-op).
✓ migrate-db job marked SUCCESS.
```
The workflow had `if: env.MIGRATION_DATABASE_URL == ''` on the skip step and `if: env.MIGRATION_DATABASE_URL != ''` on the actual migrate step — so when the secret was unset, both the migrate step and the Discord notify steps were silently bypassed and the overall job exited green.
The secret `DATABASE_URL` (or `RAILWAY_DATABASE_URL` / `NEON_DATABASE_URL`) was simply never added to the `production` GitHub environment after the cloud migration into the elizaOS/eliza monorepo.
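The fail-fast behaviour that replaces the skip step can be sketched as a pair of shell functions. This is a minimal illustration, not the PR's actual step body: the function names `resolve_migration_url` and `require_migration_url`, the fallback order between the three secrets, and the annotation wording are all assumptions. Only the three secret names, the `::error::` annotation, and the non-zero exit come from the description above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a fail-fast guard for the migrate-db job.
# Assumption: the three secrets are exposed to the step as environment
# variables and the first non-empty one wins, in this order.

resolve_migration_url() {
  # Nested default expansion: fall through to the next variable
  # whenever the current one is unset or empty.
  printf '%s' "${DATABASE_URL:-${RAILWAY_DATABASE_URL:-${NEON_DATABASE_URL:-}}}"
}

require_migration_url() {
  local url
  url="$(resolve_migration_url)"
  if [ -z "$url" ]; then
    # "::error::" is the GitHub Actions workflow command that turns
    # this line into an error annotation on the run.
    echo "::error::No migration database URL: set DATABASE_URL, RAILWAY_DATABASE_URL, or NEON_DATABASE_URL on the target environment."
    return 1
  fi
  # Deliberately do not echo the URL: it carries credentials.
  echo "migration database URL resolved"
}
```

In a workflow step this would be called as `require_migration_url || exit 1`, so a missing secret turns the job red instead of letting it pass as a no-op.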
What this PR changes
What this PR doesn't do (follow-ups on Stan)
Already done out-of-band
Greptile Summary
This PR replaces a silent migration skip with a hard failure when the `DATABASE_URL`/`RAILWAY_DATABASE_URL`/`NEON_DATABASE_URL` secret is absent from the target GitHub environment. The root cause was a silent no-op on the `migrate-db` job that let 46+ migrations accumulate against production unnoticed from 2026-04-22 onward.

- The skip step is replaced with `exit 1` + `::error::` annotation, ensuring a missing secret blocks the deploy rather than silently passing the job as green.
- The `Run migrations` step drops its `if: MIGRATION_DATABASE_URL != ''` guard since the prior step already terminates the job when the URL is empty.
- The Discord notification steps drop the `&& MIGRATION_DATABASE_URL != ''` guard so failures are reported instead of swallowed.

Confidence Score: 5/5
Safe to merge — the change is a single-file CI fix that makes a previously silent no-op fail loudly, with no effect on application code.
The three-line logical change is internally consistent: the fail-fast step blocks the job on a missing secret, so the downstream Run migrations step's old guard is correctly redundant, and both Discord notification steps behave correctly under success() / failure() for all reachable states (secret present, secret absent, real migration error). No edge case was found that could silently pass or double-notify.
No files require special attention; the workflow change is minimal and targeted.
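The three reachable states named above (secret present, secret absent, real migration error) can be walked through with a small simulation. This is a hedged sketch of the step ordering only, not the workflow itself: `simulate_migrate_job` and its two arguments are hypothetical, and it relies on the GitHub Actions rule that a step without an explicit `if` runs only while every earlier step has succeeded.

```shell
#!/usr/bin/env bash
# Hypothetical model of the migrate-db step conditions after the fix.
# $1 = resolved migration URL (may be empty)
# $2 = exit code the migration command would return (assumed input)

simulate_migrate_job() {
  local url="$1" migrate_rc="$2" job_ok=1

  # Step 1: fail fast, guarded by "URL is empty".
  if [ -z "$url" ]; then
    echo "fail-fast: exit 1"
    job_ok=0
  fi

  # Step 2: run migrations. No explicit guard, so it is skipped
  # automatically once an earlier step has failed.
  if [ "$job_ok" -eq 1 ]; then
    if [ "$migrate_rc" -eq 0 ]; then
      echo "migrate: ok"
    else
      echo "migrate: failed"
      job_ok=0
    fi
  fi

  # Steps 3 and 4: success() and failure() are mutually exclusive,
  # so exactly one notification fires in every state.
  if [ "$job_ok" -eq 1 ]; then
    echo "notify: success"
  else
    echo "notify: failure"
  fi

  # Job result mirrors the overall step outcome.
  [ "$job_ok" -eq 1 ]
}
```

Calling it with an empty URL, a URL plus exit code `0`, and a URL plus a non-zero exit code covers the three states. Each call prints a single `notify:` line, and only the second returns success.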
Important Files Changed
`cloud-deploy-backend.yml`: silent-skip step replaced with `exit 1`; Discord notification guards and `Run migrations` `if` guard correctly removed; logic on both happy and failure paths is sound.

Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[migrate-db job starts] --> B{MIGRATION_DATABASE_URL set?}
    B -- "No (was: silent skip)" --> C["❌ Fail fast\n::error:: annotation\nexit 1"]
    C --> D["Notify Discord ❌ Failure\n(if: failure())"]
    C --> E["Run migrations\n(skipped — prev step failed)"]
    B -- "Yes" --> F["'Fail fast' step\nskipped (if == '' is false)"]
    F --> G["Run migrations\nbun run db:cloud:migrate"]
    G -- success --> H["Notify Discord ✅ Success\n(if: success())"]
    G -- failure --> I["Notify Discord ❌ Failure\n(if: failure())"]

Reviews (1): Last reviewed commit: "fix(ci): cloud-deploy-backend migrations..."