
fix(ci): cloud-deploy-backend migrations fail fast when secret missing #7853

Merged: lalalune merged 1 commit into `develop` from `fix/cloud-deploy-backend-fail-fast-on-missing-db-secret` on May 21, 2026

standujar (Collaborator) commented May 20, 2026

Root cause incident (2026-05-20)

User `sukimyfun` reported being billed $20 by Stripe but not credited. Investigation surfaced a deeper issue: the `migrate-db` job in `cloud-deploy-backend.yml` has been silently skipping every migration since 2026-04-22.

Trail:

  1. Worker observability showed `POST /api/stripe/webhook → 500` with `[CloudApi] Unhandled error`, plus repeated `[AiBillingRecordsRepository] insert failed` and `[Chat Completions] audit record failed` on every chat completion.

  2. Querying Neon prod: table `ai_billing_records` did not exist.

  3. `__drizzle_migrations` showed last applied migration = #72 (`2026-04-22`), while the develop branch is at #118. 46+ migrations missing in prod.

  4. Inspecting recent `cloud-deploy-backend.yml` CI runs showed the migrate-db job succeeding but the actual command line read:

    ```
    MIGRATION_DATABASE_URL:
    Step "Skip migrations when database secret is unavailable" ran (no-op).
    ✓ migrate-db job marked SUCCESS.
    ```

The workflow had `if: env.MIGRATION_DATABASE_URL == ''` on the skip step and `if: env.MIGRATION_DATABASE_URL != ''` on the actual migrate step — so when the secret was unset, both the migrate step and the Discord notify steps were silently bypassed and the overall job exited green.

The secret `DATABASE_URL` (or `RAILWAY_DATABASE_URL` / `NEON_DATABASE_URL`) was simply never added to the `production` GitHub environment after the cloud migration into the elizaOS/eliza monorepo.
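How the workflow derives `MIGRATION_DATABASE_URL` from these secrets is not shown in this description. A minimal sketch, assuming a first-non-empty fallback in the order the secrets are listed above (the function name `resolve_migration_url` and the precedence are assumptions, not taken from the workflow file):

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns the first non-empty candidate secret.
# The precedence DATABASE_URL > RAILWAY_DATABASE_URL > NEON_DATABASE_URL
# is an assumption made for illustration.
resolve_migration_url() {
  printf '%s' "${DATABASE_URL:-${RAILWAY_DATABASE_URL:-${NEON_DATABASE_URL:-}}}"
}
```

With all three secrets unset, the function prints an empty string, which is the state that the old workflow treated as "skip and succeed".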

What this PR changes

  • The "skip migrations" step is replaced by `Fail fast when database secret is missing`, which emits a `::error::` annotation pointing at exactly where to add the secret and then runs `exit 1`. A deploy that ships code expecting a newer schema MUST NOT succeed when the schema can't be migrated to match.
  • The `Run migrations` step drops its `if: ... != ''` guard, since the previous step now blocks when the URL is empty.
  • The Discord notify steps drop the same guard so failures get reported instead of being swallowed.
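The fail-fast check from the first bullet can be sketched as a small shell function. This is an illustration only: the function name `require_migration_url` and the message wording are assumptions, and only the `::error::` annotation and the non-zero exit come from the description above.

```shell
#!/usr/bin/env bash
# Sketch of the fail-fast check. Returns non-zero when the URL is empty so
# the calling step can stop the job before any migration runs.
require_migration_url() {
  local url="${1:-}"
  if [ -z "$url" ]; then
    # GitHub Actions turns stdout lines starting with ::error:: into
    # error annotations on the run.
    echo "::error::MIGRATION_DATABASE_URL is empty. Add DATABASE_URL under Settings → Environments → production → Environment secrets."
    return 1
  fi
  echo "Database secret present; continuing to migrations."
}
```

Inside a workflow step this would be invoked as `require_migration_url "$MIGRATION_DATABASE_URL" || exit 1`, so an empty value fails the step instead of letting it pass as a no-op.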

What this PR doesn't do (follow-ups on Stan)

  • Add `DATABASE_URL` to the `production` environment secrets. This is the action that actually unblocks the next deploy. Settings → Environments → `production` → Environment secrets.
  • Decide on migration `0110_finalize_steward_user_id_drop_privy.sql`. It aborts intentionally because 2657 active users still have `privy_user_id` but no `steward_user_id` — dropping the column would orphan them. Either backfill Steward IDs for those users (preferred) or split the migration so the column drop is gated separately.
  • Investigate the still-firing `[ContentModeration] Background moderation failed` and `POST /api/auth/steward-refresh → 502` errors. Both are unrelated to migrations — different root cause (external API call + upstream Steward server).
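For the first follow-up, the secret can also be added from a terminal with the GitHub CLI instead of the Settings UI. A hedged sketch: the wrapper name `add_environment_secret` and the `GH_BIN` override (used only for dry runs) are hypothetical, while `gh secret set` with `--env` and `--repo` is the standard CLI command.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around `gh secret set`.
# Arguments: $1 = secret name, $2 = environment name, $3 = owner/repo.
# Without --body, gh reads the secret value from stdin or prompts for it,
# so the value never appears in shell history.
add_environment_secret() {
  local name="$1" env_name="$2" repo="$3"
  # GH_BIN is an illustrative override; set GH_BIN=echo to preview the command.
  "${GH_BIN:-gh}" secret set "$name" --env "$env_name" --repo "$repo"
}
```

For example, `GH_BIN=echo add_environment_secret DATABASE_URL production elizaOS/eliza` prints the arguments that would be passed to `gh` without touching the repository.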

Already done out-of-band

  • Migrations `0115_add_sensitive_requests`, `0116_add_payment_requests`, `0117_add_voice_imprints`, `0118_add_ai_billing_and_alert_events` were applied manually to prod via `psql` to stop the live error cascade. Effective at 17:31 UTC — `AiBillingRecordsRepository` errors went from ~1/sec to zero immediately after.
  • `sukimyfun` was manually credited $20 (transaction id `b5fdb08c-4074-4522-a918-a8d8d4ee3b72`) since their Stripe webhook event was dropped before the fix.

Greptile Summary

This PR replaces a silent migration skip with a hard failure when the DATABASE_URL/RAILWAY_DATABASE_URL/NEON_DATABASE_URL secret is absent from the target GitHub environment. The root cause was a silent no-op on the migrate-db job that let 46+ migrations accumulate against production unnoticed from 2026-04-22 onward.

  • Fail-fast step: replaces the old "skip" step with `exit 1` + `::error::` annotation, ensuring a missing secret blocks the deploy rather than silently passing the job as green.
  • Run migrations: drops the now-redundant `if: MIGRATION_DATABASE_URL != ''` guard since the prior step already terminates the job when the URL is empty.
  • Discord notifications: both success and failure notify steps drop the `&& MIGRATION_DATABASE_URL != ''` guard so failures are reported instead of swallowed.

Confidence Score: 5/5

Safe to merge — the change is a single-file CI fix that makes a previously silent no-op fail loudly, with no effect on application code.

The three-line logical change is internally consistent: the fail-fast step blocks the job on a missing secret, so the downstream Run migrations step's old guard is correctly redundant, and both Discord notification steps behave correctly under success() / failure() for all reachable states (secret present, secret absent, real migration error). No edge case was found that could silently pass or double-notify.
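The three reachable states can be walked through with a small model of the step conditions. This sketch does not run any migration or call Discord; the function name `simulate_migrate_job` and its output labels are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Models the job's step conditions after this PR.
# Arguments: $1 = "set" or "unset" (database secret),
#            $2 = "ok" or "fail" (outcome of the migration command).
simulate_migrate_job() {
  local secret="$1" migration="$2" status="success"
  if [ "$secret" = "unset" ]; then
    echo "fail-fast: exit 1"
    status="failure"
  fi
  if [ "$status" = "success" ]; then
    echo "run-migrations: $migration"
    [ "$migration" = "ok" ] || status="failure"
  else
    # A default step is skipped once an earlier step has failed.
    echo "run-migrations: skipped"
  fi
  # Exactly one notify step matches: success() or failure().
  echo "notify-discord: $status"
}
```

Calling it with `unset ok`, `set ok`, and `set fail` covers the three states; each run ends with a single `notify-discord` line, matching the claim that no state passes silently or notifies twice.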

No files require special attention; the workflow change is minimal and targeted.

Important Files Changed

| Filename | Overview |
|---|---|
| `.github/workflows/cloud-deploy-backend.yml` | Silent migration skip replaced with a hard `exit 1`; Discord notification guards and `Run migrations` `if` guard correctly removed; logic on both happy and failure paths is sound. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[migrate-db job starts] --> B{MIGRATION_DATABASE_URL set?}
    B -- "No (was: silent skip)" --> C["❌ Fail fast\n::error:: annotation\nexit 1"]
    C --> D["Notify Discord ❌ Failure\n(if: failure())"]
    C --> E["Run migrations\n(skipped — prev step failed)"]
    B -- "Yes" --> F["'Fail fast' step\nskipped (if == '' is false)"]
    F --> G["Run migrations\nbun run db:cloud:migrate"]
    G -- success --> H["Notify Discord ✅ Success\n(if: success())"]
    G -- failure --> I["Notify Discord ❌ Failure\n(if: failure())"]

The migrate-db job used to silently skip when DATABASE_URL /
RAILWAY_DATABASE_URL / NEON_DATABASE_URL weren't set on the
environment. Result: 46+ migrations silently accumulated against
Neon prod from 2026-04-22 onward — last applied migration was #72
(2026-04-22) while develop was already at #118.

Worker prod was running code that expected newer tables
(ai_billing_records, sensitive_requests, payment_requests,
voice_imprint_clusters, etc.) which didn't exist. Cascade:
  - `[AiBillingRecordsRepository] insert failed` on every chat completion
  - `[Chat Completions] audit record failed`
  - `POST /api/stripe/webhook → 500` (sukimyfun's $20 top-up dropped)

A deploy that ships code requiring a newer schema must not succeed
if the schema can't be migrated to match. Replace the silent-skip
step with an explicit `exit 1` + GH error annotation pointing at the
exact secret to set.

Drops the `if: env.MIGRATION_DATABASE_URL != ''` guards on the
Discord notifications — they no longer have to defend against the
"silent no-op" path.

github-actions Bot added the `ci` label May 20, 2026
lalalune merged commit `ef29942` into `develop` May 21, 2026
22 of 28 checks passed
lalalune deleted the `fix/cloud-deploy-backend-fail-fast-on-missing-db-secret` branch May 21, 2026 01:06