
planx: harden Hasura boot against slow Aurora cold starts #218

Merged
chrisns merged 1 commit into main from fix/planx-hasura-circuit-breaker on May 8, 2026


@chrisns chrisns commented May 8, 2026

Summary

A planx StackSet CREATE into a fresh sandbox account (214888068391) failed today with "ECS Deployment Circuit Breaker was triggered" on ComputeHasuraService29F8570B. The stack instance was rolled back, so we lost the per-task CloudWatch logs, but the failure mode is well understood from the planx scenario history:

  1. Aurora cold starts on a fresh sandbox can take 5–10 minutes.
  2. entrypoint-wrapper.sh waited 5 minutes for DNS + pg_isready, then logged WARNING: ... Continuing anyway... and started Hasura against a still-cold DB. Hasura crashed, ECS restarted, restart loop → circuit breaker.
  3. circuitBreaker: { rollback: true } rolled the stack back, deleting the logs that would have proved it.

Changes

  • docker/hasura/entrypoint-wrapper.sh — DNS + pg_isready waits extended from 5 to 10 minutes each, and now exit 1 on timeout instead of "continue anyway". A restarted ECS task re-resolves DNS, so exit + retry fits the lifecycle cleanly; the previous behaviour just guaranteed a doomed Hasura process.
  • cdk/lib/constructs/compute.ts — Hasura healthCheckGracePeriod bumped 15 → 30 min, and circuit breaker switched to { enable: true, rollback: false }. ECS still stops piling on tasks after the threshold, but the stack stays in CREATE_FAILED with logs intact rather than ROLLBACK_COMPLETE.

The other three services (api, sharedb, editor) keep their default circuit breaker behaviour — only Hasura has the fresh-DB cold-start problem.
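As a rough illustration of the "wait, then exit 1 on timeout" behaviour described for entrypoint-wrapper.sh, here is a minimal sketch. It is not the script from this PR: the `wait_for` helper, the `DB_HOST` and `DB_PORT` variables, the 5-second poll interval, and the `getent` DNS probe are all assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical helper (not taken from the PR's script):
# retry a readiness probe until it succeeds or the deadline passes.
#   wait_for <label> <timeout_seconds> <interval_seconds> <command...>
wait_for() {
  label=$1
  timeout=$2
  interval=$3
  shift 3
  elapsed=0
  until "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      # Fail loudly instead of "continuing anyway".
      echo "ERROR: $label not ready after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "$label ready after ${elapsed}s"
}

# Assumed usage inside the wrapper (DB_HOST / DB_PORT are placeholder names):
#   wait_for "DNS"      600 5 getent hosts "$DB_HOST"              || exit 1
#   wait_for "Postgres" 600 5 pg_isready -h "$DB_HOST" -p "$DB_PORT" || exit 1
#   ...then start Hasura as before.
```

Exiting non-zero hands the retry back to ECS, which starts a fresh task that resolves DNS again, rather than launching Hasura against a database that is known not to be ready.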

Test plan

  • cdk synth shows HealthCheckGracePeriodSeconds: 1800 and DeploymentCircuitBreaker.Rollback: false on Hasura's service.
  • The next failing fresh deploy should leave the stack in CREATE_FAILED with the /ndx-planx/production log group still present, so we can read what Hasura actually said.
  • StackSet template + image already pushed to hub blueprints; future isb assign ndx-try-planx provisions with the new behaviour.
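The first check in the test plan could be scripted along these lines. This is only a sketch: the `check_hasura_template` function name and the temp file path are made up for the example, it assumes `cdk synth` emits a single YAML template on stdout, and it greps the whole template rather than isolating Hasura's service, so a matching value on another service would also pass.

```shell
#!/bin/sh
# Hypothetical check (not part of the PR): confirm the synthesized template
# contains the two values named in the test plan.
#   check_hasura_template <template-file>
check_hasura_template() {
  template=$1
  grep -Eq 'HealthCheckGracePeriodSeconds: *1800' "$template" || {
    echo "missing HealthCheckGracePeriodSeconds: 1800" >&2
    return 1
  }
  grep -Eq 'Rollback: *false' "$template" || {
    echo "missing Rollback: false" >&2
    return 1
  }
  echo "template OK"
}

# Assumed usage (file path is a placeholder):
#   cdk synth > /tmp/planx-template.yaml
#   check_hasura_template /tmp/planx-template.yaml
```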

A planx StackSet CREATE op into account 214888068391 failed with
"ECS Deployment Circuit Breaker was triggered" on the Hasura service.
ECS doesn't surface a per-task reason in the StackSet output, but the
existing failure mode is well-understood:

1. Aurora can take 5-10 minutes to come up on a fresh sandbox account.
2. entrypoint-wrapper.sh waited 5 minutes for DNS / pg_isready, then
   logged "WARNING: ... Continuing anyway..." and started Hasura
   regardless. Hasura then crashed connecting to a still-cold DB, ECS
   restarted the task, repeat → circuit breaker tripped.
3. circuitBreaker: { rollback: true } meant the entire stack got rolled
   back, deleting the very CloudWatch logs that would have told us this.

Two changes:

- entrypoint-wrapper.sh: extend the DNS + pg_isready waits to 10 minutes
  each, and exit non-zero on timeout instead of "continuing anyway". A
  fresh ECS-restarted task re-resolves DNS, so an exit fits the retry
  semantics cleanly. Continuing past a missing DB just guarantees a
  doomed Hasura process.
- compute.ts: bump healthCheckGracePeriod from 15 to 30 minutes for the
  Hasura service, and switch its circuit breaker to enable=true,
  rollback=false. ECS still stops piling on tasks once it gives up, but
  the stack stays in CREATE_FAILED with logs intact instead of
  ROLLBACK_COMPLETE with everything gone.

The other three services (api, sharedb, editor) keep their default
circuit breaker behaviour — only Hasura has the fresh-DB cold-start
problem.
@chrisns chrisns added this pull request to the merge queue May 8, 2026
Merged via the queue into main with commit e748ef5 May 8, 2026
14 checks passed
@chrisns chrisns deleted the fix/planx-hasura-circuit-breaker branch May 8, 2026 15:04