planx: harden Hasura boot against slow Aurora cold starts#218
Merged
Conversation
A planx StackSet CREATE op into account 214888068391 failed with
"ECS Deployment Circuit Breaker was triggered" on the Hasura service.
ECS doesn't surface a per-task reason in the StackSet output, but the
existing failure mode is well-understood:
1. Aurora can take 5-10 minutes to come up on a fresh sandbox account.
2. entrypoint-wrapper.sh waited 5 minutes for DNS / pg_isready, then
logged "WARNING: ... Continuing anyway..." and started Hasura
regardless. Hasura then crashed connecting to a still-cold DB, ECS
restarted the task, repeat → circuit breaker tripped.
3. circuitBreaker: { rollback: true } meant the entire stack got rolled
back, deleting the very CloudWatch logs that would have told us this.
Two changes:
- entrypoint-wrapper.sh: extend the DNS + pg_isready waits to 10 minutes
each, and exit non-zero on timeout instead of "continuing anyway". A
fresh ECS-restarted task re-resolves DNS, so an exit fits the retry
semantics cleanly. Continuing past a missing DB just guarantees a
doomed Hasura process.
- compute.ts: bump healthCheckGracePeriod from 15 to 30 minutes for the
Hasura service, and switch its circuit breaker to enable=true,
rollback=false. ECS still stops piling on tasks once it gives up, but
the stack stays in CREATE_FAILED with logs intact instead of
ROLLBACK_COMPLETE with everything gone.
The other three services (api, sharedb, editor) keep their default
circuit breaker behaviour — only Hasura has the fresh-DB cold-start
problem.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
A planx StackSet CREATE into a fresh sandbox account (214888068391) failed today with
ECS Deployment Circuit Breaker was triggeredonComputeHasuraService29F8570B. The instance was rolled back so we lost the per-task CloudWatch logs, but the failure mode is well-understood from the planx scenario history:entrypoint-wrapper.shwaited 5 minutes for DNS +pg_isready, then loggedWARNING: ... Continuing anyway...and started Hasura against a still-cold DB. Hasura crashed, ECS restarted, restart loop → circuit breaker.circuitBreaker: { rollback: true }rolled the stack back, deleting the logs that would have proved it.Changes
docker/hasura/entrypoint-wrapper.sh— DNS +pg_isreadywaits extended from 5 to 10 minutes each, and nowexit 1on timeout instead of "continue anyway". A restarted ECS task re-resolves DNS, so exit + retry fits the lifecycle cleanly; the previous behaviour just guaranteed a doomed Hasura process.cdk/lib/constructs/compute.ts— HasurahealthCheckGracePeriodbumped 15 → 30 min, and circuit breaker switched to{ enable: true, rollback: false }. ECS still stops piling on tasks after the threshold, but the stack stays inCREATE_FAILEDwith logs intact rather thanROLLBACK_COMPLETE.The other three services (api, sharedb, editor) keep their default circuit breaker behaviour — only Hasura has the fresh-DB cold-start problem.
Test plan
cdk synthshowsHealthCheckGracePeriodSeconds: 1800andDeploymentCircuitBreaker.Rollback: falseon Hasura's service.CREATE_FAILEDwith/ndx-planx/productionlog group still present, so we can read what Hasura actually said.isb assign ndx-try-planxprovisions with the new behaviour.