planx: harden Hasura boot against slow Aurora cold starts

chrisns · chrisns · commit 7d19c8547738 · 2026-05-08T16:02:30.000+01:00
A planx StackSet CREATE op into account 214888068391 failed with
"ECS Deployment Circuit Breaker was triggered" on the Hasura service.
ECS doesn't surface a per-task reason in the StackSet output, but the
existing failure mode is well understood:

1. Aurora can take 5-10 minutes to come up on a fresh sandbox account.
2. entrypoint-wrapper.sh waited up to 5 minutes each for DNS and
   pg_isready, then logged "WARNING: ... Continuing anyway..." and
   started Hasura regardless. Hasura then crashed connecting to a
   still-cold DB, ECS restarted the task, repeat → circuit breaker
   tripped.
3. circuitBreaker: { rollback: true } meant the entire stack got rolled
   back, deleting the very CloudWatch logs that would have told us this.

Two changes:

- entrypoint-wrapper.sh: extend the DNS + pg_isready waits to 10 minutes
  each, and exit non-zero on timeout instead of "continuing anyway". A
  fresh ECS-restarted task re-resolves DNS, so an exit fits the retry
  semantics cleanly. Continuing past a missing DB just guarantees a
  doomed Hasura process.
- compute.ts: bump healthCheckGracePeriod from 15 to 30 minutes for the
  Hasura service, and switch its circuit breaker to enable=true,
  rollback=false. ECS still stops piling on tasks once it gives up, but
  the stack stays in CREATE_FAILED with logs intact instead of
  ROLLBACK_COMPLETE with everything gone.
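
The timeouts in the two changes above can be checked with some quick
arithmetic. The commit does not say why 30 minutes was chosen for the
grace period; the sketch below assumes the DNS and pg_isready waits run
back to back and that each iteration costs roughly its 5-second sleep.
The variable names are illustrative only.

```shell
#!/bin/sh
# Rough worst-case time spent in entrypoint-wrapper.sh before Hasura starts.
# Assumption: the two wait loops run one after the other, and the time taken
# by nslookup / pg_isready themselves is small compared with the sleep.
attempts=120        # iterations per wait loop (seq 1 120)
interval=5          # seconds slept between attempts
waits=2             # DNS wait, then pg_isready wait
grace_minutes=30    # healthCheckGracePeriod for the Hasura service

per_wait_minutes=$(( attempts * interval / 60 ))
worst_case_minutes=$(( per_wait_minutes * waits ))
headroom_minutes=$(( grace_minutes - worst_case_minutes ))

echo "per wait: ${per_wait_minutes} min"
echo "worst case before Hasura starts: ${worst_case_minutes} min"
echo "headroom inside grace period: ${headroom_minutes} min"
```

Under those assumptions a task that only just passes both waits has
spent about 20 minutes in the wrapper, which the old 15-minute grace
period would not have covered.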

The other three services (api, sharedb, editor) keep their default
circuit breaker behaviour — only Hasura has the fresh-DB cold-start
problem.
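
For illustration, the hard-fail wait pattern applied to both loops in
entrypoint-wrapper.sh could be factored into one function. This is a
minimal sketch, not what the commit does: wait_for is a hypothetical
helper, and the commented usage lines assume the same DB_* variables
the script already sets.

```shell
#!/bin/sh
# wait_for LABEL ATTEMPTS INTERVAL CMD [ARGS...]
# Hypothetical helper (not part of entrypoint-wrapper.sh). Runs CMD until it
# succeeds. Returns 0 on the first success, 1 once ATTEMPTS tries have failed.
# Note: POSIX sh has no local variables, so these names are global.
wait_for() {
  label=$1
  attempts=$2
  interval=$3
  shift 3
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" > /dev/null 2>&1; then
      echo "$label ready."
      return 0
    fi
    # Sleep only between attempts, not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
  echo "ERROR: $label not ready after $attempts attempts; exiting so ECS can retry." >&2
  return 1
}

# Possible usage in the wrapper, mirroring the two existing loops:
#   wait_for "DNS" 120 5 nslookup "$DB_HOST" || exit 1
#   wait_for "PostgreSQL" 120 5 env PGPASSWORD="$DB_PASSWORD" \
#     pg_isready -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" || exit 1
```

Returning a status and leaving the exit to the caller keeps the
non-zero exit visible at the call site, which is the behaviour this
commit relies on for ECS to restart the task.
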
diff --git a/cloudformation/scenarios/planx/cdk/lib/constructs/compute.ts b/cloudformation/scenarios/planx/cdk/lib/constructs/compute.ts
@@ -171,9 +171,16 @@ export class ComputeConstruct extends Construct {
       vpcSubnets: { subnetType: ec2.SubnetType.PUBLIC },
       assignPublicIp: true,
       enableExecuteCommand: true,
-      circuitBreaker: { rollback: true },
+      // Don't roll the entire stack back when Hasura's first boot is slow.
+      // Aurora cold starts can be 5-10 minutes on a fresh sandbox; with
+      // rollback enabled, a couple of restart attempts trip the circuit
+      // breaker and the StackSet operation FAILS with no useful per-task
+      // reason (we lose the CloudWatch logs along with the rolled-back
+      // stack). Keeping the breaker enabled but disabling rollback still
+      // stops ECS launching tasks once it gives up, and leaves the failed
+      // stack and its logs in place so the actual task error is visible.
+      circuitBreaker: { enable: true, rollback: false },
       serviceName: 'NdxPlanx-Hasura',
-      healthCheckGracePeriod: cdk.Duration.minutes(15),
+      healthCheckGracePeriod: cdk.Duration.minutes(30),
     });
 
     // =========================================================================
diff --git a/cloudformation/scenarios/planx/docker/hasura/entrypoint-wrapper.sh b/cloudformation/scenarios/planx/docker/hasura/entrypoint-wrapper.sh
@@ -15,28 +15,37 @@ DB_NAME=$(echo "$DB_URL" | sed -n 's|.*/\([^?]*\).*|\1|p')
 echo "DB Host: $DB_HOST"
 echo "DB Name: $DB_NAME"
 
-# 1. Wait for DNS resolution (Aurora DNS can take a minute to propagate)
+# 1. Wait for DNS resolution (Aurora DNS can take a minute to propagate).
+# Hard-fail if it never resolves: continuing past this point with a missing
+# host means Hasura starts up, fails its DB connection, exits, ECS restarts,
+# and the deployment circuit breaker eventually trips with no useful error.
+# Better to have the container exit with a clear message and let ECS restart
+# it (re-resolving DNS each time) than to start a Hasura that cannot connect.
 echo "[1/3] Waiting for DNS resolution of $DB_HOST..."
-for i in $(seq 1 60); do
+for i in $(seq 1 120); do
   if nslookup "$DB_HOST" > /dev/null 2>&1; then
     echo "DNS resolved."
     break
   fi
-  if [ $i -eq 60 ]; then
-    echo "WARNING: DNS resolution timed out after 5 minutes. Continuing anyway..."
+  if [ $i -eq 120 ]; then
+    echo "ERROR: DNS resolution failed after 10 minutes; exiting so ECS can retry."
+    exit 1
   fi
   sleep 5
 done
 
-# 2. Wait for PostgreSQL to accept connections
+# 2. Wait for PostgreSQL to accept connections. Same hard-fail rationale as
+# above: proceeding when PG is unreachable just guarantees a Hasura crash
+# loop and a tripped circuit breaker.
 echo "[2/3] Waiting for PostgreSQL to be ready..."
-for i in $(seq 1 60); do
+for i in $(seq 1 120); do
   if PGPASSWORD="$DB_PASSWORD" pg_isready -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" > /dev/null 2>&1; then
     echo "PostgreSQL ready."
     break
   fi
-  if [ $i -eq 60 ]; then
-    echo "WARNING: PostgreSQL not ready after 5 minutes. Continuing anyway..."
+  if [ $i -eq 120 ]; then
+    echo "ERROR: PostgreSQL not ready after 10 minutes; exiting so ECS can retry."
+    exit 1
   fi
   sleep 5
 done