Step-by-step procedures for common operational scenarios. Each runbook includes symptoms, diagnosis steps, and resolution actions.
Prerequisites: Before running kubectl commands, set up cluster access with gco stacks access -r <region>. This configures your kubeconfig and sets the current context to the target cluster.
- Region Goes Unhealthy
- Secret Rotation Fails
- Global Accelerator Stops Routing to a Region
- SQS Dead Letter Queue Filling Up
- Manifest Processor Rejecting Valid Jobs
- High API Gateway Latency
- EKS Cluster Unreachable
- Inference Endpoint Not Serving Traffic
- Cost Spike Detection
Region Goes Unhealthy
Symptoms: gco capacity status shows a region as unhealthy. Global Accelerator stops routing traffic to the region. Cross-region aggregator returns errors for the affected region.
Diagnosis:
# 1. Check health from the CLI
gco jobs list -r <region>
# 2. Check the health endpoint directly
gco capacity status
# 3. Check CloudWatch alarms in the monitoring dashboard
# Look for: EKS CPU/memory alarms, ALB unhealthy hosts, Lambda errors
# 4. Check EKS cluster status
aws eks describe-cluster --name gco-<region> --region <region> \
--query 'cluster.status'
# 5. Check node health (if cluster is reachable)
kubectl get nodes
Resolution:
- If EKS API is unreachable: Check VPC networking, security groups, and EKS control plane status in the AWS console. EKS Auto Mode manages nodes automatically, so if the control plane is healthy, nodes should recover.
- If ALB health checks are failing: Check the health monitor and manifest processor pods:
  kubectl get pods -n gco-system
  kubectl logs -n gco-system deployment/health-monitor
- If nodes are NotReady: EKS Auto Mode should replace unhealthy nodes automatically. Check CloudWatch for node group scaling events. If stuck, check the NodePool configuration:
  kubectl get nodepools
- If the region is permanently degraded: Traffic is automatically routed to healthy regions via Global Accelerator. No immediate action required for availability, but investigate root cause.
Escalation: If the cluster is completely unreachable and not recovering after 15 minutes, check the AWS Health Dashboard for regional service issues.
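The 15-minute escalation rule above can be made explicit while you wait for recovery. The following is a minimal sketch, not part of gco: the helper name should_escalate and the polling loop are illustrative assumptions, and the loop is commented out because it needs live cluster access.

```shell
# Hypothetical helper: has the outage lasted long enough to escalate?
# Arguments: outage start (epoch seconds), current time (epoch seconds),
# optional threshold in seconds (defaults to 900, i.e. 15 minutes).
should_escalate() {
  _start=$1
  _now=$2
  _threshold=${3:-900}
  [ $(( _now - _start )) -ge "$_threshold" ]
}

# Example polling loop (commented out: requires a configured kubeconfig):
# start=$(date +%s)
# until kubectl get nodes >/dev/null 2>&1; do
#   if should_escalate "$start" "$(date +%s)"; then
#     echo "Cluster still unreachable after 15 minutes: check the AWS Health Dashboard"
#     break
#   fi
#   sleep 60
# done
```

Keeping the time comparison in its own function makes the threshold easy to adjust and to check without touching a cluster.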
Secret Rotation Fails
Symptoms: CloudWatch alarm fires for secret rotation failure. API requests start failing with 403 after the old secret expires. The secret-rotation Lambda shows errors in CloudWatch Logs.
Diagnosis:
# 1. Check the rotation Lambda logs
aws logs filter-log-events \
--log-group-name /aws/lambda/gco-secret-rotation \
--filter-pattern "ERROR" \
--start-time $(date -d '1 hour ago' +%s000)
# 2. Check the secret's rotation status
aws secretsmanager describe-secret \
--secret-id gco-auth-token \
--query '{LastRotated: LastRotatedDate, NextRotation: NextRotationDate, Versions: VersionIdsToStages}'
# 3. Check if AWSPENDING version exists (stuck rotation)
aws secretsmanager get-secret-value \
--secret-id gco-auth-token \
--version-stage AWSPENDING 2>&1
Resolution:
- If rotation Lambda is failing: Check IAM permissions on the rotation Lambda role. It needs secretsmanager:GetSecretValue, secretsmanager:PutSecretValue, and secretsmanager:UpdateSecretVersionStage.
- If rotation is stuck (AWSPENDING exists but never promoted):
  # Cancel the stuck rotation
  aws secretsmanager cancel-rotate-secret --secret-id gco-auth-token
  # Trigger a fresh rotation
  aws secretsmanager rotate-secret --secret-id gco-auth-token
- If API requests are failing with 403 right now: The auth middleware caches tokens for 5 minutes. After fixing the secret, wait up to 5 minutes for caches to refresh, or restart the manifest-processor pods to force a cache clear:
  kubectl rollout restart deployment/manifest-processor -n gco-system
Prevention: The monitoring stack includes a CloudWatch alarm for rotation failures. Ensure the SNS topic has subscribers.
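The log query in the diagnosis steps computes its start time with date -d, which is a GNU extension and fails on the BSD date shipped with macOS. As a hedged alternative, the sketch below derives the same epoch-millisecond value with shell arithmetic only. The helper name epoch_ms_ago is an assumption introduced here for illustration.

```shell
# Hypothetical helper: epoch milliseconds for "N seconds ago".
# Uses only `date +%s` and shell arithmetic, which behave the same
# with GNU and BSD date.
epoch_ms_ago() {
  echo "$(( ( $(date +%s) - $1 ) * 1000 ))"
}

# Example (commented out: requires AWS credentials):
# aws logs filter-log-events \
#   --log-group-name /aws/lambda/gco-secret-rotation \
#   --filter-pattern "ERROR" \
#   --start-time "$(epoch_ms_ago 3600)"
```

CloudWatch Logs expects --start-time in milliseconds since the epoch, which is why the result is multiplied by 1000.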
Global Accelerator Stops Routing to a Region
Symptoms: Traffic is not reaching a specific region even though the EKS cluster is healthy. gco capacity status shows the region as healthy but no jobs are landing there.
Diagnosis:
# 1. Check GA endpoint group health (the Global Accelerator API is served from us-west-2)
aws globalaccelerator list-endpoint-groups \
--listener-arn <listener-arn> \
--region us-west-2 \
--query 'EndpointGroups[].{Region:EndpointGroupRegion,Health:EndpointDescriptions[].HealthState}'
# 2. Check if the ALB is registered with GA (its ARN should appear as an EndpointId)
aws globalaccelerator describe-endpoint-group \
--endpoint-group-arn <endpoint-group-arn> \
--region us-west-2
# 3. Check ALB health in the region
aws elbv2 describe-target-health \
--target-group-arn <target-group-arn> \
--region <region>
# 4. Check the GA registration Lambda logs
aws logs filter-log-events \
--log-group-name /aws/lambda/gco-ga-registration-<region> \
--filter-pattern "ERROR" \
--region <region>
Resolution:
- If ALB is not registered: The GA registration Lambda runs during stack deployment. Trigger a stack update to re-register:
  gco stacks deploy -r <region> -y
- If ALB health checks are failing: GA health checks hit /api/v1/health on the ALB. Check that the health monitor pod is running and the ALB target group has healthy targets.
- If GA endpoint is unhealthy: Check the health check configuration in cdk.json under global_accelerator. The grace period and interval may need adjustment if the region takes longer to warm up.
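When several regions sit behind one listener, it can help to reduce the endpoint group listing to just the regions that need attention. This is a minimal sketch under stated assumptions: the helper name unhealthy_regions is hypothetical, and the input is assumed to be tab-separated "region, state" rows such as the AWS CLI produces with --output text for a two-column query.

```shell
# Hypothetical helper: read "REGION<TAB>STATE" rows on stdin and print every
# region whose state is anything other than HEALTHY (for example UNHEALTHY,
# INITIAL, or None when no endpoint is registered).
unhealthy_regions() {
  awk '$2 != "HEALTHY" { print $1 }'
}

# Example (commented out: requires AWS credentials and a real listener ARN):
# aws globalaccelerator list-endpoint-groups \
#   --listener-arn <listener-arn> --region us-west-2 \
#   --query 'EndpointGroups[].[EndpointGroupRegion, EndpointDescriptions[0].HealthState]' \
#   --output text | unhealthy_regions
```

The query looks only at the first endpoint in each group, which matches a one-ALB-per-region layout; groups with multiple endpoints would need a wider query.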
SQS Dead Letter Queue Filling Up
Symptoms: gco queue stats shows messages in the DLQ. Jobs submitted via SQS are not being processed. The queue processor logs show repeated failures.
Diagnosis:
# 1. Check queue status
gco queue stats
# 2. Check DLQ message count
aws sqs get-queue-attributes \
--queue-url <dlq-url> \
--attribute-names ApproximateNumberOfMessages \
--region <region>
# 3. Sample a DLQ message to see the failure reason
aws sqs receive-message \
--queue-url <dlq-url> \
--max-number-of-messages 1 \
--region <region>
# 4. Check queue processor logs
kubectl logs -n gco-system deployment/sqs-consumer --tail=100
Resolution:
- If messages are malformed YAML: The DLQ message body contains the original manifest. Fix the YAML and resubmit via gco jobs submit-sqs.
- If the queue processor is crashing: Check pod status and restart:
  kubectl get pods -n gco-system -l app=sqs-consumer
  kubectl rollout restart deployment/sqs-consumer -n gco-system
- If messages are valid but failing validation: Check resource limits in cdk.json under manifest_processor. The job may exceed CPU/memory/GPU limits.
- To replay DLQ messages (after fixing the root cause):
  # Move messages from DLQ back to main queue
  aws sqs start-message-move-task \
    --source-arn <dlq-arn> \
    --destination-arn <main-queue-arn> \
    --region <region>
Prevention: The monitoring stack deploys a CloudWatch alarm on ApproximateNumberOfMessagesVisible for the DLQ. If the alarm fires, messages are accumulating — follow the diagnosis steps above.
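The diagnosis commands take a queue URL, while the replay command takes queue ARNs. As a convenience sketch, the helper below derives an ARN from a standard SQS queue URL. The name sqs_url_to_arn is hypothetical, and the conversion assumes the default aws partition and the https://sqs.<region>.amazonaws.com/<account>/<name> URL form; the authoritative value is the QueueArn attribute returned by aws sqs get-queue-attributes.

```shell
# Hypothetical helper: convert a standard SQS queue URL to its ARN.
# Assumes the "aws" partition and a URL of the form
#   https://sqs.<region>.amazonaws.com/<account-id>/<queue-name>
sqs_url_to_arn() {
  _rest=${1#https://}        # sqs.<region>.amazonaws.com/<account>/<name>
  _host=${_rest%%/*}         # sqs.<region>.amazonaws.com
  _path=${_rest#*/}          # <account>/<name>
  _account=${_path%%/*}
  _name=${_path#*/}
  _region=${_host#sqs.}      # <region>.amazonaws.com
  _region=${_region%%.*}     # <region>
  echo "arn:aws:sqs:${_region}:${_account}:${_name}"
}

# Authoritative alternative (commented out: requires AWS credentials):
# aws sqs get-queue-attributes --queue-url <dlq-url> \
#   --attribute-names QueueArn --query 'Attributes.QueueArn' --output text
```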
Manifest Processor Rejecting Valid Jobs
Symptoms: Job submissions return validation errors even though the manifest looks correct. Common errors: "CPU exceeds max", "Namespace not allowed", "Untrusted image source".
Diagnosis:
# 1. Dry-run the manifest to see the exact error
gco jobs submit my-job.yaml -n gco-jobs --dry-run
# 2. Check current resource limits
gco config-cmd show | grep -A5 resource_quotas
# 3. Check allowed namespaces
gco config-cmd show | grep -A5 allowed_namespaces
Resolution:
- Resource limit exceeded: Update resource_quotas in cdk.json and redeploy:
  "manifest_processor": {
    "max_cpu_per_manifest": "32",
    "max_memory_per_manifest": "128Gi",
    "max_gpu_per_manifest": 8
  }
  Then:
  gco stacks deploy -r <region> -y
- Namespace not allowed: Add the namespace to allowed_namespaces in cdk.json:
  "manifest_processor": {
    "allowed_namespaces": ["default", "gco-jobs", "my-namespace"]
  }
- Untrusted image source: Add the registry to trusted_registries in cdk.json:
  "manifest_processor": {
    "trusted_registries": ["docker.io", "gcr.io", "my-registry.example.com"]
  }
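Two of the rejections above can be sanity-checked locally before changing cdk.json. The sketch below is an approximation for triage, not the manifest processor's actual validation logic, which this runbook does not show. All function names are hypothetical. CPU parsing handles only whole cores ("32") and millicores ("500m"), and registry detection follows the usual Docker convention that a first path component containing a dot or colon, or equal to localhost, is a registry host.

```shell
# Hypothetical helper: convert a Kubernetes CPU quantity to millicores.
# Supports whole cores ("4") and millicores ("500m"); decimals are not handled.
cpu_to_millicores() {
  case $1 in
    *m) echo "${1%m}" ;;
    *)  echo "$(( $1 * 1000 ))" ;;
  esac
}

# Succeeds when the requested CPU ($1) does not exceed the limit ($2).
cpu_within_limit() {
  [ "$(cpu_to_millicores "$1")" -le "$(cpu_to_millicores "$2")" ]
}

# Hypothetical helper: print the registry host of an image reference,
# defaulting to docker.io when no host is present.
image_registry() {
  case $1 in
    */*) ;;                              # has a path, inspect it below
    *)   echo "docker.io"; return ;;     # e.g. "nginx:latest"
  esac
  _first=${1%%/*}
  case $_first in
    *.*|*:*|localhost) echo "$_first" ;;
    *)                 echo "docker.io" ;;   # e.g. "library/nginx"
  esac
}

# Succeeds when the image ($1) comes from one of the listed registries.
image_is_trusted() {
  _reg=$(image_registry "$1")
  shift
  for _allowed in "$@"; do
    if [ "$_reg" = "$_allowed" ]; then return 0; fi
  done
  return 1
}
```

For example, cpu_within_limit 64 32 fails, which would correspond to a "CPU exceeds max" rejection under the sample limit of 32, and image_is_trusted quay.io/org/app docker.io gcr.io fails because quay.io is not in the list.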
High API Gateway Latency
Symptoms: API requests take >5 seconds. CloudWatch shows elevated Latency metric on the API Gateway. Users report slow gco jobs submit commands.
Diagnosis:
# 1. Check API Gateway latency metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/ApiGateway \
--metric-name Latency \
--dimensions Name=ApiName,Value=gco-global \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
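--period 300 --statistics Average
# For p99, rerun the same command with --extended-statistics p99 in place of
# --statistics Average; get-metric-statistics accepts only one of the two per call.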
# 2. Check X-Ray traces (if tracing is enabled)
# Open X-Ray console → Traces → filter by service "gco"
# 3. Check Lambda cold starts
aws logs filter-log-events \
--log-group-name /aws/lambda/gco-api-proxy \
--filter-pattern "INIT_START"
Resolution:
- Lambda cold starts: The proxy Lambda has a 29s timeout. Cold starts add 1-3s. If cold starts are frequent, consider provisioned concurrency. It must target a published version or an alias, because Lambda does not support provisioned concurrency on $LATEST:
  aws lambda put-provisioned-concurrency-config \
    --function-name gco-api-proxy \
    --qualifier <version-or-alias> \
    --provisioned-concurrent-executions 5
- Global Accelerator routing latency: Check if traffic is being routed to the nearest region. Use traceroute to the GA endpoint to verify.
- ALB target response time: Check the ALB TargetResponseTime metric. If the manifest processor is slow, check pod resource utilization and consider scaling replicas.
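To spot which 5-minute windows breached the 5-second symptom threshold, the metric datapoints can be filtered on the command line. This is a rough sketch: the helper name slow_datapoints is hypothetical, and the input is assumed to be tab-separated "timestamp, value" rows, with values in milliseconds as reported by the API Gateway Latency metric.

```shell
# Hypothetical helper: read "TIMESTAMP<TAB>VALUE_MS" rows on stdin and print
# the timestamps whose value exceeds the given limit in milliseconds.
slow_datapoints() {
  awk -v limit="$1" '($2 + 0) > (limit + 0) { print $1 }'
}

# Example (commented out: requires AWS credentials; start and end times
# are placeholders to fill in):
# aws cloudwatch get-metric-statistics \
#   --namespace AWS/ApiGateway --metric-name Latency \
#   --dimensions Name=ApiName,Value=gco-global \
#   --start-time <start-time> --end-time <end-time> \
#   --period 300 --statistics Average \
#   --query 'Datapoints[].[Timestamp,Average]' --output text \
#   | slow_datapoints 5000
```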
EKS Cluster Unreachable
Symptoms: kubectl commands fail with connection errors. gco stacks list shows the cluster but gco jobs list -r <region> fails.
Diagnosis:
# 1. Check cluster status
aws eks describe-cluster --name gco-<region> --region <region> \
--query 'cluster.{Status:status,Endpoint:endpoint,Access:resourcesVpcConfig.endpointPublicAccess}'
# 2. Check if your kubeconfig is current
gco stacks access -r <region>
# 3. Check VPC connectivity (if endpoint is private)
aws ec2 describe-vpc-endpoints \
--filters Name=vpc-id,Values=<vpc-id> \
--region <region>
Resolution:
- If cluster status is not ACTIVE: Wait for the cluster to finish updating. EKS updates can take 10-20 minutes.
- If kubeconfig is stale: Refresh it:
  gco stacks access -r <region>
- If endpoint access mode is PRIVATE: You need to be on the VPC or use the regional API Gateway:
  gco jobs list -r <region> --regional-api
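The three resolution branches above form a small decision table over the describe-cluster output. The sketch below encodes that table. The function name eks_next_step and its return labels are illustrative assumptions, and the second argument is expected to be the endpointPublicAccess value as printed by the AWS CLI (True/False with --output text, true/false in JSON).

```shell
# Hypothetical helper: map cluster status and public endpoint access to the
# next runbook action. Prints one of: wait, use-regional-api, refresh-kubeconfig.
eks_next_step() {
  _status=$1
  _public=$2
  if [ "$_status" != "ACTIVE" ]; then
    echo "wait"                 # cluster is still creating or updating
  elif [ "$_public" = "False" ] || [ "$_public" = "false" ]; then
    echo "use-regional-api"     # private endpoint: kubectl needs VPC access
  else
    echo "refresh-kubeconfig"   # cluster looks fine, so suspect local config
  fi
}

# Example (commented out: requires AWS credentials):
# set -- $(aws eks describe-cluster --name gco-<region> --region <region> \
#   --query 'cluster.[status, resourcesVpcConfig.endpointPublicAccess]' --output text)
# eks_next_step "$1" "$2"
```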
Inference Endpoint Not Serving Traffic
Symptoms: gco inference status <name> shows the endpoint but requests fail. Health checks return errors.
Diagnosis:
# 1. Check endpoint status
gco inference status <name>
# 2. Check pod health
gco inference health <name>
# 3. Check pod logs
kubectl logs -n gco-inference deployment/<name> --tail=50
# 4. Check if the model loaded successfully
gco inference models <name>
Resolution:
- If pods are in CrashLoopBackOff: Check logs for OOM errors or model loading failures. Increase memory/GPU resources.
- If pods are running but not ready: The readiness probe may be failing. Check if the model finished loading (large models can take 5-10 minutes).
- If the service is unreachable: Check the Kubernetes Service and Ingress:
  kubectl get svc,ingress -n gco-inference
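Because large models can take several minutes to load, it is often better to poll for readiness than to conclude early that the endpoint is broken. A minimal retry helper might look like the following. The name retry_until is hypothetical, and the usage line assumes that the command being retried exits non-zero until the endpoint is ready, which this runbook does not confirm for gco inference health.

```shell
# Hypothetical helper: run a command up to MAX times, sleeping DELAY seconds
# between attempts. Succeeds as soon as the command succeeds.
# Usage: retry_until MAX DELAY command [args...]
retry_until() {
  _max=$1
  _delay=$2
  shift 2
  _attempt=1
  while [ "$_attempt" -le "$_max" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$_attempt" -lt "$_max" ]; then
      sleep "$_delay"
    fi
    _attempt=$(( _attempt + 1 ))
  done
  return 1
}

# Example (commented out: requires cluster access). Polls for up to 10 minutes:
# retry_until 20 30 kubectl rollout status deployment/<name> \
#   -n gco-inference --timeout=5s
```

For a single Deployment, kubectl rollout status deployment/<name> -n gco-inference --timeout=10m achieves the same wait without a helper; the generic form is useful when the check is some other command.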
Cost Spike Detection
Symptoms: gco costs summary shows unexpected increase. AWS Cost Explorer shows higher-than-expected charges.
Diagnosis:
# 1. Check cost breakdown by region
gco costs regions
# 2. Check cost trend
gco costs trend --days 14
# 3. Check for forgotten inference endpoints
gco inference list
# 4. Check for stuck jobs consuming GPU resources
gco jobs list --all-regions --status running
# 5. Check node pool sizes
gco nodepools list -r us-east-1
Resolution:
- Forgotten inference endpoints: Stop or delete unused endpoints:
  gco inference stop <name>
  gco inference delete <name>
- Stuck jobs: Delete completed/failed jobs that are still holding resources:
  gco jobs bulk-delete --status completed --older-than-days 7 --all-regions --execute -y
  gco jobs bulk-delete --status failed --older-than-days 3 --all-regions --execute -y
- Unexpected node scaling: EKS Auto Mode scales nodes based on pending pods. Check if there are pods stuck in Pending that are triggering unnecessary scaling.
- For ongoing monitoring: Set up AWS Budgets with alerts:
  aws budgets create-budget \
    --account-id <account-id> \
    --budget file://budget.json \
    --notifications-with-subscribers file://notifications.json
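The create-budget command above reads two JSON files that this runbook does not show. The sketch below writes plausible starting versions of both into the current directory. Every value in them is a placeholder assumption: the budget name gco-monthly, the 1000 USD limit, the 80% threshold, and the subscriber address should all be replaced with your own.

```shell
# Write a monthly cost budget definition (placeholder name and amount).
cat > budget.json <<'EOF'
{
  "BudgetName": "gco-monthly",
  "BudgetLimit": { "Amount": "1000", "Unit": "USD" },
  "TimeUnit": "MONTHLY",
  "BudgetType": "COST"
}
EOF

# Write one notification: email when actual spend passes 80% of the limit.
cat > notifications.json <<'EOF'
[
  {
    "Notification": {
      "NotificationType": "ACTUAL",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 80,
      "ThresholdType": "PERCENTAGE"
    },
    "Subscribers": [
      { "SubscriptionType": "EMAIL", "Address": "oncall@example.com" }
    ]
  }
]
EOF
```

Adding a second notification with "NotificationType": "FORECASTED" would warn before the spend actually lands, which may suit spike detection better than an actual-spend alert alone.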