Operational Runbooks

Step-by-step procedures for common operational scenarios. Each runbook includes symptoms, diagnosis steps, and resolution actions.

Prerequisites: Before running kubectl commands, set up cluster access with gco stacks access -r <region>. This configures your kubeconfig and sets the current context to the target cluster.

Region Goes Unhealthy
Secret Rotation Fails
Global Accelerator Stops Routing to a Region
SQS Dead Letter Queue Filling Up
Manifest Processor Rejecting Valid Jobs
High API Gateway Latency
EKS Cluster Unreachable
Inference Endpoint Not Serving Traffic
Cost Spike Detection

Region Goes Unhealthy

Symptoms: gco capacity status shows a region as unhealthy. Global Accelerator stops routing traffic to the region. The cross-region aggregator returns errors for the affected region.

Diagnosis:

# 1. Confirm the region's API answers requests from the CLI
gco jobs list -r <region>

# 2. Check the health status reported for each region
gco capacity status

# 3. Check CloudWatch alarms in the monitoring dashboard
# Look for: EKS CPU/memory alarms, ALB unhealthy hosts, Lambda errors

# 4. Check EKS cluster status
aws eks describe-cluster --name gco-<region> --region <region> \
  --query 'cluster.status'

# 5. Check node health (if cluster is reachable)
kubectl get nodes

Resolution:

If EKS API is unreachable: Check VPC networking, security groups, and EKS control plane status in the AWS console. EKS Auto Mode manages nodes automatically — if the control plane is healthy, nodes should recover.
If ALB health checks are failing: Check the health monitor and manifest processor pods:
```
kubectl get pods -n gco-system
kubectl logs -n gco-system deployment/health-monitor
```
If nodes are NotReady: EKS Auto Mode should replace unhealthy nodes automatically. Check CloudWatch for node group scaling events. If stuck, check the NodePool configuration:
```
kubectl get nodepools
```
If the region is permanently degraded: Traffic is automatically routed to healthy regions via Global Accelerator. No immediate action required for availability, but investigate root cause.

Escalation: If the cluster is completely unreachable and not recovering after 15 minutes, check the AWS Health Dashboard for regional service issues.
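The 15-minute escalation window can be scripted as a polling loop. This is a minimal sketch, assuming the cluster is named gco-<region> as in the diagnosis commands above. The function name wait_for_cluster_active and the 30-second default poll interval are illustrative, not part of the gco CLI.

```bash
# wait_for_cluster_active REGION [TIMEOUT_SECONDS] [POLL_SECONDS]
# Polls the EKS control plane status until it reports ACTIVE or the timeout
# elapses. Returns 0 when ACTIVE, 1 when it is time to escalate.
# Assumption: the cluster is named gco-<region>.
wait_for_cluster_active() {
  local region="$1" timeout="${2:-900}" interval="${3:-30}"
  local waited=0 status
  while :; do
    # A failed API call (unreachable, missing cluster) is recorded as UNKNOWN.
    status=$(aws eks describe-cluster --name "gco-${region}" --region "$region" \
      --query 'cluster.status' --output text 2>/dev/null) || status="UNKNOWN"
    echo "gco-${region}: ${status} (${waited}s elapsed)"
    [ "$status" = "ACTIVE" ] && return 0
    [ "$waited" -ge "$timeout" ] && break
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "gco-${region} not ACTIVE after ${timeout}s; check the AWS Health Dashboard" >&2
  return 1
}
```

Run it as wait_for_cluster_active <region>. The default timeout of 900 seconds matches the 15 minutes stated above.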

Secret Rotation Fails

Symptoms: CloudWatch alarm fires for secret rotation failure. API requests start failing with 403 after the old secret expires. The secret-rotation Lambda shows errors in CloudWatch Logs.

Diagnosis:

# 1. Check the rotation Lambda logs
aws logs filter-log-events \
  --log-group-name /aws/lambda/gco-secret-rotation \
  --filter-pattern "ERROR" \
  --start-time $(( ($(date +%s) - 3600) * 1000 ))  # one hour ago in epoch milliseconds; works with GNU and BSD date

# 2. Check the secret's rotation status
aws secretsmanager describe-secret \
  --secret-id gco-auth-token \
  --query '{LastRotated: LastRotatedDate, NextRotation: NextRotationDate, Versions: VersionIdsToStages}'

# 3. Check if AWSPENDING version exists (stuck rotation)
aws secretsmanager get-secret-value \
  --secret-id gco-auth-token \
  --version-stage AWSPENDING 2>&1

Resolution:

If rotation Lambda is failing: Check IAM permissions on the rotation Lambda role. It needs secretsmanager:DescribeSecret, secretsmanager:GetSecretValue, secretsmanager:PutSecretValue, and secretsmanager:UpdateSecretVersionStage.

If rotation is stuck (AWSPENDING exists but never promoted):

# Cancel the stuck rotation (this also turns off automatic rotation)
aws secretsmanager cancel-rotate-secret --secret-id gco-auth-token

# Trigger a fresh rotation, which turns automatic rotation back on
aws secretsmanager rotate-secret --secret-id gco-auth-token

If API requests are failing with 403 right now: The auth middleware caches tokens for 5 minutes. After fixing the secret, wait up to 5 minutes for caches to refresh, or restart the manifest-processor pods to force a cache clear:
```
kubectl rollout restart deployment/manifest-processor -n gco-system
```

Prevention: The monitoring stack includes a CloudWatch alarm for rotation failures. Ensure the SNS topic has subscribers.
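The AWSPENDING check from the diagnosis steps can be wrapped in a small helper for use in scripts or alerts. This is a sketch only: the function name rotation_state is hypothetical, and an AWSPENDING label also appears briefly during a healthy rotation, so treat "pending" as stuck only if it persists across repeated checks.

```bash
# rotation_state SECRET_ID
# Prints "pending" when any version of the secret still carries the AWSPENDING
# staging label, otherwise "clean". Returns 2 if the describe call fails.
rotation_state() {
  local secret_id="$1" stages
  stages=$(aws secretsmanager describe-secret --secret-id "$secret_id" \
    --query 'VersionIdsToStages' --output json) || return 2
  case "$stages" in
    *'"AWSPENDING"'*) echo "pending" ;;
    *)                echo "clean" ;;
  esac
}
```

For example, rotation_state gco-auth-token prints "pending" while a rotation is in flight or stuck.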

Global Accelerator Stops Routing to a Region

Symptoms: Traffic is not reaching a specific region even though the EKS cluster is healthy. gco capacity status shows the region as healthy but no jobs are landing there.

Diagnosis:

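# 1. Check GA endpoint group health
# (Global Accelerator API calls must target us-west-2, wherever the endpoints live)
aws globalaccelerator list-endpoint-groups \
  --listener-arn <listener-arn> \
  --region us-west-2 \
  --query 'EndpointGroups[].{Region:EndpointGroupRegion,Health:EndpointDescriptions[].HealthState}'

# 2. Check if the ALB is registered with GA
aws globalaccelerator describe-endpoint-group \
  --endpoint-group-arn <endpoint-group-arn> \
  --region us-west-2 \
  --query 'EndpointGroup.EndpointDescriptions[].{Endpoint:EndpointId,Health:HealthState}'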

# 3. Check ALB health in the region
aws elbv2 describe-target-health \
  --target-group-arn <target-group-arn> \
  --region <region>

# 4. Check the GA registration Lambda logs
aws logs filter-log-events \
  --log-group-name /aws/lambda/gco-ga-registration-<region> \
  --filter-pattern "ERROR" \
  --region <region>

Resolution:

If ALB is not registered: The GA registration Lambda runs during stack deployment. Trigger a stack update to re-register:
```
gco stacks deploy -r <region> -y
```
If ALB health checks are failing: GA health checks hit /api/v1/health on the ALB. Check that the health monitor pod is running and the ALB target group has healthy targets.
If GA endpoint is unhealthy: Check the health check configuration in cdk.json under global_accelerator. The grace period and interval may need adjustment if the region takes longer to warm up.
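To tell the "not registered" case apart from the "registered but unhealthy" case, list only the endpoints that Global Accelerator does not consider healthy. This is a minimal sketch, with unhealthy_ga_endpoints as a hypothetical helper name. It relies on the Global Accelerator API being served from us-west-2 regardless of where the endpoints live.

```bash
# unhealthy_ga_endpoints ENDPOINT_GROUP_ARN
# Prints "<endpoint id><TAB><health state>" for every endpoint in the group
# whose HealthState is not HEALTHY. For ALB endpoints the id is the ALB ARN.
# Empty output means every registered endpoint is healthy (or none is registered).
unhealthy_ga_endpoints() {
  local group_arn="$1"
  aws globalaccelerator describe-endpoint-group \
    --endpoint-group-arn "$group_arn" \
    --region us-west-2 \
    --query "EndpointGroup.EndpointDescriptions[?HealthState!='HEALTHY'].[EndpointId,HealthState]" \
    --output text
}
```

If the region's ALB ARN is missing from both this output and the full endpoint list, the ALB is not registered and a stack redeploy is the likely fix.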

SQS Dead Letter Queue Filling Up

Symptoms: gco queue stats shows messages in the DLQ. Jobs submitted via SQS are not being processed. The queue processor logs show repeated failures.

Diagnosis:

# 1. Check queue status
gco queue stats

# 2. Check DLQ message count
aws sqs get-queue-attributes \
  --queue-url <dlq-url> \
  --attribute-names ApproximateNumberOfMessages \
  --region <region>

# 3. Sample a DLQ message to see the failure reason
aws sqs receive-message \
  --queue-url <dlq-url> \
  --max-number-of-messages 1 \
  --region <region>

# 4. Check queue processor logs
kubectl logs -n gco-system deployment/sqs-consumer --tail=100

Resolution:

If messages are malformed YAML: The DLQ message body contains the original manifest. Fix the YAML and resubmit via gco jobs submit-sqs.

If the queue processor is crashing: Check pod status and restart:

kubectl get pods -n gco-system -l app=sqs-consumer
kubectl rollout restart deployment/sqs-consumer -n gco-system

If messages are valid but failing validation: Check resource limits in cdk.json under manifest_processor. The job may exceed CPU/memory/GPU limits.

To replay DLQ messages (after fixing the root cause):

# Move messages from DLQ back to main queue
aws sqs start-message-move-task \
  --source-arn <dlq-arn> \
  --destination-arn <main-queue-arn> \
  --region <region>

Prevention: The monitoring stack deploys a CloudWatch alarm on ApproximateNumberOfMessagesVisible for the DLQ. If the alarm fires, messages are accumulating — follow the diagnosis steps above.
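After starting a message move task, it helps to watch the DLQ drain rather than assume the replay worked. The sketch below polls the same ApproximateNumberOfMessages attribute used in the diagnosis steps. The function names, the 10-minute default timeout, and the 15-second poll interval are assumptions for illustration.

```bash
# dlq_depth QUEUE_URL REGION
# Prints the approximate number of messages currently in the queue.
dlq_depth() {
  aws sqs get-queue-attributes --queue-url "$1" --region "$2" \
    --attribute-names ApproximateNumberOfMessages \
    --query 'Attributes.ApproximateNumberOfMessages' --output text
}

# wait_for_dlq_drain QUEUE_URL REGION [TIMEOUT_SECONDS] [POLL_SECONDS]
# Returns 0 once the DLQ reports zero messages, 1 on timeout, 2 on API failure.
wait_for_dlq_drain() {
  local url="$1" region="$2" timeout="${3:-600}" interval="${4:-15}"
  local waited=0 depth
  while :; do
    depth=$(dlq_depth "$url" "$region") || return 2
    echo "DLQ depth: ${depth}"
    [ "$depth" -eq 0 ] && return 0
    [ "$waited" -ge "$timeout" ] && return 1
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

If the depth starts climbing again after a replay, the root cause is not fixed and the replayed messages are failing a second time.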

Manifest Processor Rejecting Valid Jobs

Symptoms: Job submissions return validation errors even though the manifest looks correct. Common errors: "CPU exceeds max", "Namespace not allowed", "Untrusted image source".

Diagnosis:

# 1. Dry-run the manifest to see the exact error
gco jobs submit my-job.yaml -n gco-jobs --dry-run

# 2. Check current resource limits
gco config-cmd show | grep -A5 resource_quotas

# 3. Check allowed namespaces
gco config-cmd show | grep -A5 allowed_namespaces

Resolution:

Resource limit exceeded: Update the resource limits in cdk.json (the max_*_per_manifest keys under manifest_processor) and redeploy:

"manifest_processor": {
  "max_cpu_per_manifest": "32",
  "max_memory_per_manifest": "128Gi",
  "max_gpu_per_manifest": 8
}

Then: gco stacks deploy -r <region> -y

Namespace not allowed: Add the namespace to allowed_namespaces in cdk.json:

"manifest_processor": {
  "allowed_namespaces": ["default", "gco-jobs", "my-namespace"]
}

Untrusted image source: Add the registry to trusted_registries in cdk.json:

"manifest_processor": {
  "trusted_registries": ["docker.io", "gcr.io", "my-registry.example.com"]
}
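Before editing trusted_registries, it can help to see which images in a manifest would fall outside the current list. The sketch below is a local approximation only. It assumes the manifest processor compares the registry host of each image against the list, which this runbook does not confirm. It also applies the Docker convention that an image without an explicit registry host comes from docker.io. The function name list_untrusted_images is hypothetical.

```bash
# list_untrusted_images MANIFEST REGISTRY...
# Prints every image in MANIFEST whose registry host is not one of REGISTRY...
# Matching is exact, so a host with a port (registry.example.com:5000) must be
# listed with that port.
list_untrusted_images() {
  local manifest="$1"; shift
  local image host trusted ok
  # Extract the value of each "image:" key, dropping quotes and trailing comments.
  sed -n 's/^[[:space:]-]*image:[[:space:]]*\([^[:space:]#]*\).*$/\1/p' "$manifest" |
    tr -d "\"'" |
    while IFS= read -r image; do
      host="${image%%/*}"
      case "$image" in
        */*)
          case "$host" in
            *.*|*:*|localhost) ;;       # explicit registry host
            *) host="docker.io" ;;      # e.g. library/nginx
          esac ;;
        *) host="docker.io" ;;          # e.g. nginx:1.25
      esac
      ok=0
      for trusted in "$@"; do
        [ "$host" = "$trusted" ] && ok=1
      done
      [ "$ok" -eq 1 ] || printf '%s\n' "$image"
    done
}
```

For example, list_untrusted_images my-job.yaml docker.io gcr.io prints nothing when every image comes from one of those two registries.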

High API Gateway Latency

Symptoms: API requests take >5 seconds. CloudWatch shows elevated Latency metric on the API Gateway. Users report slow gco jobs submit commands.

Diagnosis:

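# 1. Check API Gateway latency metrics
# (date -d is GNU syntax; on macOS use: date -u -v-1H +%Y-%m-%dT%H:%M:%S)
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApiGateway \
  --metric-name Latency \
  --dimensions Name=ApiName,Value=gco-global \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 --extended-statistics p99
# For the mean, rerun with --statistics Average in place of --extended-statistics p99;
# a single call accepts one or the other, not both.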

# 2. Check X-Ray traces (if tracing is enabled)
# Open X-Ray console → Traces → filter by service "gco"

# 3. Check Lambda cold starts
aws logs filter-log-events \
  --log-group-name /aws/lambda/gco-api-proxy \
  --filter-pattern "INIT_START"

Resolution:

Lambda cold starts: The proxy Lambda has a 29s timeout. Cold starts add 1-3s. If cold starts are frequent, consider provisioned concurrency. It must target a published version or alias, not $LATEST:

aws lambda put-provisioned-concurrency-config \
  --function-name gco-api-proxy \
  --qualifier <version-or-alias> \
  --provisioned-concurrent-executions 5

Global Accelerator routing latency: Check if traffic is being routed to the nearest region. Use traceroute to the GA endpoint to verify.
ALB target response time: Check the ALB TargetResponseTime metric. If the manifest processor is slow, check pod resource utilization and consider scaling replicas.

EKS Cluster Unreachable

Symptoms: kubectl commands fail with connection errors. gco stacks list shows the cluster but gco jobs list -r <region> fails.

Diagnosis:

# 1. Check cluster status
aws eks describe-cluster --name gco-<region> --region <region> \
  --query 'cluster.{Status:status,Endpoint:endpoint,Access:resourcesVpcConfig.endpointPublicAccess}'

# 2. Check if your kubeconfig is current
gco stacks access -r <region>

# 3. Check VPC connectivity (if endpoint is private)
aws ec2 describe-vpc-endpoints \
  --filters Name=vpc-id,Values=<vpc-id> \
  --region <region>

Resolution:

If cluster status is not ACTIVE: Wait for the cluster to finish updating. EKS updates can take 10-20 minutes.
If kubeconfig is stale: Refresh it:
```
gco stacks access -r <region>
```
If endpoint access mode is PRIVATE: You need to be on the VPC or use the regional API Gateway:
```
gco jobs list -r <region> --regional-api
```
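The choice between refreshing the kubeconfig and using the regional API depends on the endpoint access mode, which can be read from the cluster description. This is a sketch under the same gco-<region> naming assumption used above; eks_endpoint_access is a hypothetical helper name.

```bash
# eks_endpoint_access REGION
# Prints which access path applies to the cluster's API endpoint.
# Returns 2 if the cluster cannot be described or the value is unexpected.
eks_endpoint_access() {
  local region="$1" public
  public=$(aws eks describe-cluster --name "gco-${region}" --region "$region" \
    --query 'cluster.resourcesVpcConfig.endpointPublicAccess' --output text) || return 2
  # Normalize case so both True/False and true/false are handled.
  case "$(printf '%s' "$public" | tr '[:upper:]' '[:lower:]')" in
    true)  echo "public: refresh kubeconfig with: gco stacks access -r ${region}" ;;
    false) echo "private: connect from the VPC or use: gco jobs list -r ${region} --regional-api" ;;
    *)     echo "unknown endpoint access value: ${public}" >&2; return 2 ;;
  esac
}
```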

Inference Endpoint Not Serving Traffic

Symptoms: gco inference status <name> shows the endpoint but requests fail. Health checks return errors.

Diagnosis:

# 1. Check endpoint status
gco inference status <name>

# 2. Check pod health
gco inference health <name>

# 3. Check pod logs
kubectl logs -n gco-inference deployment/<name> --tail=50

# 4. Check if the model loaded successfully
gco inference models <name>

Resolution:

If pods are in CrashLoopBackOff: Check logs for OOM errors or model loading failures. Increase memory/GPU resources.
If pods are running but not ready: The readiness probe may be failing. Check if the model finished loading (large models can take 5-10 minutes).
If the service is unreachable: Check the Kubernetes Service and Ingress:
```
kubectl get svc,ingress -n gco-inference
```
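For the CrashLoopBackOff case, the last termination reason recorded by Kubernetes separates out-of-memory kills from model loading failures without reading through logs. This is a minimal sketch, assuming inference pods run in the gco-inference namespace as in the commands above; oom_killed_pods is a hypothetical helper name.

```bash
# oom_killed_pods [NAMESPACE]
# Prints the name of every pod with a container whose last termination
# reason was OOMKilled.
oom_killed_pods() {
  local ns="${1:-gco-inference}"
  kubectl get pods -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' |
    awk -F '\t' '$2 ~ /OOMKilled/ { print $1 }'
}
```

Pods listed here need a higher memory limit. Pods that crash with a different reason are more likely failing during model load, so check their logs.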

Cost Spike Detection

Symptoms: gco costs summary shows an unexpected increase. AWS Cost Explorer shows higher-than-expected charges.

Diagnosis:

# 1. Check cost breakdown by region
gco costs regions

# 2. Check cost trend
gco costs trend --days 14

# 3. Check for forgotten inference endpoints
gco inference list

# 4. Check for stuck jobs consuming GPU resources
gco jobs list --all-regions --status running

# 5. Check node pool sizes
gco nodepools list -r <region>

Resolution:

Forgotten inference endpoints: Stop or delete unused endpoints:
```
gco inference stop <name>
gco inference delete <name>
```

Stuck jobs: Delete completed/failed jobs that are still holding resources:

gco jobs bulk-delete --status completed --older-than-days 7 --all-regions --execute -y
gco jobs bulk-delete --status failed --older-than-days 3 --all-regions --execute -y

Unexpected node scaling: EKS Auto Mode scales nodes based on pending pods. Check if there are pods stuck in Pending that are triggering unnecessary scaling.

For ongoing monitoring: Set up AWS Budgets with alerts:

aws budgets create-budget \
  --account-id <account-id> \
  --budget file://budget.json \
  --notifications-with-subscribers file://notifications.json
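The create-budget command reads two JSON files. The sketch below writes a minimal pair: a monthly cost budget and an email alert at 80% of actual spend. The budget name gco-monthly, the 80% threshold, and the helper name write_budget_files are placeholders to adapt, not values taken from the GCO configuration.

```bash
# write_budget_files DIR AMOUNT_USD EMAIL
# Writes budget.json and notifications.json into DIR for use with
# aws budgets create-budget.
write_budget_files() {
  local dir="$1" amount="$2" email="$3"
  cat > "${dir}/budget.json" <<EOF
{
  "BudgetName": "gco-monthly",
  "BudgetLimit": { "Amount": "${amount}", "Unit": "USD" },
  "TimeUnit": "MONTHLY",
  "BudgetType": "COST"
}
EOF
  cat > "${dir}/notifications.json" <<EOF
[
  {
    "Notification": {
      "NotificationType": "ACTUAL",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 80,
      "ThresholdType": "PERCENTAGE"
    },
    "Subscribers": [
      { "SubscriptionType": "EMAIL", "Address": "${email}" }
    ]
  }
]
EOF
}
```

For example, write_budget_files . 1000 ops@example.com creates both files in the current directory, ready for the create-budget command above.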
