Skip to content

Latest commit

 

History

History
422 lines (287 loc) · 13.3 KB

File metadata and controls

422 lines (287 loc) · 13.3 KB

Operational Runbooks

Step-by-step procedures for common operational scenarios. Each runbook includes symptoms, diagnosis steps, and resolution actions.

Prerequisites: Before running kubectl commands, set up cluster access with gco stacks access -r <region>. This configures your kubeconfig and sets the current context to the target cluster.

Table of Contents


Region Goes Unhealthy

Symptoms: gco capacity status shows a region as unhealthy. Global Accelerator stops routing traffic to the region. Cross-region aggregator returns errors for the affected region.

Diagnosis:

# 1. Check health from the CLI
gco jobs list -r <region>

# 2. Check the health endpoint directly
gco capacity status

# 3. Check CloudWatch alarms in the monitoring dashboard
# Look for: EKS CPU/memory alarms, ALB unhealthy hosts, Lambda errors

# 4. Check EKS cluster status
aws eks describe-cluster --name gco-<region> --region <region> \
  --query 'cluster.status'

# 5. Check node health (if cluster is reachable)
kubectl get nodes

Resolution:

  1. If EKS API is unreachable: Check VPC networking, security groups, and EKS control plane status in the AWS console. EKS Auto Mode manages nodes automatically — if the control plane is healthy, nodes should recover.

  2. If ALB health checks are failing: Check the health monitor and manifest processor pods:

    kubectl get pods -n gco-system
    kubectl logs -n gco-system deployment/health-monitor
  3. If nodes are NotReady: EKS Auto Mode should replace unhealthy nodes automatically. Check CloudWatch for node group scaling events. If stuck, check the NodePool configuration:

    kubectl get nodepools
  4. If the region is permanently degraded: Traffic is automatically routed to healthy regions via Global Accelerator. No immediate action required for availability, but investigate root cause.

Escalation: If the cluster is completely unreachable and not recovering after 15 minutes, check the AWS Health Dashboard for regional service issues.


Secret Rotation Fails

Symptoms: CloudWatch alarm fires for secret rotation failure. API requests start failing with 403 after the old secret expires. The secret-rotation Lambda shows errors in CloudWatch Logs.

Diagnosis:

# 1. Check the rotation Lambda logs
aws logs filter-log-events \
  --log-group-name /aws/lambda/gco-secret-rotation \
  --filter-pattern "ERROR" \
  --start-time $(date -d '1 hour ago' +%s000)

# 2. Check the secret's rotation status
aws secretsmanager describe-secret \
  --secret-id gco-auth-token \
  --query '{LastRotated: LastRotatedDate, NextRotation: NextRotationDate, Versions: VersionIdsToStages}'

# 3. Check if AWSPENDING version exists (stuck rotation)
aws secretsmanager get-secret-value \
  --secret-id gco-auth-token \
  --version-stage AWSPENDING 2>&1

Resolution:

  1. If rotation Lambda is failing: Check IAM permissions on the rotation Lambda role. It needs secretsmanager:GetSecretValue, secretsmanager:PutSecretValue, and secretsmanager:UpdateSecretVersionStage.

  2. If rotation is stuck (AWSPENDING exists but never promoted):

    # Cancel the stuck rotation
    aws secretsmanager cancel-rotate-secret --secret-id gco-auth-token
    
    # Trigger a fresh rotation
    aws secretsmanager rotate-secret --secret-id gco-auth-token
  3. If API requests are failing with 403 right now: The auth middleware caches tokens for 5 minutes. After fixing the secret, wait up to 5 minutes for caches to refresh, or restart the manifest-processor pods to force a cache clear:

    kubectl rollout restart deployment/manifest-processor -n gco-system

Prevention: The monitoring stack includes a CloudWatch alarm for rotation failures. Ensure the SNS topic has subscribers.


Global Accelerator Stops Routing to a Region

Symptoms: Traffic is not reaching a specific region even though the EKS cluster is healthy. gco capacity status shows the region as healthy but no jobs are landing there.

Diagnosis:

# 1. Check GA endpoint group health
aws globalaccelerator list-endpoint-groups \
  --listener-arn <listener-arn> \
  --query 'EndpointGroups[].{Region:EndpointGroupRegion,Health:HealthState}'

# 2. Check if the ALB is registered with GA
aws globalaccelerator list-custom-routing-endpoints \
  --endpoint-group-arn <endpoint-group-arn>

# 3. Check ALB health in the region
aws elbv2 describe-target-health \
  --target-group-arn <target-group-arn> \
  --region <region>

# 4. Check the GA registration Lambda logs
aws logs filter-log-events \
  --log-group-name /aws/lambda/gco-ga-registration-<region> \
  --filter-pattern "ERROR" \
  --region <region>

Resolution:

  1. If ALB is not registered: The GA registration Lambda runs during stack deployment. Trigger a stack update to re-register:

    gco stacks deploy -r <region> -y
  2. If ALB health checks are failing: GA health checks hit /api/v1/health on the ALB. Check that the health monitor pod is running and the ALB target group has healthy targets.

  3. If GA endpoint is unhealthy: Check the health check configuration in cdk.json under global_accelerator. The grace period and interval may need adjustment if the region takes longer to warm up.


SQS Dead Letter Queue Filling Up

Symptoms: gco queue stats shows messages in the DLQ. Jobs submitted via SQS are not being processed. The queue processor logs show repeated failures.

Diagnosis:

# 1. Check queue status
gco queue stats

# 2. Check DLQ message count
aws sqs get-queue-attributes \
  --queue-url <dlq-url> \
  --attribute-names ApproximateNumberOfMessages \
  --region <region>

# 3. Sample a DLQ message to see the failure reason
aws sqs receive-message \
  --queue-url <dlq-url> \
  --max-number-of-messages 1 \
  --region <region>

# 4. Check queue processor logs
kubectl logs -n gco-system deployment/sqs-consumer --tail=100

Resolution:

  1. If messages are malformed YAML: The DLQ message body contains the original manifest. Fix the YAML and resubmit via gco jobs submit-sqs.

  2. If the queue processor is crashing: Check pod status and restart:

    kubectl get pods -n gco-system -l app=sqs-consumer
    kubectl rollout restart deployment/sqs-consumer -n gco-system
  3. If messages are valid but failing validation: Check resource limits in cdk.json under manifest_processor. The job may exceed CPU/memory/GPU limits.

  4. To replay DLQ messages (after fixing the root cause):

    # Move messages from DLQ back to main queue
    aws sqs start-message-move-task \
      --source-arn <dlq-arn> \
      --destination-arn <main-queue-arn> \
      --region <region>

Prevention: The monitoring stack deploys a CloudWatch alarm on ApproximateNumberOfMessagesVisible for the DLQ. If the alarm fires, messages are accumulating — follow the diagnosis steps above.


Manifest Processor Rejecting Valid Jobs

Symptoms: Job submissions return validation errors even though the manifest looks correct. Common errors: "CPU exceeds max", "Namespace not allowed", "Untrusted image source".

Diagnosis:

# 1. Dry-run the manifest to see the exact error
gco jobs submit my-job.yaml -n gco-jobs --dry-run

# 2. Check current resource limits
gco config-cmd show | grep -A5 resource_quotas

# 3. Check allowed namespaces
gco config-cmd show | grep -A5 allowed_namespaces

Resolution:

  1. Resource limit exceeded: Update resource_quotas in cdk.json and redeploy:

    "manifest_processor": {
      "max_cpu_per_manifest": "32",
      "max_memory_per_manifest": "128Gi",
      "max_gpu_per_manifest": 8
    }

    Then: gco stacks deploy -r <region> -y

  2. Namespace not allowed: Add the namespace to allowed_namespaces in cdk.json:

    "manifest_processor": {
      "allowed_namespaces": ["default", "gco-jobs", "my-namespace"]
    }
  3. Untrusted image source: Add the registry to trusted_registries in cdk.json:

    "manifest_processor": {
      "trusted_registries": ["docker.io", "gcr.io", "my-registry.example.com"]
    }

High API Gateway Latency

Symptoms: API requests take >5 seconds. CloudWatch shows elevated Latency metric on the API Gateway. Users report slow gco jobs submit commands.

Diagnosis:

# 1. Check API Gateway latency metrics
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApiGateway \
  --metric-name Latency \
  --dimensions Name=ApiName,Value=gco-global \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 --statistics Average p99

# 2. Check X-Ray traces (if tracing is enabled)
# Open X-Ray console → Traces → filter by service "gco"

# 3. Check Lambda cold starts
aws logs filter-log-events \
  --log-group-name /aws/lambda/gco-api-proxy \
  --filter-pattern "INIT_START"

Resolution:

  1. Lambda cold starts: The proxy Lambda has a 29s timeout. Cold starts add 1-3s. If cold starts are frequent, consider provisioned concurrency:

    aws lambda put-provisioned-concurrency-config \
      --function-name gco-api-proxy \
      --qualifier $LATEST \
      --provisioned-concurrent-executions 5
  2. Global Accelerator routing latency: Check if traffic is being routed to the nearest region. Use traceroute to the GA endpoint to verify.

  3. ALB target response time: Check the ALB TargetResponseTime metric. If the manifest processor is slow, check pod resource utilization and consider scaling replicas.


EKS Cluster Unreachable

Symptoms: kubectl commands fail with connection errors. gco stacks list shows the cluster but gco jobs list -r <region> fails.

Diagnosis:

# 1. Check cluster status
aws eks describe-cluster --name gco-<region> --region <region> \
  --query 'cluster.{Status:status,Endpoint:endpoint,Access:resourcesVpcConfig.endpointPublicAccess}'

# 2. Check if your kubeconfig is current
gco stacks access -r <region>

# 3. Check VPC connectivity (if endpoint is private)
aws ec2 describe-vpc-endpoints \
  --filters Name=vpc-id,Values=<vpc-id> \
  --region <region>

Resolution:

  1. If cluster status is not ACTIVE: Wait for the cluster to finish updating. EKS updates can take 10-20 minutes.

  2. If kubeconfig is stale: Refresh it:

    gco stacks access -r <region>
  3. If endpoint access mode is PRIVATE: You need to be on the VPC or use the regional API Gateway:

    gco jobs list -r <region> --regional-api

Inference Endpoint Not Serving Traffic

Symptoms: gco inference status <name> shows the endpoint but requests fail. Health checks return errors.

Diagnosis:

# 1. Check endpoint status
gco inference status <name>

# 2. Check pod health
gco inference health <name>

# 3. Check pod logs
kubectl logs -n gco-inference deployment/<name> --tail=50

# 4. Check if the model loaded successfully
gco inference models <name>

Resolution:

  1. If pods are in CrashLoopBackOff: Check logs for OOM errors or model loading failures. Increase memory/GPU resources.

  2. If pods are running but not ready: The readiness probe may be failing. Check if the model finished loading (large models can take 5-10 minutes).

  3. If the service is unreachable: Check the Kubernetes Service and Ingress:

    kubectl get svc,ingress -n gco-inference

Cost Spike Detection

Symptoms: gco costs summary shows unexpected increase. AWS Cost Explorer shows higher-than-expected charges.

Diagnosis:

# 1. Check cost breakdown by region
gco costs regions

# 2. Check cost trend
gco costs trend --days 14

# 3. Check for forgotten inference endpoints
gco inference list

# 4. Check for stuck jobs consuming GPU resources
gco jobs list --all-regions --status running

# 5. Check node pool sizes
gco nodepools list -r us-east-1

Resolution:

  1. Forgotten inference endpoints: Stop or delete unused endpoints:

    gco inference stop <name>
    gco inference delete <name>
  2. Stuck jobs: Delete completed/failed jobs that are still holding resources:

    gco jobs bulk-delete --status completed --older-than-days 7 --all-regions --execute -y
    gco jobs bulk-delete --status failed --older-than-days 3 --all-regions --execute -y
  3. Unexpected node scaling: EKS Auto Mode scales nodes based on pending pods. Check if there are pods stuck in Pending that are triggering unnecessary scaling.

  4. For ongoing monitoring: Set up AWS Budgets with alerts:

    aws budgets create-budget \
      --account-id <account-id> \
      --budget file://budget.json \
      --notifications-with-subscribers file://notifications.json