Core Concepts

This guide explains the fundamental concepts behind GCO (Global Capacity Orchestrator on AWS). Read this before diving into the technical documentation.

Table of Contents

What is GCO?
The Problem It Solves
Key Concepts
Inference Serving
- How Inference Works
- Supported Frameworks
Storage Options
- EFS (Elastic File System)
- FSx for Lustre
Security Model
How Components Work Together
Common Workflows

What is GCO?

GCO is a multi-region Kubernetes platform built on Amazon EKS Auto Mode, designed specifically for AI/ML workloads that need GPU compute. It provides:

A single API endpoint that routes jobs to the best available region
Automatic GPU node provisioning (no manual scaling)
Inference endpoint management across regions with a single command
Shared storage for job outputs that persists after pods terminate
Production-ready security with IAM authentication

Think of it as a "GPU job submission service" — you submit a Kubernetes manifest, and GCO handles finding capacity, provisioning nodes, running your job, and storing outputs. For inference, you deploy an endpoint once and GCO serves it globally with automatic failover.

The Problem It Solves

Running GPU workloads at scale on Kubernetes is hard:

Challenge	Without GCO	With GCO
GPU availability	Manually check each region	Auto-routes to available capacity
Node provisioning	Pre-provision or wait for scaling	EKS Auto Mode provisions on-demand
Multi-region	Manage multiple clusters separately	Single API, automatic routing
Authentication	Configure per-cluster access	IAM-based, works with existing AWS credentials
Job outputs	Lost when pods terminate	Persisted to EFS/FSx storage
Inference serving	Deploy and manage per-region	Deploy once, serve globally with auto-failover
Failover	Manual intervention	Automatic via Global Accelerator

Key Concepts

Multi-Region Architecture

GCO deploys identical infrastructure to multiple AWS regions:

                    ┌─────────────────────┐
                    │   Global Endpoint   │
                    │  (API Gateway +     │
                    │  Global Accelerator)│
                    └──────────┬──────────┘
                               │
           ┌───────────────────┼───────────────────┐
           │                   │                   │
    ┌──────▼──────┐     ┌──────▼──────┐     ┌──────▼──────┐
    │  us-east-1  │     │  us-west-2  │     │  eu-west-1  │
    │  EKS + ALB  │     │  EKS + ALB  │     │  EKS + ALB  │
    └─────────────┘     └─────────────┘     └─────────────┘

Each region is independent - if one region has issues, traffic automatically routes to healthy regions.

EKS Auto Mode

EKS Auto Mode is AWS's fully managed Kubernetes compute. Unlike traditional EKS where you manage node groups, Auto Mode:

Automatically provisions nodes when pods are pending
Scales to zero when no workloads are running (cost savings)
Handles node updates and security patches
Supports GPU instances via nodepools

You don't manage EC2 instances directly - you define what you need (CPU, memory, GPU), and EKS Auto Mode handles the rest.

Nodepools

Nodepools define what types of nodes can be provisioned. GCO creates several:

Nodepool	Purpose	Instance Types
`system`	Kubernetes system components	Managed by EKS
`general-purpose`	Standard workloads	Various CPU instances
`gpu-x86`	NVIDIA GPU workloads	g4dn, g5 (T4, A10G GPUs)
`gpu-arm`	ARM64 GPU workloads	g5g (T4G GPUs)
`inference`	Long-running inference endpoints	Same as gpu-x86, WhenEmpty consolidation
`gpu-efa-pool`	Distributed training and high-performance inference	p4d, p5 (A100, H100 with EFA)

When you submit a job requesting a GPU, EKS Auto Mode finds the right nodepool and provisions an appropriate instance.

Manifest Submission

A "manifest" is a Kubernetes YAML file describing your workload. GCO accepts manifests via:

SQS Queue (recommended for production) - Reliable, region-targeted submission
API Gateway - IAM-authenticated REST API with global routing
DynamoDB Job Queue - Centralized global queue with priority, status tracking, and audit trail
Direct kubectl - Requires cluster access, good for development

Example manifest:

apiVersion: batch/v1
kind: Job
metadata:
  name: my-training-job
  namespace: gco-jobs
spec:
  template:
    spec:
      containers:
      - name: trainer
        image: my-training-image:latest
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: Never
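
Kubernetes also accepts manifests written as JSON, so the same Job can be generated from code. The sketch below rebuilds the example manifest as a Python dictionary. The helper name `build_gpu_job` is hypothetical and not part of GCO, and this guide does not say whether GCO's submission paths accept JSON, so treat the output as input for kubectl or for conversion to YAML.

```python
import json


def build_gpu_job(name: str, image: str, gpus: int = 1) -> dict:
    """Build a batch/v1 Job equivalent to the YAML example above.

    Hypothetical helper for illustration; not part of the GCO CLI.
    """
    if gpus < 1:
        raise ValueError("a GPU job must request at least one GPU")
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "namespace": "gco-jobs"},
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": "trainer",
                            "image": image,
                            # The GPU count goes under limits, as in the YAML example.
                            "resources": {"limits": {"nvidia.com/gpu": gpus}},
                        }
                    ],
                    "restartPolicy": "Never",
                }
            }
        },
    }


if __name__ == "__main__":
    job = build_gpu_job("my-training-job", "my-training-image:latest")
    print(json.dumps(job, indent=2))
```

Generating manifests this way can help when many jobs differ only in name, image, or GPU count.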

Global Routing

AWS Global Accelerator provides a single endpoint that routes to the nearest healthy region:

1. User submits job to global endpoint
2. Global Accelerator checks health of each region
3. Routes to nearest healthy region (lowest latency)
4. If a region fails, automatically routes to next-best region

This happens transparently - you always use the same endpoint.
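
The routing decision above can be modelled in a few lines. This is only an illustrative sketch, not how Global Accelerator is implemented: the `RegionEndpoint` type, the `pick_region` function, and the latency figures are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RegionEndpoint:
    """Hypothetical view of one regional endpoint as seen by the router."""
    name: str
    latency_ms: float
    healthy: bool


def pick_region(endpoints: list[RegionEndpoint]) -> Optional[str]:
    """Return the healthy region with the lowest latency, or None if all are down."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda e: e.latency_ms).name


if __name__ == "__main__":
    endpoints = [
        RegionEndpoint("us-east-1", 20.0, healthy=False),  # nearest, but failing
        RegionEndpoint("us-west-2", 70.0, healthy=True),
        RegionEndpoint("eu-west-1", 95.0, healthy=True),
    ]
    print(pick_region(endpoints))  # us-west-2
```

Because unhealthy regions are filtered out before the latency comparison, failover needs no separate code path: the next-best region wins automatically.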

Inference Serving

Beyond batch GPU jobs, GCO supports long-running inference endpoints — deploy a model once and serve it globally across regions. The platform handles reconciliation, model weight syncing, routing, and scaling.

How Inference Works

Inference serving uses a reconciliation pattern similar to Kubernetes controllers:

1. You run `gco inference deploy my-llm -i vllm/vllm-openai:v0.20.1 --gpu-count 1`
2. The CLI writes the endpoint spec to a DynamoDB table (desired state)
3. An `inference_monitor` service running in each target region polls the table
4. The monitor creates Kubernetes Deployments, Services, and Ingress rules to match the desired state
5. If anything drifts (pod deleted, resource missing), the monitor self-heals by recreating it
6. Global Accelerator routes user requests to the nearest healthy region

gco inference deploy
        │
        ▼
  DynamoDB (desired state)
        │
        ▼ (each region polls)
  ┌──────────┐    ┌──────────┐
  │us-east-1 │    │eu-west-1 │
  │Deployment│    │Deployment│
  │Service   │    │Service   │
  │Ingress   │    │Ingress   │
  └────┬─────┘    └────┬─────┘
       └──────┬────────┘
              ▼
     Global Accelerator
              │
         End Users
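
The reconciliation pattern described above boils down to comparing desired state with actual state and acting on the difference. The sketch below is a simplified model, not the real `inference_monitor`: the function name `plan_actions` and the data shapes are hypothetical, and deleting resources for endpoints no longer in the table is an assumption this guide does not confirm.

```python
RESOURCE_KINDS = ("Deployment", "Service", "Ingress")


def plan_actions(
    desired: dict[str, dict],
    actual: set[tuple[str, str]],
) -> list[tuple[str, str, str]]:
    """Compare desired endpoints with existing resources.

    desired maps endpoint name -> spec (as read from the state table).
    actual holds (kind, name) pairs that currently exist in the cluster.
    Returns (action, kind, name) steps that close the gap.
    """
    actions = []
    # Create anything a desired endpoint is missing (this is the self-healing step).
    for name in sorted(desired):
        for kind in RESOURCE_KINDS:
            if (kind, name) not in actual:
                actions.append(("create", kind, name))
    # Assumption: resources with no desired endpoint are cleaned up.
    for kind, name in sorted(actual):
        if name not in desired:
            actions.append(("delete", kind, name))
    return actions


if __name__ == "__main__":
    desired = {"my-llm": {"image": "vllm/vllm-openai:v0.20.1", "gpu_count": 1}}
    actual = {("Deployment", "my-llm")}  # Service and Ingress have drifted away
    for step in plan_actions(desired, actual):
        print(step)
```

Running the same plan again once the cluster matches the table returns an empty list, which is what makes a polling loop safe to repeat.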

All inference endpoints share the regional ALB used by the main GCO services. Using one ALB per region keeps costs down, and that ALB is already registered with Global Accelerator.

Key capabilities:

Deploy to one or all regions with a single command
Rolling updates, canary deployments (A/B testing), stop/start
Automatic model weight sync from S3 via init containers
Autoscaling via HPA (CPU/memory metrics)
Spot instance support for significant cost savings

Supported Frameworks

GCO works with any containerized inference server. These have example manifests in examples/:

Framework	Use Case	Example
vLLM	OpenAI-compatible LLM serving	`inference-vllm.yaml`
SGLang	High-throughput serving with RadixAttention	`inference-sglang.yaml`
TGI	HuggingFace optimized inference	`inference-tgi.yaml`
Triton	Multi-framework model serving	`inference-triton.yaml`
TorchServe	PyTorch native serving	`inference-torchserve.yaml`

See Inference Guide for the full deep dive including model weight management, canary deployments, and production EFA setup.

Storage Options

EFS (Elastic File System)

EFS is a shared file system accessible by all pods in a cluster. Use it for:

Job outputs that need to persist after pod termination
Sharing data between pods
Checkpoint storage for training jobs

volumes:
- name: shared-storage
  persistentVolumeClaim:
    claimName: gco-shared-storage

Characteristics:

Elastic (grows/shrinks automatically)
Lower throughput than FSx
Pay only for what you use
Good for general-purpose storage

FSx for Lustre

FSx for Lustre is a high-performance parallel file system. Use it for:

Large dataset training (high throughput needed)
Distributed training across multiple nodes
Workloads with heavy I/O requirements

volumes:
- name: fsx-storage
  persistentVolumeClaim:
    claimName: gco-fsx-storage

Characteristics:

Very high throughput (hundreds of GB/s possible)
Fixed capacity (must pre-provision)
Higher cost than EFS
Best for ML training workloads

When to use which:

Use Case	Recommended
Job logs and small outputs	EFS
Model checkpoints	EFS
Large dataset training	FSx for Lustre
Distributed training	FSx for Lustre
Cost-sensitive workloads	EFS
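
A volume is only visible inside a container once it is also mounted. The sketch below pairs each claim from the examples above with a matching mount entry. The helper name `storage_spec` and the `/outputs` mount path are hypothetical. Only the volume names and claim names come from this guide.

```python
import json

# Volume and claim names taken from the EFS and FSx examples above.
STORAGE = {
    "efs": ("shared-storage", "gco-shared-storage"),
    "fsx": ("fsx-storage", "gco-fsx-storage"),
}


def storage_spec(kind: str, mount_path: str) -> dict:
    """Return the pod-level volume and the container-level mount for one backend."""
    if kind not in STORAGE:
        raise ValueError(f"unknown storage kind: {kind!r}")
    volume_name, claim_name = STORAGE[kind]
    return {
        # Goes under the pod spec.
        "volumes": [
            {"name": volume_name, "persistentVolumeClaim": {"claimName": claim_name}}
        ],
        # Goes under each container that needs the storage.
        "volumeMounts": [{"name": volume_name, "mountPath": mount_path}],
    }


if __name__ == "__main__":
    # "/outputs" is a placeholder path chosen for this example.
    print(json.dumps(storage_spec("efs", "/outputs"), indent=2))
```

The volume name must match in both entries. A mismatch is a common reason for a pod being rejected at admission.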

Security Model

GCO uses multiple security layers:

┌─────────────────────────────────────────────────────────┐
│ Layer 1: IAM Authentication                             │
│ - API Gateway validates AWS credentials (SigV4)         │
│ - Users need execute-api:Invoke permission              │
└─────────────────────────────────────────────────────────┘
                          │
┌─────────────────────────▼───────────────────────────────┐
│ Layer 2: Secret Header                                  │
│ - Lambda adds secret token from Secrets Manager         │
│ - Token rotates daily (zero-downtime)                   │
└─────────────────────────────────────────────────────────┘
                          │
┌─────────────────────────▼───────────────────────────────┐
│ Layer 3: Network Isolation                              │
│ - ALBs only accept Global Accelerator IPs               │
│ - EKS runs in private subnets                           │
└─────────────────────────────────────────────────────────┘
                          │
┌─────────────────────────▼───────────────────────────────┐
│ Layer 4: Kubernetes RBAC                                │
│ - Service accounts with least-privilege                 │
│ - Namespace isolation for user jobs                     │
└─────────────────────────────────────────────────────────┘

Key points:

All requests must be signed with AWS credentials
No anonymous access
Jobs run in the isolated gco-jobs namespace
Platform services run in the gco-system namespace
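
To give a feel for what "signed with AWS credentials" means in Layer 1, the sketch below shows the signing-key derivation step of Signature Version 4 using only the standard library. It is deliberately partial: a full signature also needs a canonical request and a string to sign, which are omitted here. The function name is hypothetical and the secret is a placeholder. In practice the GCO CLI or an AWS SDK does the signing for you.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, message: str) -> bytes:
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(
    secret_key: str,
    date_stamp: str,  # YYYYMMDD, in UTC
    region: str,
    service: str = "execute-api",  # SigV4 service name for API Gateway calls
) -> bytes:
    """Derive the SigV4 signing key via the chained-HMAC scheme."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


if __name__ == "__main__":
    # Placeholder secret; never hard-code real credentials.
    key = derive_signing_key("EXAMPLE-SECRET-KEY", "20250101", "us-east-1")
    print(key.hex())
```

Because the date, region, and service are mixed into the key, a signature produced for one region or day cannot be replayed against another.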

API Access Modes

GCO supports two API access modes:

Global API (Default)

The default mode routes all requests through a global API Gateway and Global Accelerator:

User → Global API Gateway → Lambda proxy → Global Accelerator → Regional ALB → EKS

Pros:

Single global endpoint
Automatic failover between regions
Edge caching via CloudFront

Cons:

Requires public ALB exposure
Traffic routes through Global Accelerator

Regional API (Private Access)

When public access is disabled, regional API Gateways provide direct access via VPC Lambdas:

User → Regional API Gateway → VPC Lambda → Internal ALB → EKS

Pros:

No public ALB exposure
Direct regional access
Maximum security posture

Cons:

Must specify target region
No automatic cross-region failover

Enable Regional APIs in cdk.json:

{
  "api_gateway": {
    "regional_api_enabled": true
  }
}

Use Regional APIs:

# CLI flag
gco --regional-api jobs list --region us-east-1

# Or environment variable
export GCO_REGIONAL_API=true
gco jobs list --region us-east-1

How Components Work Together

Here's what happens when you submit a job:

1. You run: gco jobs submit my-job.yaml

2. CLI signs request with your AWS credentials (SigV4)
   └─► API Gateway validates your IAM permissions

3. Lambda proxy adds secret header
   └─► Ensures request came through authenticated path

4. Global Accelerator routes to nearest healthy region
   └─► Checks ALB health in each region

5. Regional ALB receives request
   └─► Validates it came from Global Accelerator

6. Manifest Processor pod processes the job
   └─► Validates YAML, applies to Kubernetes

7. Kubernetes scheduler sees pending pod
   └─► Finds appropriate nodepool

8. EKS Auto Mode provisions node (if needed)
   └─► Launches EC2 instance matching requirements

9. Pod runs on provisioned node
   └─► Your job executes

10. Job completes, outputs saved to EFS/FSx
    └─► Data persists after pod terminates
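
Step 6 is where a manifest is checked before it reaches Kubernetes. The rules the Manifest Processor actually enforces are not listed in this guide, so the sketch below is purely hypothetical: it shows the general shape of such a check (collect every problem, return them all) with three invented rules based on the example manifest.

```python
# Hypothetical rules for illustration; the real Manifest Processor may differ.
ALLOWED_KINDS = {"Job"}
USER_NAMESPACE = "gco-jobs"


def validate_manifest(manifest: dict) -> list[str]:
    """Return a list of problems; an empty list means the manifest passed."""
    errors = []
    kind = manifest.get("kind")
    if kind not in ALLOWED_KINDS:
        errors.append(f"unsupported kind: {kind!r}")
    namespace = manifest.get("metadata", {}).get("namespace")
    if namespace != USER_NAMESPACE:
        errors.append(f"namespace must be {USER_NAMESPACE!r}, got {namespace!r}")
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    if not pod_spec.get("containers"):
        errors.append("no containers defined")
    return errors


if __name__ == "__main__":
    bad = {"kind": "Pod", "metadata": {"namespace": "default"}}
    for problem in validate_manifest(bad):
        print(problem)
```

Returning every problem at once, rather than stopping at the first, lets a user fix a manifest in one pass.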

Common Workflows

Submit a Simple Job

# Check what regions are available
gco capacity status

# Submit to a specific region
gco jobs submit-sqs my-job.yaml --region us-east-1

# Or let GCO pick the best region
gco jobs submit-sqs my-job.yaml --auto-region

# Check job status
gco jobs list --all-regions

# Get logs
gco jobs logs my-job -n gco-jobs -r us-east-1

Run a GPU Training Job

# Check GPU capacity
gco capacity check --instance-type g5.xlarge --region us-east-1

# Submit GPU job
gco jobs submit-sqs examples/gpu-job.yaml --region us-east-1

# Monitor
gco jobs list -r us-east-1 -n gco-jobs

Save Job Outputs

# Submit job that writes to EFS
gco jobs submit-direct examples/efs-output-job.yaml -r us-east-1

# Wait for completion
gco jobs list -r us-east-1 -n gco-jobs

# Download outputs (works even after pod is deleted)
gco files download my-job-outputs ./local-dir -r us-east-1

Submit via Global Job Queue

Use the DynamoDB-backed queue when you need centralized tracking and status history:

# Submit to queue targeting a region
gco queue submit my-job.yaml --region us-east-1

# Track status
gco queue list --status running
gco queue get <job-id>

# View queue statistics
gco queue stats

Deploy an Inference Endpoint

GCO supports long-running inference endpoints across regions with automatic reconciliation:

# Deploy a vLLM endpoint
gco inference deploy my-llm -i vllm/vllm-openai:v0.20.1 --gpu-count 1

# Check status
gco inference status my-llm

# Send a prompt
gco inference invoke my-llm -p "Hello, world"

See Inference Guide for the full guide including model weight management and canary deployments.

Next Steps:

Quick Start Guide - Get running in under 60 minutes
Architecture Details - Deep dive into the system
CLI Reference - Complete command documentation
