Deploy and manage multi-region GPU inference endpoints with GCO (Global Capacity Orchestrator on AWS).
- Overview
- Architecture
- Model Weight Management
- Deploying Inference Endpoints
- Supported Frameworks
- Managing Endpoints
- Invoking Endpoints
- Multi-Region Deployment
- Monitoring Endpoint Status
- Example Workflows
GCO's inference serving extends the platform beyond batch GPU jobs to support long-running inference endpoints. You define an endpoint once, and GCO deploys it across your target regions with automatic reconciliation, model weight syncing, and Global Accelerator routing.
Key capabilities:
- Deploy inference endpoints to one or more regions with a single command
- Automatic model weight sync from S3 to each region via init containers
- DynamoDB-backed desired state with continuous reconciliation
- Rolling updates, scaling, stop/start without losing configuration
- Global Accelerator routing to the nearest healthy region
- Support for vLLM, TGI, Triton, TorchServe, and SGLang out of the box
Inference serving uses a reconciliation pattern similar to Kubernetes controllers:
User → gco inference deploy
         │
         ▼
  DynamoDB (desired state)
         │
         ▼  (each region's inference_monitor polls)
┌────────────────┐   ┌────────────────┐
│   us-east-1    │   │   eu-west-1    │
│  ┌──────────┐  │   │  ┌──────────┐  │
│  │ init:    │  │   │  │ init:    │  │
│  │ S3 sync  │  │   │  │ S3 sync  │  │
│  └────┬─────┘  │   │  └────┬─────┘  │
│  ┌────▼─────┐  │   │  ┌────▼─────┐  │
│  │ Inference│  │   │  │ Inference│  │
│  │ (GPU)    │  │   │  │ (GPU)    │  │
│  └────┬─────┘  │   │  └────┬─────┘  │
│  ┌────▼─────┐  │   │  ┌────▼─────┐  │
│  │ Service  │  │   │  │ Service  │  │
│  └────┬─────┘  │   │  └────┬─────┘  │
│  ┌────▼─────┐  │   │  ┌────▼─────┐  │
│  │ Ingress  │  │   │  │ Ingress  │  │
│  │ (ALB)    │  │   │  │ (ALB)    │  │
│  └────┬─────┘  │   │  └────┬─────┘  │
└───────┼────────┘   └───────┼────────┘
        │                    │
        └──────────┬─────────┘
                   ▼
          Global Accelerator
     (anycast IPs, health routing)
                   │
                   ▼
               End Users
        (nearest healthy region)
- gco inference deploy writes the endpoint spec to a DynamoDB table (gco-inference-endpoints)
- The inference_monitor service running in each target region polls the table every 15 seconds
- For each endpoint targeting its region, the monitor reconciles desired state with actual K8s resources:
  - Creates/updates Deployments, Services, and Ingress rules
  - Recreates any missing resources (self-healing)
  - Purges fully-deleted endpoints from DynamoDB automatically
All inference endpoints share the same ALB as the main GCO services via EKS Auto Mode's IngressClassParams with group.name: gco. This means:
- One ALB per region (cost-efficient, registered with Global Accelerator)
- Inference requests route through the same ALB as job management APIs
- URL rewrite transforms strip the /inference/{name} prefix before forwarding to the pod
- The inference_monitor creates an ExternalName proxy Service in gco-system so the Ingress can reference the inference pod Service across namespaces
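To make the prefix rewrite concrete, here is a minimal Python sketch of the path transformation described above. The function name `strip_inference_prefix` is hypothetical and only illustrates the behavior; the actual rewrite is performed by the ALB, not by code like this.

```python
def strip_inference_prefix(path: str, endpoint_name: str) -> str:
    """Return the path the pod would see once /inference/{name} is removed."""
    prefix = f"/inference/{endpoint_name}"
    if path == prefix:
        return "/"
    if path.startswith(prefix + "/"):
        return path[len(prefix):]
    # Not addressed to this endpoint (for example a longer name); leave untouched.
    return path


print(strip_inference_prefix("/inference/my-llm/v1/completions", "my-llm"))
```

Matching on `prefix + "/"` rather than on the bare prefix keeps an endpoint named my-llm from capturing requests meant for my-llm-2.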
The health monitor periodically verifies the ALB hostname stored in SSM matches the actual ALB from the Kubernetes Ingress status. If the ALB changes (e.g., due to cluster recreation or IngressClassParams updates), SSM is updated automatically so the cross-region aggregator and API Gateway proxy continue routing correctly.
During reconciliation, the monitor also:
- Adds an init container that syncs model weights from S3 to local EFS, if model_source is set
- Reports per-region status (replicas ready, errors) back to DynamoDB
- Applies state transitions (deploying → running → stopped → deleted), which are driven by the CLI and reconciled by the monitor
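The reconciliation behavior described in this section can be sketched as a pure planning function that compares desired state with what exists in one region. This is an illustrative sketch, not the inference_monitor's code: `EndpointSpec`, `plan_actions`, and the action tuples are hypothetical names, and the real monitor reads DynamoDB and calls the Kubernetes API.

```python
from dataclasses import dataclass


# Hypothetical model of one endpoint record; field names are illustrative.
@dataclass(frozen=True)
class EndpointSpec:
    name: str
    state: str       # "deploying", "running", "stopped", or "deleted"
    replicas: int
    regions: tuple


RESOURCE_KINDS = ("Deployment", "Service", "Ingress")


def plan_actions(spec, region, existing):
    """Return the actions one reconcile pass would take in `region`.

    `existing` maps resource kind -> current replica count (None for kinds
    without replicas); a missing key means the resource is absent.
    """
    if region not in spec.regions:
        return []
    if spec.state == "deleted":
        return [("delete", kind) for kind in RESOURCE_KINDS if kind in existing]
    # Self-healing: anything missing is recreated.
    actions = [("create", kind) for kind in RESOURCE_KINDS if kind not in existing]
    desired = 0 if spec.state == "stopped" else spec.replicas
    if "Deployment" in existing and existing["Deployment"] != desired:
        actions.append(("scale", "Deployment", desired))
    return actions


spec = EndpointSpec("my-llm", "running", 2, ("us-east-1",))
print(plan_actions(spec, "us-east-1", {"Service": None, "Ingress": None}))
```

Keeping the planning step free of side effects makes the loop easy to test: each poll recomputes the full set of actions, so a missed poll or a manually deleted resource is corrected on the next pass.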
Inference workloads use a Karpenter NodePool with WhenEmpty consolidation policy. Unlike batch job NodePools that aggressively consolidate underutilized nodes, inference nodes are only removed when completely empty. This prevents disruption to long-running serving pods.
GCO provides a central S3 bucket (KMS-encrypted) for storing model weights. Models uploaded here are automatically available to inference endpoints across all regions.
# Upload a directory of model files
gco models upload ./my-model-weights/ --name llama3-8b
# Upload a single file
gco models upload ./weights.safetensors --name my-model

gco models list

Output:
Models (2 found)
----------------------------------------------------------------------
NAME FILES SIZE (GB) S3 URI
----------------------------------------------------------------------
llama3-8b 12 14.96 s3://gco-models-xxx/models/llama3-8b
my-model 1 0.50 s3://gco-models-xxx/models/my-model
# Get the S3 URI for use with --model-source
gco models uri llama3-8b
# Output: s3://gco-models-xxx/models/llama3-8b

gco models delete llama3-8b -y

When you deploy an endpoint with --model-source, the inference_monitor adds an init container to the Deployment that:
- Runs before the inference container starts
- Uses aws s3 sync to download model weights from S3 to a shared EFS volume
- Mounts the EFS volume at the model path inside the inference container
This happens automatically in every target region, so model weights are always local to the cluster.
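As a rough illustration of what such an init container could look like, the sketch below builds a Kubernetes container spec as a plain Python dict. Everything except the aws s3 sync command is an assumption: the function name, the container name `model-sync`, the `amazon/aws-cli` image, the `/models` mount path, and the volume name are placeholders, not values confirmed by GCO.

```python
def build_model_sync_init_container(
    model_source,
    model_path="/models",            # assumed mount path
    image="amazon/aws-cli:latest",   # assumed image that provides the aws CLI
    volume_name="model-store",       # assumed name of the EFS-backed volume
):
    """Build an init container spec that copies weights from S3 to the shared volume."""
    if not model_source.startswith("s3://"):
        raise ValueError("model_source must be an S3 URI")
    return {
        "name": "model-sync",
        "image": image,
        # Runs to completion before the inference container starts.
        "command": ["aws", "s3", "sync", model_source, model_path],
        "volumeMounts": [{"name": volume_name, "mountPath": model_path}],
    }


spec = build_model_sync_init_container("s3://gco-models-xxx/models/llama3-8b")
print(spec["command"])
```

Because aws s3 sync only copies objects that are missing or changed at the destination, restarting a pod against an already populated volume should be much faster than the first download.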
# Deploy vLLM serving a model (downloads from HuggingFace at startup)
gco inference deploy my-llm \
-i vllm/vllm-openai:v0.20.1 \
--gpu-count 1 \
-e MODEL=meta-llama/Llama-3.1-8B-Instruct

# Upload weights first
gco models upload ./llama3-weights/ --name llama3-8b
# Deploy with model sync from S3
gco inference deploy my-llm \
-i vllm/vllm-openai:v0.20.1 \
--gpu-count 1 \
--model-source $(gco models uri llama3-8b) \
-e MODEL=/models/my-llm

gco inference deploy ENDPOINT_NAME \
--image IMAGE # Container image (required)
--region REGION # Target region(s), repeatable (default: all)
--replicas N # Replicas per region (default: 1)
--gpu-count N # GPUs per replica (default: 1)
--gpu-type TYPE # GPU instance type hint (e.g. g5.xlarge)
--port PORT # Container port (default: 8000)
--model-path PATH # EFS path for model weights
--model-source S3_URI # S3 URI for auto-sync via init container
--health-path PATH # Health check endpoint (default: /health)
--env KEY=VALUE # Environment variable, repeatable
--namespace NS # K8s namespace (default: gco-inference)
--label KEY=VALUE # Label, repeatable

For each target region, the inference_monitor creates:
- Deployment: runs the inference container with GPU resource requests and an optional init container for S3 model sync
- Service: ClusterIP service exposing the container port
- Ingress rule: ALB path at /inference/<endpoint-name> for external access via Global Accelerator
GCO works with any containerized inference server. These frameworks have example manifests in examples/:
| Framework | Image Example | Default Port | Health Path | Use Case |
|---|---|---|---|---|
| vLLM | vllm/vllm-openai:v0.20.1 | 8000 | /health | OpenAI-compatible LLM serving |
| TGI | ghcr.io/huggingface/text-generation-inference:3.3.7 | 8080 | /health | HuggingFace model serving |
| Triton | nvcr.io/nvidia/tritonserver:24.01-py3 | 8000 | /v2/health/ready | Multi-framework model serving |
| TorchServe | pytorch/torchserve:latest-gpu | 8080 | /ping | PyTorch model serving |
| SGLang | lmsysorg/sglang:v0.5.10 | 30000 | /health | High-throughput LLM serving with RadixAttention |

gco inference deploy vllm-llama3 \
-i vllm/vllm-openai:v0.20.1 \
--gpu-count 1 \
-e MODEL=meta-llama/Llama-3.1-8B-Instruct \
-e MAX_MODEL_LEN=4096

gco inference deploy tgi-mistral \
-i ghcr.io/huggingface/text-generation-inference:3.3.7 \
--port 8080 \
--health-path /health \
--gpu-count 1 \
-e MODEL_ID=mistralai/Mistral-7B-Instruct-v0.2

gco inference deploy triton-models \
-i nvcr.io/nvidia/tritonserver:24.01-py3 \
--port 8000 \
--health-path /v2/health/ready \
--gpu-count 1 \
--model-source s3://your-bucket/models/triton-repo

gco inference deploy torchserve-resnet \
-i pytorch/torchserve:latest-gpu \
--port 8080 \
--health-path /ping \
--gpu-count 1 \
--model-source s3://your-bucket/models/torchserve-mar

# List all endpoints
gco inference list
# Filter by state
gco inference list --state running
# Filter by region
gco inference list -r us-east-1

# Scale to 4 replicas (applied across all target regions)
gco inference scale my-llm --replicas 4

Inference endpoints support Horizontal Pod Autoscaler (HPA) for automatic scaling based on resource utilization. When autoscaling is enabled, the inference_monitor creates a Kubernetes HPA alongside the Deployment.
# Deploy with autoscaling enabled
gco inference deploy my-llm \
-i vllm/vllm-openai:v0.20.1 \
--replicas 2 --gpu-count 1 \
--min-replicas 1 --max-replicas 8 \
--autoscale-metric cpu:70 --autoscale-metric memory:80

Supported metrics:

| Metric | Description | Example |
|---|---|---|
| cpu | CPU utilization percentage | cpu:70 (scale at 70% CPU) |
| memory | Memory utilization percentage | memory:80 (scale at 80% memory) |

The --autoscale-metric flag is repeatable — you can combine multiple metrics. The format is type:target where target is the utilization percentage threshold. If no target is specified, it defaults to 70%.
The HPA respects --min-replicas (default: 1) and --max-replicas (default: 10) bounds. The --replicas flag sets the initial replica count before the HPA takes over.
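The type:target format and its 70% default can be illustrated with a short parser. This is a sketch of the documented behavior, not the CLI's implementation: `parse_autoscale_metric` and `to_hpa_metric` are hypothetical helper names, while the emitted structure follows the standard Kubernetes autoscaling/v2 resource metric shape.

```python
DEFAULT_TARGET = 70
SUPPORTED_METRICS = {"cpu", "memory"}


def parse_autoscale_metric(value):
    """Parse 'cpu:70' or 'memory' into a (metric, target_percent) pair."""
    kind, sep, target = value.partition(":")
    kind = kind.strip().lower()
    if kind not in SUPPORTED_METRICS:
        raise ValueError(f"unsupported metric: {kind!r}")
    if not sep or not target.strip():
        return kind, DEFAULT_TARGET
    percent = int(target)  # raises ValueError on non-numeric input
    if not 1 <= percent <= 100:
        raise ValueError("target must be a percentage between 1 and 100")
    return kind, percent


def to_hpa_metric(kind, percent):
    """Render one entry for the metrics list of an autoscaling/v2 HPA."""
    return {
        "type": "Resource",
        "resource": {
            "name": kind,
            "target": {"type": "Utilization", "averageUtilization": percent},
        },
    }


for flag in ("cpu:70", "memory:80", "cpu"):
    print(to_hpa_metric(*parse_autoscale_metric(flag)))
```

When several metrics are configured, the HPA computes a desired replica count for each and uses the largest, so adding a metric can only make scaling more eager, never less.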
# Triggers a rolling update in all target regions
gco inference update-image my-llm -i vllm/vllm-openai:v0.20.1

# Stop (scales to zero, keeps configuration)
gco inference stop my-llm -y
# Start (restores previous replica count)
gco inference start my-llm

# Mark for deletion; inference_monitor cleans up K8s resources in each region
gco inference delete my-llm -y

Canary deployments let you test a new model version with a percentage of traffic before fully rolling it out. The primary deployment continues serving most traffic while the canary receives a configurable slice.
# Start a canary: 10% of traffic goes to the canary image, 90% stays on the current primary
gco inference canary my-llm -i vllm/vllm-openai:v0.20.1 --weight 10
# Increase canary traffic to 25%
gco inference canary my-llm -i vllm/vllm-openai:v0.20.1 --weight 25
# Happy with the canary? Promote it to primary (100% traffic)
gco inference promote my-llm -y
# Something wrong? Roll back (removes canary, 100% to primary)
gco inference rollback my-llm -y

How it works:
- canary stores the canary config (image, weight, replicas) in the endpoint spec in DynamoDB
- The inference_monitor creates a second deployment ({name}-canary) and service in each target region
- The ingress is updated with ALB weighted routing annotations to split traffic
- promote swaps the primary image to the canary image and removes the canary
- rollback removes the canary deployment and restores 100% traffic to the primary
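To show how a canary weight could translate into weighted routing, the sketch below builds a forward action in the JSON shape used by the AWS Load Balancer Controller's `alb.ingress.kubernetes.io/actions.<name>` annotation. Whether GCO emits exactly this annotation under EKS Auto Mode is an assumption; the helper names and the default port are hypothetical.

```python
import json


def weighted_forward_action(name, canary_weight, port=8000):
    """Split traffic between the primary Service and its {name}-canary Service."""
    if not 0 <= canary_weight <= 100:
        raise ValueError("canary_weight must be between 0 and 100")
    return {
        "type": "forward",
        "forwardConfig": {
            "targetGroups": [
                {"serviceName": name, "servicePort": str(port),
                 "weight": 100 - canary_weight},
                {"serviceName": f"{name}-canary", "servicePort": str(port),
                 "weight": canary_weight},
            ]
        },
    }


def canary_annotation(name, canary_weight, port=8000):
    """Return the annotation key and JSON value for the weighted action."""
    key = f"alb.ingress.kubernetes.io/actions.{name}"
    return {key: json.dumps(weighted_forward_action(name, canary_weight, port))}


print(canary_annotation("my-llm", 10))
```

In this model a rollback corresponds to a canary weight of 0, which sends all traffic back to the primary Service.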
Use spot instances to reduce inference serving costs. Spot GPU instances can be significantly cheaper than on-demand instances, but they can be interrupted with a 2-minute warning.
# Deploy on spot instances
gco inference deploy my-llm -i vllm/vllm-openai:v0.20.1 --gpu-count 1 --capacity-type spot
# Deploy on on-demand (default, guaranteed availability)
gco inference deploy my-llm -i vllm/vllm-openai:v0.20.1 --gpu-count 1 --capacity-type on-demand

When --capacity-type spot is set, the inference_monitor adds a karpenter.sh/capacity-type: spot node selector to the deployment. Karpenter then provisions spot GPU instances for those pods.
When to use spot for inference:
- Development and testing environments
- Non-critical inference endpoints with multiple replicas (if one gets interrupted, others continue serving)
- Cost-sensitive workloads where occasional brief interruptions are acceptable
When to use on-demand (default):
- Production inference endpoints requiring high availability
- Single-replica deployments where interruption means downtime
Use --accelerator neuron to deploy inference on AWS Trainium or Inferentia instances instead of NVIDIA GPUs. This uses aws.amazon.com/neuron resources and schedules on the Neuron nodepool.
# Deploy on Trainium or Inferentia
gco inference deploy my-model \
-i public.ecr.aws/neuron/your-neuron-image:latest \
--gpu-count 1 --accelerator neuron

To target a specific instance family (e.g., Inferentia only), add --node-selector:
gco inference deploy my-model \
-i public.ecr.aws/neuron/your-neuron-image:latest \
--gpu-count 1 --accelerator neuron \
--node-selector eks.amazonaws.com/instance-family=inf2

Container images must include the Neuron runtime. Use images from public.ecr.aws/neuron/ or build your own with the Neuron SDK.
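The scheduling effect of --capacity-type, --accelerator, and --node-selector can be summarized in one hypothetical helper that builds the relevant pod spec fragment. The spot node selector and the aws.amazon.com/neuron resource come from the text above; the `nvidia.com/gpu` resource name and the `"nvidia"` accelerator value are assumptions based on common Kubernetes conventions, and the function itself is only a sketch.

```python
def build_scheduling(gpu_count, accelerator="nvidia", capacity_type="on-demand",
                     node_selector=None):
    """Return the resource limits and nodeSelector for an inference pod."""
    if accelerator == "neuron":
        resource = "aws.amazon.com/neuron"
    elif accelerator == "nvidia":
        resource = "nvidia.com/gpu"  # assumed resource name for NVIDIA GPUs
    else:
        raise ValueError(f"unknown accelerator: {accelerator!r}")
    selector = dict(node_selector or {})
    if capacity_type == "spot":
        selector["karpenter.sh/capacity-type"] = "spot"
    return {
        "resources": {"limits": {resource: gpu_count}},
        "nodeSelector": selector,
    }


print(build_scheduling(1, capacity_type="spot"))
print(build_scheduling(
    1,
    accelerator="neuron",
    node_selector={"eks.amazonaws.com/instance-family": "inf2"},
))
```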
Once an endpoint is running, you can send prompts and chat conversations directly from the CLI or via MCP tools.
# Simple prompt (auto-detects framework from container image)
gco inference invoke my-llm -p "What is GPU orchestration?"
# With max tokens
gco inference invoke my-llm -p "Explain Kubernetes" --max-tokens 200
# Explicit API path
gco inference invoke my-llm -p "Hello" --path /v1/completions
# Raw JSON body for full control
gco inference invoke my-llm -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "Hello", "max_tokens": 50}'

The CLI auto-detects the serving framework from the container image and builds the appropriate request body:
- vLLM → /v1/completions (OpenAI-compatible)
- TGI → /generate (HuggingFace format)
- Triton → /v2/models (Triton HTTP API)
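A minimal sketch of this kind of detection is shown below. The path mapping comes from the list above; the substring hints, function names, and request bodies are assumptions. The vLLM body follows the OpenAI completions format and the TGI body follows the HuggingFace /generate format; Triton is omitted from body construction because its payload depends on the model.

```python
FRAMEWORK_PATHS = {
    "vllm": "/v1/completions",
    "tgi": "/generate",
    "triton": "/v2/models",
}

# Assumed substrings used to recognize each framework from its image name.
IMAGE_HINTS = (
    ("vllm", "vllm"),
    ("text-generation-inference", "tgi"),
    ("tritonserver", "triton"),
)


def detect_framework(image):
    """Return the framework key for a container image, or None if unknown."""
    lowered = image.lower()
    for hint, framework in IMAGE_HINTS:
        if hint in lowered:
            return framework
    return None


def build_request(image, prompt, max_tokens=100):
    """Return (path, body) for a single-turn completion request."""
    framework = detect_framework(image)
    if framework == "vllm":
        return FRAMEWORK_PATHS[framework], {"prompt": prompt, "max_tokens": max_tokens}
    if framework == "tgi":
        return FRAMEWORK_PATHS[framework], {
            "inputs": prompt,
            "parameters": {"max_new_tokens": max_tokens},
        }
    raise ValueError("no default request body for this image; pass raw JSON instead")


print(build_request("vllm/vllm-openai:v0.20.1", "Hello", 50))
```

When detection fails, falling back to an explicit --path with a raw -d body keeps the request under your control.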
For multi-turn conversations with chat models, use the /v1/chat/completions path:
# Chat-style request via raw JSON
gco inference invoke my-llm \
-d '{"messages": [{"role": "user", "content": "What is Kubernetes?"}], "max_tokens": 256}' \
--path /v1/chat/completions

The MCP server exposes a dedicated chat_inference tool that accepts a messages array directly, making it easy for AI agents to have multi-turn conversations with your endpoints.
Verify an endpoint is ready before sending requests:
# Check health (routes via Global Accelerator to nearest region)
gco inference health my-llm
# Check a specific region
gco inference health my-llm -r us-east-1

Returns HTTP status and round-trip latency in milliseconds.
Query which models are loaded on an endpoint (OpenAI-compatible servers):
gco inference models my-llm

Returns the /v1/models response including model IDs, context lengths, and metadata.
The MCP server exposes four inference interaction tools so AI agents can use your endpoints programmatically:
| Tool | Description |
|---|---|
| invoke_inference | Single-turn text completion with auto framework detection |
| chat_inference | Multi-turn chat with OpenAI-compatible messages format |
| inference_health | Health check with latency reporting |
| list_endpoint_models | Discover loaded models via /v1/models |

Both invoke_inference and chat_inference support a stream parameter. When enabled, the request is sent with streaming mode, which reduces time-to-first-token for long generations.
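For readers consuming a streamed response directly, the sketch below parses a server-sent event stream in the format used by OpenAI-compatible servers, where each chunk arrives as a `data:` line and the stream ends with `data: [DONE]`. How the MCP tools handle streaming internally is not documented here, so treat `iter_stream_text` as a hypothetical client-side helper.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style streaming response lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        choice = json.loads(payload)["choices"][0]
        # Completions chunks carry "text"; chat chunks carry "delta.content".
        text = choice.get("text") or choice.get("delta", {}).get("content")
        if text:
            yield text


sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```

Printing each fragment as it arrives is what produces the lower time-to-first-token; the total generation time is unchanged.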
Each regional stack can include a Valkey Serverless cache for microsecond-latency key-value storage. Common inference use cases:
- Prompt caching (avoid re-computing identical prompts)
- Session state for multi-turn conversations
- Feature stores for real-time model inputs
- Rate limiting and request deduplication
Enable in cdk.json:
"valkey": {
"enabled": true,
"max_data_storage_gb": 5,
"max_ecpu_per_second": 5000
}

The endpoint is discoverable via SSM parameter /{project}/valkey-endpoint-{region}. See Customization Guide for full configuration options and examples/valkey-cache-job.yaml for a working example.
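A small sketch of looking up that parameter from Python follows. The parameter name pattern is taken from the sentence above; the helper names and the example project value `gco` are assumptions, and the lookup uses the standard boto3 SSM `get_parameter` call.

```python
def valkey_parameter_name(project, region):
    """Build the SSM parameter name that holds the regional Valkey endpoint."""
    return f"/{project}/valkey-endpoint-{region}"


def fetch_valkey_endpoint(project, region):
    """Read the Valkey endpoint from SSM Parameter Store in the given region."""
    import boto3  # imported lazily so the name helper works without AWS dependencies

    ssm = boto3.client("ssm", region_name=region)
    response = ssm.get_parameter(Name=valkey_parameter_name(project, region))
    return response["Parameter"]["Value"]


# "gco" is a placeholder project name; substitute your own.
print(valkey_parameter_name("gco", "us-east-1"))
```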
GCO's inference endpoints pair well with Retrieval-Augmented Generation (RAG) workflows. Here's how the components fit together:
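Use the Valkey cache to avoid redundant inference calls for repeated prompts. The snippet keys the cache on a SHA-256 hash of the prompt, so only identical prompts produce a hit; get_embedding is included as a starting point for semantic matching, which the snippet does not implement.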
import hashlib
import json

import boto3
import valkey

# Connect to Valkey and Bedrock
cache = valkey.Valkey(host="VALKEY_ENDPOINT", port=6379, ssl=True)
bedrock = boto3.client("bedrock-runtime")


def get_embedding(text):
    """Get embedding from Amazon Bedrock."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]


def cached_inference(prompt, inference_fn):
    """Check cache before calling inference."""
    # Exact-match key: only byte-identical prompts hit the cache
    cache_key = f"prompt:{hashlib.sha256(prompt.encode()).hexdigest()}"
    cached = cache.get(cache_key)
    if cached:
        return json.loads(cached)
    result = inference_fn(prompt)
    cache.setex(cache_key, 3600, json.dumps(result))  # cache 1 hour
    return result

For the retrieval component of RAG, GCO doesn't include a built-in vector database; this is intentional, to avoid being opinionated about a rapidly evolving space. Recommended options:
| Option | Best For | Managed |
|---|---|---|
| Amazon OpenSearch Serverless | Production RAG with full-text + vector search | Yes |
| Amazon Bedrock Knowledge Bases | Fully managed RAG with zero infrastructure | Yes |
| pgvector on Amazon RDS | Teams already using PostgreSQL | Yes |
| ElastiCache Valkey 8.2 (node-based) | Microsecond-latency vector search at scale | Yes |
| ChromaDB / Qdrant on EKS | Self-hosted, full control | No |
A typical RAG flow with GCO:
User query
→ Valkey cache check (semantic cache hit?)
→ If miss: embed query (Bedrock Titan)
→ Vector search (OpenSearch / Bedrock KB / pgvector)
→ Augment prompt with retrieved context
→ Inference endpoint (gco inference invoke)
→ Cache result in Valkey
→ Return to user
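The flow above can be sketched as a single function with each stage injected as a callable, so the same skeleton works with any cache, embedding model, vector store, or inference client. All names here are hypothetical, and the prompt template is only an example of how retrieved context might be combined with the query.

```python
def answer_with_rag(query, cache_get, cache_set, embed, search, infer, top_k=3):
    """Run one query through cache check, retrieval, inference, and cache fill."""
    cached = cache_get(query)
    if cached is not None:
        return cached
    vector = embed(query)
    passages = search(vector, top_k)
    context = "\n".join(passages)
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    answer = infer(prompt)
    cache_set(query, answer)
    return answer


# Usage with in-memory stand-ins for the real services.
store = {}
result = answer_with_rag(
    "What is GCO?",
    cache_get=store.get,
    cache_set=store.__setitem__,
    embed=lambda text: [float(len(text))],
    search=lambda vector, k: ["GCO orchestrates GPU capacity."][:k],
    infer=lambda prompt: f"answered with {len(prompt)} prompt characters",
)
print(result)
```

Injecting the stages also makes it straightforward to replace the exact-match cache lookup with a semantic one later without touching the rest of the pipeline.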
See examples/valkey-cache-job.yaml for a working Valkey caching example.
By default, gco inference deploy targets all deployed regions. This is the recommended approach because Global Accelerator routes users to the nearest healthy region — if an endpoint only exists in some regions, users routed to a region without it will get a 404.
# Deploy to all regions (recommended — ensures consistent global routing)
gco inference deploy my-llm \
-i vllm/vllm-openai:v0.20.1
# Deploy to specific regions (use with caution — see note below)
gco inference deploy my-llm \
-i vllm/vllm-openai:v0.20.1 \
-r us-east-1 -r eu-west-1

Routing caveat: If you deploy to a subset of regions, Global Accelerator may route users to a region where the endpoint doesn't exist. The CLI warns you about this. For production inference, deploy to all regions or ensure your users only connect from regions where the endpoint is available.
Once deployed, inference endpoints are accessible through Global Accelerator at:
https://<GA_ENDPOINT>/inference/<endpoint-name>/
Global Accelerator automatically routes requests to the nearest healthy region. If a region becomes unhealthy, traffic fails over to the next closest region.
Each region independently reconciles and reports its status:
gco inference status my-llm

Endpoint: my-llm
------------------------------------------------------------
State: running
Image: vllm/vllm-openai:v0.20.1
Replicas: 2
GPUs: 1
Port: 8000
Path: /inference/my-llm
Namespace: gco-inference
Created: 2025-01-15T10:30:00+00:00
Region Status:
REGION STATE READY DESIRED LAST SYNC
-----------------------------------------------------------------
us-east-1 running 2 2 2025-01-15T10:35:00
eu-west-1 running 2 2 2025-01-15T10:35:12
# Detailed status with per-region breakdown
gco inference status my-llm
# Quick list of all endpoints
gco inference list

# Check pods
kubectl get pods -n gco-inference --context arn:aws:eks:us-east-1:ACCOUNT:cluster/gco-us-east-1
# Check deployment rollout
kubectl rollout status deployment/my-llm -n gco-inference
# View logs
kubectl logs -n gco-inference deployment/my-llm

| State | Description |
|---|---|
| deploying | Endpoint registered, waiting for inference_monitor to create resources |
| running | All target regions have healthy replicas |
| stopped | Scaled to zero, configuration preserved |
| deleted | Marked for deletion, inference_monitor cleaning up resources |

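The "running" condition in the table, where all target regions have healthy replicas, can be expressed as a simple check over per-region status. This is an illustrative sketch: the `ready` and `desired` keys mirror the READY and DESIRED columns of the status output, but the dictionary layout and function name are assumptions rather than the DynamoDB schema.

```python
def all_regions_ready(region_status, target_regions):
    """True when every target region reports at least its desired replica count."""
    for region in target_regions:
        status = region_status.get(region)
        if status is None:
            return False  # region has not reported yet
        if status["desired"] <= 0 or status["ready"] < status["desired"]:
            return False
    return True


status = {
    "us-east-1": {"ready": 2, "desired": 2},
    "eu-west-1": {"ready": 1, "desired": 2},
}
print(all_regions_ready(status, ["us-east-1"]))
print(all_regions_ready(status, ["us-east-1", "eu-west-1"]))
```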
# 1. Check GPU capacity
gco capacity check -i g5.xlarge -r us-east-1
# 2. Upload model weights (optional — vLLM can download from HuggingFace)
gco models upload ./llama3-weights/ --name llama3-8b
# 3. Deploy the endpoint
gco inference deploy vllm-llama3 \
-i vllm/vllm-openai:v0.20.1 \
--gpu-count 1 \
--model-source $(gco models uri llama3-8b) \
-e MODEL=/models/vllm-llama3 \
-r us-east-1
# 4. Monitor deployment
gco inference status vllm-llama3
# 5. Test the endpoint
curl https://GA_ENDPOINT/inference/vllm-llama3/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "/models/vllm-llama3", "prompt": "Hello", "max_tokens": 50}'
# 6. Scale up for production
gco inference scale vllm-llama3 --replicas 3
# 7. Update to a new version
gco inference update-image vllm-llama3 -i vllm/vllm-openai:v0.20.1
# 8. Clean up
gco inference delete vllm-llama3 -y

For development or quick testing, you can apply example manifests directly:
# Apply a vLLM example manifest directly
gco jobs submit-direct examples/inference-vllm.yaml -r us-east-1
# Other available examples:
# examples/inference-tgi.yaml
# examples/inference-triton.yaml
# examples/inference-torchserve.yaml
# examples/inference-sglang.yaml
# examples/model-download-job.yaml

Note: Direct manifest submission creates resources in a single region only. For multi-region production deployments, use gco inference deploy.
Related documentation:
- CLI Reference: full command reference for inference and models commands
- Architecture Details: infrastructure deep dive
- Quick Start Guide: get running in under 60 minutes