Commit 5821e09

hosting: add Tier 3 (Kubernetes) — pod-per-session gateway, egress lockdown, kind quickstart
- gateway (FastAPI): auth → Redis session map → standby-pool claim → SSE relay
- k8s.py lifted from create-claude-agent (thanks @joeshamon @benlehrburger)
- egress-proxy + NetworkPolicy: agent pods reach api.anthropic.com only
- kind-quickstart.sh: Calico CNI, API-key preflight, end-to-end tested on macOS
- notebook: fill Tier-3 cell w/ kind quickstart pointer

Tested: claim/replenish, idle reaper, SSE end-to-end, egress lockdown (ENETUNREACH).
1 parent 2775554 commit 5821e09

17 files changed

Lines changed: 1828 additions & 0 deletions

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
1+
# generated by generate-certs.sh — local-only, never commit
2+
certs/
Lines changed: 269 additions & 0 deletions
@@ -0,0 +1,269 @@
1+
# Tier 3 — Kubernetes (pod-per-session)
2+
3+
> Part of the [Agent SDK hosting cookbook](../../07_Hosting_the_agent.ipynb).
4+
> If you haven't picked a hosting tier yet, start there — it covers when a
5+
> managed option is the better fit and when you actually need this.
6+
7+
Run the agent on a Kubernetes cluster where every session gets its own
8+
isolated pod, with network-level controls ensuring agent pods can only reach
9+
the Anthropic API.
10+
11+
```
12+
┌──────────────────────────────────────────────────┐
13+
│ Kubernetes │
14+
│ │
15+
curl / SDK ──────► Gateway (FastAPI) │
16+
│ ├─ creates/deletes agent pods via K8s API │
17+
│ ├─ routes /sessions/{id}/messages to right pod │
18+
│ └─ session → pod mapping stored in Redis │
19+
│ │
20+
┌──────┴──────┐ │
21+
│ │ │
22+
Agent Pod Agent Pod ──► Egress Proxy ──► api.anthropic.com │
23+
(session A) (session B) ▲ │
24+
│ │ │ │
25+
│ NetworkPolicy: pods can ONLY reach egress-proxy │
26+
│ │
27+
Redis (session → pod-IP mapping) │
28+
│ │
29+
└──────────────────────────────────────────────────┘
30+
```
31+
32+
The agent image is the **same one** Tier 1 builds from
33+
[`hosting/Dockerfile`](../Dockerfile). Same image, different machinery: instead
34+
of a single container or a Modal sandbox, the gateway gives each session its
35+
own pod and the cluster enforces what that pod can reach.
36+
37+
> **Before you self-host:** if you just want a hosted agent without running
38+
> infrastructure, use Anthropic's managed option — see the
39+
> [Hosting overview](../README.md). This guide is for teams that need the
40+
> agent on their own Kubernetes cluster (regulated environments, existing
41+
> platform, custom networking).
42+
43+
44+
## Why each piece exists
45+
46+
**Gateway** — Each user session gets its own agent pod. Something has to create
47+
those pods on demand, route traffic to the right one, and clean them up when
48+
sessions go idle. That's the gateway. It talks to the Kubernetes API to manage
49+
pod lifecycles and uses Redis to remember which session maps to which pod IP.
50+
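The routing decision described above can be sketched in a few lines. This is an illustration only, not the code in `gateway/main.py`: the `SessionRouter` class, its method names, and the in-memory dict standing in for Redis are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class SessionRouter:
    """Hypothetical stand-in for the gateway's session -> pod-IP lookup."""

    # Called when a session has no pod yet; returns the new pod's IP.
    # In a real gateway this would go through the Kubernetes API.
    claim_pod: Callable[[], str]
    # Plain dict standing in for Redis in this sketch.
    store: Dict[str, str] = field(default_factory=dict)

    def resolve(self, session_id: str) -> str:
        """Return the pod IP for a session, claiming a pod on first use."""
        pod_ip = self.store.get(session_id)
        if pod_ip is None:
            pod_ip = self.claim_pod()
            self.store[session_id] = pod_ip
        return pod_ip

    def end(self, session_id: str) -> bool:
        """Drop the mapping; returns True if the session existed."""
        return self.store.pop(session_id, None) is not None


if __name__ == "__main__":
    ips = iter(["10.0.0.5", "10.0.0.6"])
    router = SessionRouter(claim_pod=lambda: next(ips))
    print(router.resolve("demo"))   # first request claims a pod
    print(router.resolve("demo"))   # same session, same pod
    print(router.resolve("other"))  # new session, new pod
```

Keeping the lookup behind one small interface is what lets the mapping live in Redis rather than in gateway memory.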
51+
**Egress proxy + NetworkPolicy** — Agents run arbitrary code. This pair ensures
52+
agent pods can reach `api.anthropic.com` and *nothing else*. The NetworkPolicy
53+
blocks all outbound traffic except to the egress proxy (port 443) and DNS
54+
(port 53). The egress proxy terminates TLS from the agent, then re-encrypts the
55+
request to Anthropic's API. Any attempt to reach the internet, other services,
56+
or other namespaces is dropped at the network level.
57+
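Written out, the egress rule described above might look roughly like the following. This is a sketch, not the contents of `manifests/network-policy.yaml`: the policy name and the `app: egress-proxy` label are assumptions, while the `role: agent` label and the `claude-agent` namespace match the `kubectl` commands used elsewhere in this README. It builds the policy as a Python dict and prints JSON, which `kubectl apply -f -` accepts.

```python
import json


def agent_egress_policy(namespace: str = "claude-agent") -> dict:
    """Build a NetworkPolicy that limits agent pods to the proxy and DNS."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        # The policy name is hypothetical.
        "metadata": {"name": "agent-egress-lockdown", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"role": "agent"}},
            "policyTypes": ["Egress"],
            "egress": [
                {
                    # HTTPS to the egress proxy only (label is an assumption).
                    "to": [{"podSelector": {"matchLabels": {"app": "egress-proxy"}}}],
                    "ports": [{"protocol": "TCP", "port": 443}],
                },
                {
                    # DNS lookups; no "to" clause means any destination.
                    "ports": [
                        {"protocol": "UDP", "port": 53},
                        {"protocol": "TCP", "port": 53},
                    ],
                },
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(agent_egress_policy(), indent=2))
```

Because `policyTypes` lists `Egress`, any outbound traffic from the selected pods that matches neither rule is denied.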
58+
**Redis** — The gateway needs to remember which pod is handling which session.
59+
When a request arrives, it looks up the session ID in Redis to find the pod IP
60+
and routes traffic there. Redis runs separately from the gateway and persists to disk, so mappings survive gateway
61+
restarts.
62+
63+
**Standby pool** — Pods take 10–30 seconds to start (image pull + container
64+
boot). The gateway pre-warms a configurable number of standby pods so new
65+
sessions can claim one instantly instead of waiting. After a pod is claimed,
66+
the pool replenishes in the background.
67+
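The claim-then-replenish cycle can be sketched as follows. The `StandbyPool` class and its `spawn` callback are hypothetical names for illustration, not the interface of `gateway/k8s.py`, and this sketch replenishes synchronously where the gateway does so in the background.

```python
import itertools
from collections import deque
from typing import Callable, Deque


class StandbyPool:
    """Hypothetical warm-pod pool: claim pops a ready pod, replenish refills."""

    def __init__(self, size: int, spawn: Callable[[], str]) -> None:
        self.size = size
        self.spawn = spawn              # creates one pod, returns its name
        self.ready: Deque[str] = deque()
        self.replenish()

    def replenish(self) -> int:
        """Spawn pods until the pool is back at its target size."""
        created = 0
        while len(self.ready) < self.size:
            self.ready.append(self.spawn())
            created += 1
        return created

    def claim(self) -> str:
        """Hand out a warm pod if one exists, else spawn one on demand."""
        if self.ready:
            return self.ready.popleft()
        return self.spawn()             # cold path: caller waits for startup


if __name__ == "__main__":
    counter = itertools.count(1)
    pool = StandbyPool(size=2, spawn=lambda: f"agent-standby-{next(counter)}")
    print(pool.claim())        # a pre-warmed pod, handed out immediately
    print(len(pool.ready))     # one warm pod left
    print(pool.replenish())    # number of pods spawned to refill the pool
```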
68+
## Prerequisites
69+
70+
| Tool | What it's for |
71+
|------|---------------|
72+
| [kind](https://kind.sigs.k8s.io/) | Local Kubernetes cluster in Docker |
73+
| [kubectl](https://kubernetes.io/docs/tasks/tools/) | Applying manifests, inspecting the cluster |
74+
| [docker](https://docs.docker.com/get-docker/) | Building container images |
75+
| `openssl` | Generating the egress proxy's TLS certificate |
76+
| `ANTHROPIC_API_KEY` | Set as env var |
77+
78+
## Quickstart (local, with kind)
79+
80+
```bash
81+
cd hosting/kubernetes
82+
export ANTHROPIC_API_KEY=sk-ant-...
83+
./kind-quickstart.sh
84+
```
85+
86+
This builds the three images, loads them into a local `kind` cluster, applies
87+
every manifest, and port-forwards the gateway to `localhost:8080`.
88+
89+
## Talk to it
90+
91+
Same path and shape as Tier 1/2 — only the base URL changes:
92+
93+
```bash
94+
curl -N -X POST http://localhost:8080/sessions/demo/messages \
95+
-H 'Content-Type: application/json' \
96+
-d '{"prompt": "What tools do you have?"}'
97+
```
98+
99+
The first request on a new `session_id` claims a standby pod (or spawns one if
100+
the pool is empty). Subsequent requests with the same `session_id` route to the
101+
same pod, so the agent sees a continuous conversation.
102+
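For callers that prefer Python to curl, the same request can be made with the standard library. This is a sketch under assumptions: it expects the gateway to emit standard SSE `data:` lines, it does not interpret the payloads (their shape is not specified here), and the helper names `iter_sse_data` and `send_message` are hypothetical.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the payload of each `data:` line from an SSE byte stream."""
    for raw in lines:
        line = raw.decode("utf-8").rstrip("\r\n")
        if not line.startswith("data:"):
            continue                    # skip event names, comments, blanks
        payload = line[len("data:"):]
        if payload.startswith(" "):     # SSE allows one optional space
            payload = payload[1:]
        yield payload


def send_message(base_url: str, session_id: str, prompt: str) -> Iterator[str]:
    """POST a prompt and stream back SSE data payloads as they arrive."""
    req = urllib.request.Request(
        f"{base_url}/sessions/{session_id}/messages",
        data=json.dumps({"prompt": prompt}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        # An HTTP response object iterates line by line as bytes.
        yield from iter_sse_data(resp)
```

With the quickstart's port-forward running, `for chunk in send_message("http://localhost:8080", "demo", "What tools do you have?"): print(chunk)` would print each payload as it streams in.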
103+
Watch the machinery work:
104+
105+
```bash
106+
kubectl -n claude-agent get pods -w
107+
# you'll see agent-standby-* pods appear, then one flip to active when you curl
108+
```
109+
110+
To end a session, go through the gateway so the Redis mapping is cleaned up:
111+
112+
```bash
113+
curl -X DELETE http://localhost:8080/sessions/demo
114+
```
115+
116+
(`kubectl delete pod` works too, but leaves a stale `session → pod-IP` entry
117+
in Redis until the next request on that session 502s.)
118+
119+
## Verify the egress lockdown
120+
121+
The agent runs code the model decides to run. The egress proxy + NetworkPolicy
122+
mean a prompt-injected agent still can't reach arbitrary hosts. Prove it:
123+
124+
> `kind-quickstart.sh` installs Calico because kind's default CNI (kindnet)
125+
> doesn't enforce NetworkPolicy. On GKE/EKS/AKS or any Calico/Cilium cluster,
126+
> this section works unchanged, provided NetworkPolicy enforcement is enabled (on managed clusters it often has to be turned on explicitly).
127+
128+
```bash
129+
AGENT_POD=$(kubectl -n claude-agent get pods -l role=agent \
130+
-o jsonpath='{.items[0].metadata.name}')
131+
132+
# This should FAIL: the NetworkPolicy allows egress only to egress-proxy and DNS.
133+
# (The agent image is slim and has no curl, so we use Python's socket.)
134+
kubectl -n claude-agent exec "$AGENT_POD" -- python3 -c \
135+
"import socket; socket.setdefaulttimeout(5); socket.create_connection(('example.com',443)); print('REACHED — policy NOT enforcing')"
136+
```
137+
138+
Expected: `OSError: [Errno 101] Network is unreachable` (or a timeout) and a
139+
non-zero exit. The positive control — that the egress-proxy path *is* open —
140+
was already proven by the curl above returning model output.
141+
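The one-liner above reports failure as a traceback. A slightly longer probe can separate "blocked" from "DNS failed" or "reached". The following is a sketch; the `probe` function and its return strings are made up for this example.

```python
import errno
import socket


def probe(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Try a TCP connect and report how the attempt ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "REACHED"                # policy is NOT enforcing
    except socket.timeout:
        return "BLOCKED (timeout)"          # packets silently dropped
    except socket.gaierror:
        return "DNS FAILED"                 # inconclusive: name did not resolve
    except OSError as exc:
        if exc.errno in (errno.ENETUNREACH, errno.EHOSTUNREACH):
            return f"BLOCKED ({errno.errorcode[exc.errno]})"
        raise                               # anything else is unexpected
```

Run inside an agent pod, `probe("example.com")` should return one of the `BLOCKED` values when the policy is enforcing.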
142+
## Standby pool
143+
144+
`STANDBY_POOL_SIZE` (in the `agent-config` ConfigMap) controls how many warm
145+
pods the gateway keeps ready. Check current state:
146+
147+
```bash
148+
curl http://localhost:8080/api/pool
149+
```
150+
151+
## Persistence
152+
153+
`server.py` persists transcripts (and its caller-ID → SDK-ID map) to
154+
`CLAUDE_CONFIG_DIR=/data`. In this tier that's the pod's ephemeral filesystem,
155+
so:
156+
157+
- **While the pod is alive** (within the idle-timeout window), follow-up
158+
messages resume the conversation exactly as in Tiers 1 and 2.
159+
- **After the pod is reaped**, `/data` is gone. The next message on that
160+
`session_id` gets a fresh pod with no history.
161+
162+
For a cookbook demo this is fine: a session outlives a single curl, but not its pod.
163+
For production you need durable storage that survives pod recycle. Two options:
164+
165+
1. **Mount a PersistentVolumeClaim** at `/data` instead of the pod's local
166+
disk, and have the gateway reattach the same PVC when a session returns.
167+
Works with `server.py` as-is, but couples each session to a volume in one
168+
zone.
169+
2. **Mirror `/data` to external storage** with the SDK's
170+
[`SessionStore`](https://code.claude.com/docs/en/agent-sdk/session-storage):
171+
the local-disk write still happens first; the store is a mirror, and
172+
`mirror_error` is non-fatal. This is the approach the notebook's
173+
*Making it production-ready* section describes — it needs a small hook in
174+
`server.py` that the cookbook hasn't grown yet.
175+
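The ordering in option 2 (local write first, mirror second, mirror failures non-fatal) can be sketched generically. This is not the SDK's `SessionStore` API: the `persist` function, the `mirror` callback, and the `.jsonl` file layout are all hypothetical and only illustrate the pattern.

```python
import logging
from pathlib import Path
from typing import Callable, Optional

log = logging.getLogger("mirror-sketch")


def persist(
    data_dir: Path,
    session_id: str,
    transcript: str,
    mirror: Callable[[str, str], None],
) -> Optional[Exception]:
    """Write locally first, then mirror; a mirror failure is returned, not raised."""
    path = data_dir / f"{session_id}.jsonl"     # hypothetical file layout
    path.write_text(transcript, encoding="utf-8")
    try:
        mirror(session_id, transcript)          # e.g. upload to object storage
    except Exception as exc:                    # non-fatal by design
        log.warning("mirror failed for %s: %s", session_id, exc)
        return exc
    return None
```

The local copy stays authoritative while the pod is alive; the mirror only matters once the pod has been reaped.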
176+
## Deploying to your own cluster
177+
178+
`kind` proves the topology; the manifests are cloud-agnostic. To run on EKS,
179+
AKS, GKE, OpenShift, or bare metal, swap the image registry and the front door:
180+
181+
```bash
182+
REG=your.registry.example.com/claude-agent # ECR, ACR, GHCR, Artifact Registry, ...
183+
184+
# 1. Build and push the three images
185+
docker build -t $REG/agent:latest -f ../Dockerfile ..
186+
docker build -t $REG/gateway:latest ./gateway
187+
docker build -t $REG/egress-proxy:latest ./egress-proxy
188+
for img in agent gateway egress-proxy; do docker push "$REG/$img:latest"; done
189+
190+
# 2. TLS certs for the egress proxy
191+
./generate-certs.sh
192+
193+
# 3. Namespace + secrets + config
194+
kubectl apply -f manifests/namespace.yaml
195+
kubectl -n claude-agent create secret generic anthropic-api-key \
196+
--from-literal=ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY"
197+
kubectl -n claude-agent create secret generic egress-proxy-tls \
198+
--from-file=ca.crt=certs/ca.crt \
199+
--from-file=proxy.crt=certs/proxy.crt \
200+
--from-file=proxy.key=certs/proxy.key
201+
kubectl -n claude-agent create configmap agent-config \
202+
--from-literal=AGENT_IMAGE=$REG/agent:latest \
203+
--from-literal=STANDBY_POOL_SIZE=2
204+
205+
# 4. Apply manifests with your registry substituted
206+
for f in manifests/*.yaml; do
207+
sed "s|REGISTRY_URL|$REG|g" "$f" | kubectl apply -f -
208+
done
209+
```
210+
211+
Then expose the `gateway` Service through whatever your cluster uses for
212+
ingress — a cloud LoadBalancer, an Ingress controller, or a service mesh
213+
gateway. Three things vary by environment:
214+
215+
- **Registry auth** — your nodes need pull credentials for `$REG`
216+
(`imagePullSecrets`, IRSA/Workload Identity, or a public registry).
217+
- **NetworkPolicy enforcement** — the egress lockdown only works if your CNI
218+
enforces `NetworkPolicy` (Cilium, Calico, GKE Dataplane V2, EKS with the
219+
VPC CNI policy add-on). On a CNI that ignores it, agent pods can reach the
220+
internet.
221+
- **TLS + auth in front of the gateway**: `GATEWAY_AUTH_TOKEN` is a
222+
placeholder. Put your IdP / API gateway in front before exposing this
223+
publicly.
224+
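If a shared token stays in place as a stopgap, it should at least be compared in constant time. The sketch below assumes the token arrives in an `Authorization: Bearer <token>` header. How the gateway actually reads `GATEWAY_AUTH_TOKEN` is not shown in this README, and `is_authorized` is a hypothetical helper.

```python
import hmac
from typing import Optional


def is_authorized(authorization_header: Optional[str], expected_token: str) -> bool:
    """Constant-time check of an `Authorization: Bearer <token>` header."""
    if not authorization_header or not expected_token:
        return False                    # missing header or unset token
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    return hmac.compare_digest(
        presented.strip().encode("utf-8"),
        expected_token.encode("utf-8"),
    )
```

A static token like this does not identify tenants, so it is no substitute for an IdP or API gateway in front of the service.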
225+
## What this doesn't give you
226+
227+
- Real authentication or multi-tenancy (the `authenticate()` stub returns one
228+
hard-coded tenant)
229+
- Durable session storage (see [Persistence](#persistence))
230+
- Gateway autoscaling or multi-region routing
231+
- Observability beyond what
232+
[`OTEL_EXPORTER_OTLP_ENDPOINT`](../README.md#observability) gives you for free
233+
234+
## Teardown
235+
236+
```bash
237+
./teardown.sh # kind delete cluster + remove certs/
238+
```
239+
240+
## Layout
241+
242+
```
243+
kubernetes/
244+
├── README.md
245+
├── kind-quickstart.sh # local end-to-end on kind
246+
├── teardown.sh
247+
├── generate-certs.sh # self-signed CA + proxy cert for egress-proxy
248+
├── gateway/
249+
│ ├── main.py # FastAPI: route + reap
250+
│ ├── k8s.py # pod lifecycle + standby pool
251+
│ ├── proxy.py # SSE relay
252+
│ ├── requirements.txt
253+
│ └── Dockerfile
254+
├── egress-proxy/
255+
│ ├── nginx.conf
256+
│ └── Dockerfile
257+
└── manifests/
258+
├── namespace.yaml
259+
├── redis.yaml
260+
├── egress-proxy.yaml
261+
├── gateway.yaml # SA + RBAC + Deployment + Service
262+
└── network-policy.yaml
263+
```
264+
265+
---
266+
267+
The pod lifecycle management (`k8s.py`), egress proxy, and network policy are
268+
adapted from Anthropic's internal `create-claude-agent` harness by Joe Shamon
269+
and Ben Lehrburger.
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
1+
# Egress proxy: TLS-terminating nginx that ONLY forwards to api.anthropic.com.
2+
# Plain http{} reverse-proxy — no stream module needed.
3+
FROM nginx:alpine
4+
5+
COPY nginx.conf /etc/nginx/nginx.conf
6+
7+
EXPOSE 443 80
8+
9+
CMD ["nginx", "-g", "daemon off;"]
Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
1+
# =============================================================================
2+
# Egress Proxy — nginx configuration
3+
# =============================================================================
4+
#
5+
# WHY THIS EXISTS:
6+
# Agent pods can execute arbitrary code (that's the whole point of a coding
7+
# agent). Without network controls, a compromised or misbehaving agent could
8+
# reach any host on the internet — exfiltrating data, attacking internal
9+
# services, or abusing external APIs.
10+
#
11+
# This proxy, combined with a Kubernetes NetworkPolicy, ensures that agent
12+
# pods can ONLY reach api.anthropic.com and nothing else:
13+
#
14+
# 1. The K8s NetworkPolicy blocks all egress from agent pods except to
15+
# this proxy (and DNS).
16+
# 2. This proxy only forwards requests to api.anthropic.com.
17+
#
18+
# Together, they form a strict allowlist for agent network access.
19+
#
20+
# HOW TLS WORKS HERE:
21+
# The proxy terminates TLS using a certificate issued by a local demo CA (generated by
22+
# generate-certs.sh). Agent pods trust this certificate via the
23+
# NODE_EXTRA_CA_CERTS environment variable, which points to the CA cert
24+
# that signed the proxy's certificate. The proxy then makes its own TLS
25+
# connection to the real api.anthropic.com upstream.
26+
#
27+
# Traffic flow:
28+
# Agent pod --[TLS with self-signed cert]--> egress-proxy --[TLS]--> api.anthropic.com
29+
#
30+
# EXTENDING TO MORE ENDPOINTS:
31+
# This demo only proxies the Claude API via ANTHROPIC_BASE_URL. For full
32+
# telemetry and error reporting (statsig.anthropic.com, sentry.io, claude.ai),
33+
# you would use HTTPS_PROXY instead. See:
34+
# https://code.claude.com/docs/en/corporate-proxy
35+
# =============================================================================
36+
37+
user nginx;
38+
worker_processes auto;
39+
error_log /var/log/nginx/error.log info;
40+
pid /run/nginx.pid;
41+
42+
events {
43+
worker_connections 1024;
44+
}
45+
46+
http {
47+
access_log /var/log/nginx/access.log;
48+
49+
# DNS resolver (IPv4 only to avoid Docker network issues)
50+
resolver 8.8.8.8 ipv6=off valid=30s;
51+
52+
# Anthropic API upstream
53+
upstream anthropic_api {
54+
server api.anthropic.com:443;
55+
keepalive 32;
56+
}
57+
58+
# HTTPS server - terminates TLS from agent, proxies to Anthropic API
59+
server {
60+
listen 443 ssl;
61+
server_name egress-proxy;
62+
63+
# Proxy certificate, issued by the self-signed demo CA
64+
ssl_certificate /etc/nginx/certs/proxy.crt;
65+
ssl_certificate_key /etc/nginx/certs/proxy.key;
66+
ssl_protocols TLSv1.2 TLSv1.3;
67+
ssl_ciphers HIGH:!aNULL:!MD5;
68+
69+
# Proxy all requests to Anthropic API
70+
location / {
71+
proxy_pass https://anthropic_api;
72+
proxy_ssl_server_name on;
73+
proxy_ssl_name api.anthropic.com;
74+
75+
# Set correct Host header for Cloudflare
76+
proxy_set_header Host api.anthropic.com;
77+
proxy_set_header X-Real-IP $remote_addr;
78+
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
79+
proxy_set_header X-Forwarded-Proto $scheme;
80+
81+
# Timeouts for long API calls
82+
proxy_connect_timeout 60s;
83+
proxy_send_timeout 300s;
84+
proxy_read_timeout 300s;
85+
86+
# For streaming responses
87+
proxy_buffering off;
88+
proxy_http_version 1.1;
89+
proxy_set_header Connection "";
90+
}
91+
}
92+
93+
# HTTP health check
94+
server {
95+
listen 80;
96+
97+
location /health {
98+
return 200 'OK';
99+
default_type text/plain;
100+
}
101+
102+
location / {
103+
return 403 'Use HTTPS';
104+
}
105+
}
106+
}