Commit afc8d84

Author: Shaw
Merge branch 'pr-7890' into develop
2 parents a90806b + 25e8d64 commit afc8d84

18 files changed

Lines changed: 639 additions & 5 deletions

Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
# Hetzner Control Plane vs Data Plane

eliza Cloud runs on a two-tier Hetzner Cloud setup. This doc nails down the
split so we stop treating manually-created VMs as "infrastructure-by-prayer".

## Layers

```
┌──────────────────────────────────────────────────────────────┐
│    Tier 1: Control plane (static, 1-2 VMs, Terraform)        │
│                                                              │
│    eliza-1 (Hetzner cpx21, fsn1)                             │
│    ├── eliza-provisioning-worker    (systemd, queue consumer)│
│    ├── eliza-agent-router           (systemd, HTTP routing)  │
│    ├── headscale                    (VPN mesh)               │
│    ├── cloudflared tunnel           (public ingress)         │
│    ├── nginx                        (reverse proxy)          │
│    └── (optional: grafana/prometheus)                        │
│                                                              │
│    Lifecycle: long-lived. Replaced on demand, not autoscaled.│
│    Cost: ~€5/mo per VM (cpx21).                              │
└──────────────────────────────────────────────────────────────┘
                              │ enqueue / SSH

┌──────────────────────────────────────────────────────────────┐
│    Tier 2: Data plane (elastic, N cores, runtime autoscale)  │
│                                                              │
│    eliza-core-<hex> (Hetzner cpx32, fsn1)                    │
│    ├── Docker daemon                                         │
│    └── eliza-sandbox containers × N                          │
│                                                              │
│    Lifecycle: created/drained by node-autoscaler.ts based on │
│    real demand. Server limit: ~25 (Hetzner default).         │
│    Cost: elastic (~€11/mo per running cpx32).                │
└──────────────────────────────────────────────────────────────┘
```

## Why two tiers

| Concern | Control plane | Data plane |
|---|---|---|
| **Provisioning** | Terraform (one-shot) | Runtime API (node-autoscaler.ts) |
| **Lifecycle** | Persistent | Ephemeral |
| **State** | Has local state (headscale DB, cloudflared creds) | Stateless |
| **Failure mode** | Page someone | Replace automatically |
| **Cost predictability** | Fixed monthly | Elastic |
| **What lives here** | Orchestrator, routing, monitoring | Just Docker + agents |

The split prevents the "control plane melts with the data plane during a
traffic spike" failure mode. Pulling sandboxes off the data plane is the
autoscaler's job; the orchestrator that issues drain commands must stay up
to coordinate it.

## Code ↔ infrastructure mapping

| Component | Code | Infra |
|---|---|---|
| Control plane VM | [`packages/scripts/cloud/admin/daemons/provisioning-worker.ts`](../../../../scripts/cloud/admin/daemons/provisioning-worker.ts) | [Terraform: `control-plane/`](./control-plane/) |
| Agent router | [`packages/scripts/cloud/admin/daemons/agent-router.ts`](../../../../scripts/cloud/admin/daemons/agent-router.ts) | systemd unit on control-plane VM |
| Data plane autoscaler | [`packages/cloud-shared/src/lib/services/containers/node-autoscaler.ts`](../../../../cloud-shared/src/lib/services/containers/node-autoscaler.ts) | Hetzner Cloud API at runtime |
| Sandbox provisioning | [`packages/cloud-shared/src/lib/services/docker-sandbox-provider.ts`](../../../../cloud-shared/src/lib/services/docker-sandbox-provider.ts) | SSH from control plane to data plane |

## Naming convention

| Layer | Prefix | Example | Where it's set |
|---|---|---|---|
| Control plane VM | `eliza-<n>` | `eliza-1` | Terraform `hcloud_server.control_plane` |
| Data plane node (NEW) | `eliza-core-<hex>` | `eliza-core-38ea87b1` | [`generateNodeId()`](../../../../cloud-shared/src/lib/services/containers/node-autoscaler.ts) |
| Data plane node (LEGACY) | `milady-core-<n>` | `milady-core-1` | DEPRECATED; see [Legacy migration](#legacy-milady-core--migration) |
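
The prefixes in the table can be checked mechanically. The sketch below is illustrative only: `classifyNode` and `exampleNodeId` are hypothetical helpers, and the real `generateNodeId()` in `node-autoscaler.ts` may derive its hex suffix differently. It assumes the suffix is 8 lowercase hex characters, as in the `eliza-core-38ea87b1` example.

```typescript
import { randomBytes } from "node:crypto";

type NodeLayer = "control-plane" | "data-plane" | "legacy-data-plane" | "unknown";

// Hypothetical stand-in for generateNodeId(): 4 random bytes rendered as
// 8 hex characters, matching the shape of `eliza-core-38ea87b1`.
function exampleNodeId(): string {
  return `eliza-core-${randomBytes(4).toString("hex")}`;
}

// Map a server name to its layer using the prefixes from the naming table.
function classifyNode(name: string): NodeLayer {
  if (/^eliza-core-[0-9a-f]+$/.test(name)) return "data-plane";
  if (/^milady-core-\d+$/.test(name)) return "legacy-data-plane";
  if (/^eliza-\d+$/.test(name)) return "control-plane";
  return "unknown";
}
```

A check like this could guard cleanup scripts so they only ever touch `milady-core-*` servers and never the control plane.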

## Legacy `milady-core-*` migration

Before 2026-05 the data plane consisted of 6 manually created `milady-core-*`
VMs (Hetzner cpx32 in fsn1), inserted by hand into `docker_nodes` with
`capacity = 100`. By 2026-05-22:

- All 6 cores were `status: offline` (SSH health-check failing for weeks)
- Several user sandboxes still ran on the underlying Docker daemons
- The cloud autoscaler couldn't account for them

Migration 0132 (`0132_legacy_milady_cores_disable.sql`) sets them to
`enabled = false` and corrects `capacity` to `8`. This:

1. Removes them from autoscaler capacity decisions
2. Stops the health-check noise
3. Lets the autoscaler spin up replacement `eliza-core-<hex>` nodes on demand
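
To illustrate point 1, here is a minimal sketch of how a capacity calculation might skip disabled nodes. The column names (`node_id`, `enabled`, `status`, `capacity`, `allocated_count`) come from this doc. The `DockerNode` type, the `"online"` status value, and the `freeSlots` helper are assumptions for illustration, not the actual logic in `node-autoscaler.ts`.

```typescript
// Assumed shape of a `docker_nodes` row; only the fields used here.
interface DockerNode {
  node_id: string;
  enabled: boolean;
  status: "online" | "offline"; // "online" is an assumed counterpart to "offline"
  capacity: number;
  allocated_count: number;
}

// Free sandbox slots across nodes that may receive new work.
// Disabled or offline nodes contribute nothing, whatever their capacity.
function freeSlots(nodes: DockerNode[]): number {
  return nodes
    .filter((n) => n.enabled && n.status === "online")
    .reduce((sum, n) => sum + Math.max(0, n.capacity - n.allocated_count), 0);
}

const exampleNodes: DockerNode[] = [
  { node_id: "milady-core-1", enabled: false, status: "offline", capacity: 8, allocated_count: 3 },
  { node_id: "eliza-core-38ea87b1", enabled: true, status: "online", capacity: 8, allocated_count: 5 },
];
const exampleFree = freeSlots(exampleNodes); // only the eliza-core node counts
```

With the legacy row disabled, its leftover sandboxes no longer affect the free-slot total, so a scale-up decision would be based only on autoscaled cores.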

Existing sandboxes keep running until their next restart. On a user-triggered
restart or recreate, the daemon provisions them on a fresh autoscaled core.

Once `SELECT SUM(allocated_count) FROM docker_nodes WHERE node_id LIKE 'milady-core-%'`
is `0`, ops can:

```bash
# 1. Delete the Hetzner Cloud servers (one-time, via the Hetzner console or the CLI):
hcloud server delete milady-core-1
hcloud server delete milady-core-2
# ... etc.

# 2. Drop the DB rows. This is SQL, so run it through psql rather than the shell
#    (assumes DATABASE_URL points at the cloud database):
psql "$DATABASE_URL" -c "DELETE FROM docker_nodes WHERE node_id LIKE 'milady-core-%';"
```
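
The "is it drained yet?" condition can also be expressed as a guard in code. This is a sketch under assumptions: `NodeRow` and `legacyCoresDrained` are hypothetical names, and the rows are assumed to have been fetched from `docker_nodes` already.

```typescript
// Minimal row shape needed for the drain check.
interface NodeRow {
  node_id: string;
  allocated_count: number;
}

// Mirrors `SUM(allocated_count) ... WHERE node_id LIKE 'milady-core-%'` being 0:
// true only when no legacy core still hosts a sandbox.
function legacyCoresDrained(rows: NodeRow[]): boolean {
  const stillAllocated = rows
    .filter((r) => r.node_id.startsWith("milady-core-"))
    .reduce((sum, r) => sum + r.allocated_count, 0);
  return stillAllocated === 0;
}
```

Running the delete steps only when this returns `true` avoids destroying a server that still hosts a user sandbox.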

## Followups (not in this initial PR)

- [ ] Terraform module for headscale state (preauth keys, ACLs)
- [ ] Terraform module for the cloudflared tunnel (currently created by hand)
- [ ] Terraform-apply GitHub workflow (`infra/**` path filter)
- [ ] Move the 4 remaining cron paths off the orphan
  `container-control-plane` service onto the daemon-queue pattern
  (`pool-replenish`, `pool-health-check`, `pool-image-rollout`,
  `deployment-monitor`). Once done, retire the
  `packages/cloud-services/container-control-plane/` package entirely.
- [ ] Raise Hetzner Cloud server limit (open ticket) to enable autoscale
  past the default cap of ~10 servers per account.

## Operator runbook

See [`control-plane/README.md`](./control-plane/README.md)
for step-by-step instructions to:

- Bootstrap a brand-new control-plane VM
- Import the existing production VM into Terraform
- Verify state, plan, apply
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
.terraform/
.terraform.lock.hcl
*.tfstate
*.tfstate.backup
crash.log
crash.*.log

# Live tfvars contain SSH public keys + zone IDs; keep them out of git.
# Only `.tfvars.example` lives in the repo.
*.tfvars
!*.tfvars.example
Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
# hetzner/control-plane: Terraform for the eliza Cloud control-plane VMs

This Terraform module manages the **persistent** Hetzner Cloud VM(s) that host
the elizaOS Cloud control plane:

- `eliza-provisioning-worker`: pulls jobs from the `jobs` table and SSHs
  into sandbox cores
- `eliza-agent-router`: subdomain HTTP routing
- `cloudflared`: secure tunnel for `sandboxes.waifu.fun`
- `headscale`: VPN mesh for cross-core agent traffic

The **data plane** (the sandbox cores themselves) is **not** managed here.
Those are provisioned and drained at runtime by
[`node-autoscaler.ts`](../../../../../cloud-shared/src/lib/services/containers/node-autoscaler.ts),
which talks to the Hetzner Cloud API directly. See
[`../ARCHITECTURE.md`](../ARCHITECTURE.md) for the full split.

## Prerequisites

1. **Hetzner Cloud project** with API token (`HCLOUD_TOKEN`).
2. **Cloudflare account** with API token + DNS edit on `elizacloud.ai`
   (`CLOUDFLARE_API_TOKEN`).
3. **Cloudflare R2 bucket** `eliza-terraform-state` for remote state. Generate
   an R2 API token, edit `backend-staging.hcl` / `backend-production.hcl`
   with your CF account ID, then export the R2 token as
   `AWS_ACCESS_KEY_ID` + `AWS_SECRET_ACCESS_KEY` before `terraform init`.
4. **Terraform >= 1.5.0** locally.
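
A small preflight check can catch a missing credential before `terraform init` or `terraform plan` fails partway through. This is a minimal sketch: the four variable names come from the list above, while `REQUIRED_ENV` and `missingEnv` are hypothetical helpers that are not part of the repo.

```typescript
// Environment variables named in the prerequisites above.
const REQUIRED_ENV = [
  "HCLOUD_TOKEN",
  "CLOUDFLARE_API_TOKEN",
  "AWS_ACCESS_KEY_ID",
  "AWS_SECRET_ACCESS_KEY",
] as const;

// Return the names of required variables that are unset or empty.
function missingEnv(env: Record<string, string | undefined>): string[] {
  return REQUIRED_ENV.filter((name) => !env[name]);
}
```

In a Node or Bun script this would typically be called as `missingEnv(process.env)`, aborting with the list of missing names if the result is non-empty.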

## Bootstrap a brand-new control-plane VM (staging)

```bash
cd packages/cloud-infra/cloud/terraform/hetzner/control-plane

# 1. Pull providers + connect remote state.
terraform init -backend-config=backend-staging.hcl

# 2. Copy + fill tfvars.
cp tfvars/staging.tfvars.example tfvars/staging.tfvars
$EDITOR tfvars/staging.tfvars

# 3. Plan + apply.
export HCLOUD_TOKEN=...
export CLOUDFLARE_API_TOKEN=...
terraform plan -var-file=tfvars/staging.tfvars
terraform apply -var-file=tfvars/staging.tfvars

# 4. Output gives you the VM IP. Copy the cloud env file into place:
scp packages/cloud-shared/.env.local root@<vm-ip>:/opt/eliza/cloud/.env.local

# 5. Trigger first deploy from GitHub Actions
#    (workflow: deploy-eliza-provisioning-worker.yml, manual dispatch).
```

## Adopt the existing production VM into Terraform

The current prod manager VM (`89.167.63.246`, historical hostname
`milady`) was created by hand in May 2026. To bring it under Terraform
without recreating it:

1. Look up the Hetzner Cloud server ID with `hcloud server list`.
2. Run `terraform import 'hcloud_server.control_plane["1"]' <id>`.
3. Run a `terraform import` for each existing `hcloud_ssh_key`.

The first plan after import shows the in-place rename `milady → eliza-1`, the
new labels, and the Cloudflare DNS record creation; `user_data` and `image`
diffs are suppressed by `lifecycle { ignore_changes }`. The import is a
one-shot operation: never re-run it.

## What this module does NOT manage (yet)

- Headscale state (preauth keys, ACLs): manual via the `headscale` CLI.
- Cloudflared tunnels: config lives at `/root/.cloudflared/` on the VM and
  is created one-shot via `cloudflared tunnel create`.
- The systemd units: installed by `deploy-eliza-provisioning-worker.yml`
  on every push.
- The actual eliza Cloud sandbox cores (data plane): runtime autoscale.

These are tracked as TODOs in
[`../ARCHITECTURE.md`](../ARCHITECTURE.md#followups-not-in-this-initial-pr).

## Cost

| Item | Details | Monthly (€) |
|---|---|---|
| 1× cpx21 (3 vCPU / 4 GB) | control VM | ~5 |
| 1× IPv4 + IPv6 | floating IP | included |
| Cloudflare R2 state | < 100 KB | 0 |
| **Total per environment** | | **~5** |

A second control-plane VM (HA, currently unused) doubles the VM line. The
**data-plane autoscale** cost is separate and elastic.
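
As a rough worked example of how the fixed and elastic parts combine, the sketch below uses the approximate prices quoted in these docs: ~€5/mo per cpx21 control VM and ~€11/mo per running cpx32 core. `estimateMonthlyEur` is a hypothetical helper. It assumes every core runs for the whole month, so it overstates cost when cores are drained mid-month.

```typescript
const CONTROL_VM_EUR = 5; // ~€5/mo per cpx21 control-plane VM (cost table above)
const DATA_CORE_EUR = 11; // ~€11/mo per running cpx32 core (ARCHITECTURE.md)

// Rough monthly estimate: fixed control-plane cost plus elastic data-plane cost.
function estimateMonthlyEur(controlVms: number, runningCores: number): number {
  return controlVms * CONTROL_VM_EUR + runningCores * DATA_CORE_EUR;
}

const oneVmSixCores = estimateMonthlyEur(1, 6); // 1 * 5 + 6 * 11 = 71
```

For one control VM and six running cores this comes to about €71 per month. The control-plane share stays at about €5 however many cores the autoscaler adds.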
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
bucket = "eliza-terraform-state"
key = "hetzner/control-plane/production.tfstate"
region = "auto"
endpoints = { s3 = "https://23cf6feaeaa541f6a0675053c33da768.r2.cloudflarestorage.com" }
skip_credentials_validation = true
skip_metadata_api_check = true
skip_region_validation = true
skip_requesting_account_id = true
use_path_style = true
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
# Cloudflare R2 backend (S3-compatible) for terraform state.
#
# Set up once per environment:
#   1. Create R2 bucket `eliza-terraform-state` in the elizaOS CF account.
#   2. Generate R2 API token with read/write on that bucket.
#   3. Export AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY pointing at the
#      R2 token before `terraform init`.
#
# Usage:
#   terraform init -backend-config=backend-staging.hcl

bucket = "eliza-terraform-state"
key = "hetzner/control-plane/staging.tfstate"
region = "auto"
endpoints = { s3 = "https://23cf6feaeaa541f6a0675053c33da768.r2.cloudflarestorage.com" }
skip_credentials_validation = true
skip_metadata_api_check = true
skip_region_validation = true
skip_requesting_account_id = true
use_path_style = true
Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
#cloud-config
# Cloud-init bootstrap for eliza control-plane VMs.
#
# At first boot:
#   1. Install system deps (docker, nginx, git, jq, postgresql-client, rsync,
#      unzip) and the Bun runtime
#   2. Add a `deploy` user that the GitHub Actions workflow
#      `.github/workflows/deploy-eliza-provisioning-worker.yml` SSHes into
#   3. Clone the monorepo on `${deploy_branch}` to /opt/eliza
#
# The systemd units for `eliza-provisioning-worker` and `eliza-agent-router`
# are not installed here; the deploy workflow installs them.
#
# Secrets (DATABASE_URL, KV_REST_API_*, STEWARD_*, HCLOUD_TOKEN, etc.) are NOT
# baked in here. The operator copies `cloud/.env.local` to `/opt/eliza/cloud/.env.local`
# in a follow-up step (one-shot, out-of-band) using `scp` or the existing
# `packages/scripts/cloud/admin/bootstrap-provisioning-worker-host.mjs` script.

hostname: ${hostname}
manage_etc_hosts: true

users:
  - name: deploy
    groups: sudo, docker
    shell: /bin/bash
    sudo: ALL=(ALL) NOPASSWD:ALL
    lock_passwd: true
    # Mirror operator SSH keys onto the deploy user. Hetzner only injects
    # `hcloud_ssh_key` entries into root by default, but the
    # `.github/workflows/deploy-eliza-provisioning-worker.yml` workflow SSHes
    # in as `deploy`. Without this list the first auto-deploy would fail.
    # jsonencode() renders the list as a JSON array, which is also a valid
    # YAML sequence, so no template loop (or its whitespace-stripping `~`
    # markers) is needed to keep the list indentation intact.
    ssh_authorized_keys: ${jsonencode(operator_ssh_keys)}

package_update: true
package_upgrade: false

apt:
  sources:
    docker:
      source: "deb [arch=amd64 signed-by=$KEY_FILE] https://download.docker.com/linux/ubuntu $RELEASE stable"
      keyid: 9DC858229FC7DD38854AE2D88D81803C0EBFCD88

packages:
  - curl
  - docker-ce
  - docker-ce-cli
  - containerd.io
  - git
  - jq
  - nginx
  - postgresql-client
  - rsync
  - unzip

write_files:
  - path: /etc/profile.d/eliza-paths.sh
    permissions: "0644"
    content: |
      export PATH="/home/deploy/.bun/bin:$PATH"
      export ELIZA_DEPLOY_BRANCH="${deploy_branch}"

runcmd:
  # Docker is installed via the `apt` block above using the official
  # Docker apt repo (GPG-verified keyring), not `curl get.docker.com | sh`.
  - systemctl enable --now docker

  # Bun runtime for the deploy user. We download a pinned release tarball
  # AND its SHASUMS256 manifest from the same GitHub release, then verify
  # the binary against the published hash before extracting. This avoids
  # the `curl bun.sh/install | bash` supply-chain pattern flagged in the
  # PR review. We still trust GitHub's HTTPS, but the manifest commits
  # the publisher to a specific hash that an attacker can't forge mid-flight.
  #
  # The archive keeps its published file name so that `sha256sum -c` can
  # find the file named in the manifest, and extraction is chained with `&&`
  # so it only runs when the checksum matches.
  - install -d -o deploy -g deploy /home/deploy/.bun /home/deploy/.bun/bin
  - su - deploy -c 'curl -fsSL -o /tmp/bun-linux-x64.zip https://github.com/oven-sh/bun/releases/download/bun-v1.3.13/bun-linux-x64.zip'
  - su - deploy -c 'curl -fsSL -o /tmp/bun-SHASUMS256.txt https://github.com/oven-sh/bun/releases/download/bun-v1.3.13/SHASUMS256.txt'
  - su - deploy -c 'cd /tmp && grep "bun-linux-x64.zip" bun-SHASUMS256.txt | sha256sum -c - && unzip -q bun-linux-x64.zip && install -m 0755 bun-linux-x64/bun /home/deploy/.bun/bin/bun && rm -rf /tmp/bun-linux-x64.zip /tmp/bun-linux-x64 /tmp/bun-SHASUMS256.txt'

  # Repo checkout. The GitHub deploy workflow then takes over for code updates.
  - mkdir -p /opt/eliza && chown deploy:deploy /opt/eliza
  - su - deploy -c 'git clone --depth 200 --branch ${deploy_branch} https://github.com/elizaOS/eliza.git /opt/eliza'

  # Final marker for the bootstrap-warn check that runs after this.
  - echo "eliza-control-plane bootstrap finished $(date -u +%Y-%m-%dT%H:%M:%SZ)" > /var/log/eliza-bootstrap.log
