Commit 3a896d4

feat(cloud): Hetzner control plane IaC + data plane naming + legacy milady-core deprecation
Four coordinated pieces:

1. Terraform module `packages/cloud-infra/cloud/terraform/hetzner/control-plane/`
   declares the persistent VM(s) that host the orchestrator daemon
   (provisioning-worker, agent-router, headscale, cloudflared). Uses the
   hetznercloud/hcloud + cloudflare providers, with Cloudflare R2 as the S3
   state backend. Includes a cloud-init bootstrap template, tfvars examples,
   and a README walkthrough for both new-host bootstrap and `terraform import`
   of the existing prod VM (89.167.63.246) into state.

2. Data-plane naming: `node-<hex>` becomes `eliza-core-<hex>` going forward.
   `generateNodeId()` now sources entropy from `crypto.getRandomValues()`
   instead of `Math.random().toString(16)`, which silently strips trailing
   zeros and could produce short or colliding suffixes. That matters because
   `node_id` is UNIQUE in `docker_nodes`.

3. Data-plane location default fixed: Hetzner deprecated cpx32 on `ash`
   (Ashburn), so the previous `defaultHcloudLocation = "ash"` default fails
   with "unsupported location for server type". Flipped to `fsn1` to match
   the actual prod fleet.

4. Migration 0132 disables the 6 legacy `milady-core-*` rows
   (`enabled=false`, `capacity=8`). They were inserted by hand in 2026-03
   with `capacity=100` (unrealistic for cpx32), have been health-check
   offline for weeks, and are now ignored by the autoscaler. Existing
   sandboxes keep running on the underlying Docker daemons until their next
   user-triggered restart, at which point the daemon provisions a
   replacement on a fresh autoscaled core. Ops follow-up (delete Hetzner
   servers + DB rows) is documented in the architecture markdown.

ARCHITECTURE.md formalises the two-tier model (static control plane vs
elastic data plane) so future ops actions have a clear runbook.

Followups (separate PRs): Terraform modules for headscale state + the
cloudflared tunnel; terraform-apply GitHub workflow; moving the 4 remaining
cron paths off the orphan container-control-plane service onto the
daemon-queue pattern; raising the Hetzner Cloud server-count limit.

Tests:
- 2 new sociable tests for generateNodeId() asserting the prefix + exactly
  8 lowercase hex chars + uniqueness across 50 calls.
  All 5 node-autoscaler tests pass.

Out-of-band ops actions needed before merging to production:
- Generate R2 API token + create the bucket entry (already done:
  eliza-terraform-state in WEUR)
- Set environment secrets used by the daemon: HETZNER_CLOUD_API_KEY,
  CONTAINERS_AUTOSCALE_PUBLIC_SSH_KEY (already done on staging VM)
- Open Hetzner ticket to raise server-count limit past 10 so autoscale can
  actually create replacement cores
1 parent 3231145 commit 3a896d4

18 files changed

Lines changed: 614 additions & 5 deletions

Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
# Hetzner Control Plane vs Data Plane

eliza Cloud runs on a two-tier Hetzner Cloud setup. This doc nails down the
split so we stop treating manually-created VMs as "infrastructure-by-prayer".

## Layers

```
┌──────────────────────────────────────────────────────────────┐
│  Tier 1 — Control plane (static, 1-2 VMs, Terraform)         │
│                                                              │
│    eliza-cp-production-1 (Hetzner cpx21, fsn1)               │
│    ├── eliza-provisioning-worker    (systemd, queue consumer)│
│    ├── eliza-agent-router           (systemd, HTTP routing)  │
│    ├── headscale                    (VPN mesh)               │
│    ├── cloudflared tunnel           (public ingress)         │
│    ├── nginx                        (reverse proxy)          │
│    └── (optional: grafana/prometheus)                        │
│                                                              │
│    Lifecycle: long-lived. Replaced on demand, not autoscaled.│
│    Cost: ~€5/mo per VM (cpx21).                              │
└──────────────────────────────────────────────────────────────┘
                               │ enqueue / SSH

┌──────────────────────────────────────────────────────────────┐
│  Tier 2 — Data plane (elastic, N cores, runtime autoscale)   │
│                                                              │
│    eliza-core-<hex> (Hetzner cpx32, fsn1)                    │
│    ├── Docker daemon                                         │
│    └── eliza-sandbox containers × N                          │
│                                                              │
│    Lifecycle: created/drained by node-autoscaler.ts based on │
│    real demand. Server limit: ~25 (Hetzner default).         │
│    Cost: elastic (~€11/mo per running cpx32).                │
└──────────────────────────────────────────────────────────────┘
```

## Why two tiers

| Concern                 | Control plane                                     | Data plane                       |
|-------------------------|---------------------------------------------------|----------------------------------|
| **Provisioning**        | Terraform (one-shot)                              | Runtime API (node-autoscaler.ts) |
| **Lifecycle**           | Persistent                                        | Ephemeral                        |
| **State**               | Has local state (headscale DB, cloudflared creds) | Stateless                        |
| **Failure mode**        | Page someone                                      | Replace automatically            |
| **Cost predictability** | Fixed monthly                                     | Elastic                          |
| **What lives here**     | Orchestrator, routing, monitoring                 | Just Docker + agents             |

The split prevents the "control plane melts with the data plane during a
traffic spike" failure mode. Pulling sandboxes off the data plane is the
autoscaler's job; the orchestrator that issues drain commands must stay up
to coordinate it.

## Code ↔ infrastructure mapping

| Component | Code | Infra |
|---|---|---|
| Control plane VM | [`packages/scripts/cloud/admin/daemons/provisioning-worker.ts`](../../../../scripts/cloud/admin/daemons/provisioning-worker.ts) | [Terraform: `control-plane/`](./control-plane/) |
| Agent router | [`packages/scripts/cloud/admin/daemons/agent-router.ts`](../../../../scripts/cloud/admin/daemons/agent-router.ts) | systemd unit on control-plane VM |
| Data plane autoscaler | [`packages/cloud-shared/src/lib/services/containers/node-autoscaler.ts`](../../../../cloud-shared/src/lib/services/containers/node-autoscaler.ts) | Hetzner Cloud API at runtime |
| Sandbox provisioning | [`packages/cloud-shared/src/lib/services/docker-sandbox-provider.ts`](../../../../cloud-shared/src/lib/services/docker-sandbox-provider.ts) | SSH from control plane to data plane |

## Naming convention

| Layer | Prefix | Example | Where it's set |
|---|---|---|---|
| Control plane VM | `eliza-cp-<env>-<n>` | `eliza-cp-production-1` | Terraform `var.environment` |
| Data plane node (NEW) | `eliza-core-<hex>` | `eliza-core-38ea87b1` | [`generateNodeId()`](../../../../cloud-shared/src/lib/services/containers/node-autoscaler.ts) |
| Data plane node (LEGACY) | `milady-core-<n>` | `milady-core-1` | DEPRECATED — see [Legacy migration](#legacy-milady-core--migration) |

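The `<hex>` suffix has to be exactly 8 lowercase hex characters and unique, since `node_id` is UNIQUE in `docker_nodes`. The sketch below illustrates why `Math.random().toString(16)` cannot guarantee that, and how drawing bytes from `crypto.getRandomValues()` can. It is a minimal illustration, not the code in `node-autoscaler.ts`: the constant `NODE_ID_PREFIX`, the helper `legacyRandomSuffix`, and the choice of 4 random bytes are assumptions made for this example.

```typescript
// Hypothetical constant; the real prefix is defined in node-autoscaler.ts.
const NODE_ID_PREFIX = "eliza-core-";

// Old approach, shown only to illustrate the failure mode: converting a
// float to hex drops trailing zeros, so the suffix length is not guaranteed.
function legacyRandomSuffix(value: number = Math.random()): string {
  return value.toString(16).slice(2, 10);
}

// New approach (sketch): 4 random bytes, each zero-padded to 2 hex chars,
// always yields exactly 8 lowercase hex characters.
function generateNodeId(): string {
  const bytes = new Uint8Array(4);
  crypto.getRandomValues(bytes);
  const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");
  return `${NODE_ID_PREFIX}${hex}`;
}

// 0.5 is "0.8" in hex, so only a single digit survives the slice.
console.log(legacyRandomSuffix(0.5)); // "8"
console.log(generateNodeId());
```

Padding each byte individually is what keeps the length fixed: a byte such as `0x0a` would otherwise render as `a` and shorten the suffix.
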
## Legacy `milady-core-*` migration

Pre-2026-05 the data plane was 6 manually-created `milady-core-*` VMs (Hetzner
cpx32 in fsn1) inserted by hand into `docker_nodes` with `capacity = 100`.
By 2026-05-22:

- All 6 cores were `status: offline` (SSH health-check failing for weeks)
- Several user sandboxes still ran on the underlying Docker daemons
- The cloud autoscaler couldn't account for them

Migration 0132 (`0132_legacy_milady_cores_disable.sql`) flips them to
`enabled = false` and sets `capacity = 8`. This:

1. Removes them from autoscaler capacity decisions
2. Stops the health-check noise
3. Lets the autoscaler spin up replacement `eliza-core-<hex>` nodes on demand

Existing sandboxes keep running until next restart. On user-triggered
restart / recreate, the daemon provisions them on a fresh autoscaled core.

Once `SELECT SUM(allocated_count) FROM docker_nodes WHERE node_id LIKE 'milady-core-%'`
is `0`, ops can:

```bash
# 1. Delete the Hetzner Cloud servers (one-time, via the Hetzner console or the CLI):
hcloud server delete milady-core-1
hcloud server delete milady-core-2
# ... etc.

# 2. Drop the DB rows. This step is SQL, so run it through psql
#    (assumes DATABASE_URL points at the cloud database):
psql "$DATABASE_URL" -c "DELETE FROM docker_nodes WHERE node_id LIKE 'milady-core-%';"
```

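The "is it safe to delete yet" gate above can also be expressed in code, for example as a pre-flight check in an ops script. The following is a rough TypeScript sketch under stated assumptions: the `DockerNodeRow` shape contains only the columns named in this document, and `legacyAllocatedCount` / `canRetireLegacyCores` are hypothetical helpers, not functions that exist in the repository.

```typescript
// Hypothetical row shape: only the docker_nodes columns named in this doc.
interface DockerNodeRow {
  node_id: string;
  enabled: boolean;
  allocated_count: number;
}

const LEGACY_PREFIX = "milady-core-";

// Mirrors: SELECT SUM(allocated_count) FROM docker_nodes
//          WHERE node_id LIKE 'milady-core-%'
function legacyAllocatedCount(rows: DockerNodeRow[]): number {
  return rows
    .filter((row) => row.node_id.startsWith(LEGACY_PREFIX))
    .reduce((sum, row) => sum + row.allocated_count, 0);
}

// Legacy cores may be deleted only once no sandbox is allocated on them.
function canRetireLegacyCores(rows: DockerNodeRow[]): boolean {
  return legacyAllocatedCount(rows) === 0;
}

const sample: DockerNodeRow[] = [
  { node_id: "milady-core-1", enabled: false, allocated_count: 2 },
  { node_id: "milady-core-2", enabled: false, allocated_count: 0 },
  { node_id: "eliza-core-38ea87b1", enabled: true, allocated_count: 5 },
];

console.log(legacyAllocatedCount(sample)); // 2
console.log(canRetireLegacyCores(sample)); // false
```

Allocations on `eliza-core-*` nodes are deliberately ignored, so a busy autoscaled fleet does not block the cleanup.
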
## Followups (not in this initial PR)

- [ ] Terraform module for headscale state (preauth keys, ACLs)
- [ ] Terraform module for the cloudflared tunnel (currently created by hand)
- [ ] Terraform-apply GitHub workflow (`infra/**` path filter)
- [ ] Move the 4 remaining cron paths off the orphan
      `container-control-plane` service onto the daemon-queue pattern
      (`pool-replenish`, `pool-health-check`, `pool-image-rollout`,
      `deployment-monitor`). Once done, retire the
      `packages/cloud-services/container-control-plane/` package entirely.
- [ ] Raise Hetzner Cloud server limit (open ticket) to enable autoscale
      past the default cap of ~10 servers per account.

## Operator runbook

See [`control-plane/README.md`](./control-plane/README.md)
for the step-by-step:

- Bootstrap a brand-new control-plane VM
- Import the existing production VM into Terraform
- Verify state, plan, apply
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
.terraform/
.terraform.lock.hcl
*.tfstate
*.tfstate.backup
crash.log
crash.*.log

# Live tfvars contain SSH public keys + zone IDs — keep them out of git.
# Only `.tfvars.example` lives in the repo.
*.tfvars
!*.tfvars.example
Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
# hetzner/control-plane — Terraform for the eliza Cloud control-plane VMs

This Terraform module manages the **persistent** Hetzner Cloud VM(s) that host
the elizaOS Cloud control-plane:

- `eliza-provisioning-worker` — pulls jobs from the `jobs` table and SSHs
  into sandbox cores
- `eliza-agent-router` — subdomain HTTP routing
- `cloudflared` — secure tunnel for `sandboxes.waifu.fun`
- `headscale` — VPN mesh for cross-core agent traffic

The **data plane** (the sandbox cores themselves) is **not** managed here —
those are provisioned and drained at runtime by
[`node-autoscaler.ts`](../../../../../cloud-shared/src/lib/services/containers/node-autoscaler.ts)
which talks to the Hetzner Cloud API directly. See
[`../ARCHITECTURE.md`](../ARCHITECTURE.md) for the full split.

## Prerequisites

1. **Hetzner Cloud project** with API token (`HCLOUD_TOKEN`).
2. **Cloudflare account** with API token + DNS edit on `elizacloud.ai`
   (`CLOUDFLARE_API_TOKEN`).
3. **Cloudflare R2 bucket** `eliza-terraform-state` for remote state. Generate
   an R2 API token, edit `backend-staging.hcl` / `backend-production.hcl`
   with your CF account ID, then export the R2 token as
   `AWS_ACCESS_KEY_ID` + `AWS_SECRET_ACCESS_KEY` before `terraform init`.
4. **Terraform >= 1.5.0** locally.

## Bootstrap a brand-new control-plane VM (staging)

```bash
cd packages/cloud-infra/cloud/terraform/hetzner/control-plane

# 1. Pull providers + connect remote state.
terraform init -backend-config=backend-staging.hcl

# 2. Copy + fill tfvars.
cp tfvars/staging.tfvars.example tfvars/staging.tfvars
$EDITOR tfvars/staging.tfvars

# 3. Plan + apply.
export HCLOUD_TOKEN=...
export CLOUDFLARE_API_TOKEN=...
terraform plan -var-file=tfvars/staging.tfvars
terraform apply -var-file=tfvars/staging.tfvars

# 4. Output gives you the VM IP. Copy the cloud env file into place.
#    The source path is relative to the repo root, not the module directory:
scp "$(git rev-parse --show-toplevel)/packages/cloud-shared/.env.local" \
  root@<vm-ip>:/opt/eliza/cloud/.env.local

# 5. Trigger first deploy from GitHub Actions
#    (workflow: deploy-eliza-provisioning-worker.yml, manual dispatch).
```

## Adopt the existing production VM into Terraform

The current prod manager VM (`89.167.63.246`, hostname `milady`) was
created by hand via `packages/scripts/cloud/admin/bootstrap-provisioning-worker-host.mjs`
on May 7th. To bring it under Terraform without recreating it:

```bash
# 1. Look up the Hetzner Cloud server ID:
hcloud server list   # find the one with IP 89.167.63.246

# 2. Import the existing resource into state:
terraform init -backend-config=backend-production.hcl
terraform import \
  -var-file=tfvars/production.tfvars \
  'hcloud_server.control_plane["1"]' \
  <SERVER_ID>

# 3. Run `terraform plan` and adjust variables until the diff is empty.
#    Common drift: labels, ssh_keys (manually added during initial setup).
```

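The import address uses the string key `"1"` because the module builds its `for_each` set with `toset([for i in range(var.control_plane_count) : tostring(i + 1)])`, so instances are keyed `"1"`, `"2"`, and so on rather than by a zero-based index. The TypeScript sketch below mirrors that expression purely to show which keys, server names, and resource addresses result. The function names are hypothetical and exist only for this illustration.

```typescript
// Mirrors: toset([for i in range(count) : tostring(i + 1)])
function controlPlaneKeys(count: number): string[] {
  return Array.from({ length: count }, (_, i) => String(i + 1));
}

// Mirrors: "eliza-cp-${var.environment}-${each.value}"
function controlPlaneServerName(environment: string, key: string): string {
  return `eliza-cp-${environment}-${key}`;
}

// The resource address that `terraform import` expects for a given key.
function controlPlaneAddress(key: string): string {
  return `hcloud_server.control_plane["${key}"]`;
}

for (const key of controlPlaneKeys(2)) {
  console.log(controlPlaneAddress(key), "->", controlPlaneServerName("production", key));
}
// hcloud_server.control_plane["1"] -> eliza-cp-production-1
// hcloud_server.control_plane["2"] -> eliza-cp-production-2
```
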
## What this module does NOT manage (yet)

- Headscale state (preauth keys, ACLs) — manual via `headscale` CLI.
- Cloudflared tunnels — config lives at `/root/.cloudflared/` on the VM and
  is created via `cloudflared tunnel create` one-shot.
- The systemd units — installed by `deploy-eliza-provisioning-worker.yml`
  on every push.
- The actual eliza Cloud sandbox cores (data plane) — runtime autoscale.

These are tracked as TODOs in
[`../ARCHITECTURE.md`](../ARCHITECTURE.md#followups-not-in-this-initial-pr).

## Cost

| Component                    | Resource    | Monthly (€) |
|------------------------------|-------------|-------------|
| 1× cpx21 (3 vCPU / 4 GB)     | control VM  | ~5          |
| 1× IPv4 + IPv6               | floating IP | included    |
| Cloudflare R2 state          | < 100 KB    | 0           |
| **Total per environment**    |             | **~5**      |

A 2nd control-plane VM (HA, currently unused) doubles the line. The
**data-plane autoscale** cost is separate and elastic.
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
bucket                      = "eliza-terraform-state"
key                         = "hetzner/control-plane/production.tfstate"
region                      = "auto"
endpoints                   = { s3 = "https://23cf6feaeaa541f6a0675053c33da768.r2.cloudflarestorage.com" }
skip_credentials_validation = true
skip_metadata_api_check     = true
skip_region_validation      = true
skip_requesting_account_id  = true
use_path_style              = true
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
# Cloudflare R2 backend (S3-compatible) for terraform state.
#
# Set up once per environment:
#   1. Create R2 bucket `eliza-terraform-state` in the elizaOS CF account.
#   2. Generate R2 API token with read/write on that bucket.
#   3. Export AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY pointing at the
#      R2 token before `terraform init`.
#
# Usage:
#   terraform init -backend-config=backend-staging.hcl

bucket                      = "eliza-terraform-state"
key                         = "hetzner/control-plane/staging.tfstate"
region                      = "auto"
endpoints                   = { s3 = "https://23cf6feaeaa541f6a0675053c33da768.r2.cloudflarestorage.com" }
skip_credentials_validation = true
skip_metadata_api_check     = true
skip_region_validation      = true
skip_requesting_account_id  = true
use_path_style              = true
Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
#cloud-config
# Cloud-init bootstrap for eliza control-plane VMs.
#
# At first boot:
#   1. Install system deps (docker, nginx, node, jq, postgresql-client, ssh)
#   2. Add a `deploy` user that the GitHub Actions workflow
#      `.github/workflows/deploy-eliza-provisioning-worker.yml` SSHes into
#   3. Clone the monorepo on `${deploy_branch}` to /opt/eliza
#   4. Install systemd units for `eliza-provisioning-worker` and
#      `eliza-agent-router`
#
# Secrets (DATABASE_URL, KV_REST_API_*, STEWARD_*, HCLOUD_TOKEN, etc.) are NOT
# baked in here. The operator copies `cloud/.env.local` to `/opt/eliza/cloud/.env.local`
# in a follow-up step (one-shot, out-of-band) using `scp` or the existing
# `packages/scripts/cloud/admin/bootstrap-provisioning-worker-host.mjs` script.

hostname: ${hostname}
manage_etc_hosts: true

users:
  - name: deploy
    groups: sudo, docker
    shell: /bin/bash
    sudo: ALL=(ALL) NOPASSWD:ALL
    lock_passwd: true

package_update: true
package_upgrade: false

packages:
  - curl
  - git
  - jq
  - nginx
  - postgresql-client
  - rsync
  - unzip

write_files:
  - path: /etc/profile.d/eliza-paths.sh
    permissions: "0644"
    content: |
      export PATH="/home/deploy/.bun/bin:$PATH"
      export ELIZA_DEPLOY_BRANCH="${deploy_branch}"

runcmd:
  # Docker (official convenience installer — pin sha if you want it
  # auditable; we keep it simple for the bootstrap).
  - curl -fsSL https://get.docker.com | sh
  - systemctl enable --now docker

  # Bun runtime for the deploy user (the daemons run under bun/tsx).
  - su - deploy -c 'curl -fsSL https://bun.sh/install | bash -s "bun-v1.3.13"'

  # Repo checkout — the GitHub deploy workflow then takes over for code updates.
  - mkdir -p /opt/eliza && chown deploy:deploy /opt/eliza
  - su - deploy -c 'git clone --depth 200 --branch ${deploy_branch} https://github.com/elizaOS/eliza.git /opt/eliza'

  # Final marker for the bootstrap-warn check that runs after this.
  - echo "eliza-control-plane bootstrap finished $(date -u +%Y-%m-%dT%H:%M:%SZ)" > /var/log/eliza-bootstrap.log
Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
locals {
  # Tags applied to every Hetzner Cloud resource managed here. Mirrors the
  # data-plane convention (`managed-by: eliza-cloud`) used by the runtime
  # autoscaler so a single search in the Hetzner Console reveals everything.
  common_labels = {
    "managed-by"  = "eliza-cloud"
    "tier"        = "control-plane"
    "environment" = var.environment
  }
}

resource "hcloud_ssh_key" "operators" {
  for_each = { for idx, key in var.ssh_public_keys : idx => key }

  name       = "eliza-cp-${var.environment}-op-${each.key}"
  public_key = each.value
  labels     = local.common_labels
}

resource "hcloud_server" "control_plane" {
  for_each = toset([for i in range(var.control_plane_count) : tostring(i + 1)])

  name        = "eliza-cp-${var.environment}-${each.value}"
  location    = var.hcloud_location
  server_type = var.hcloud_server_type
  image       = var.hcloud_image
  ssh_keys    = [for k in hcloud_ssh_key.operators : k.id]
  labels = merge(local.common_labels, {
    "control-plane-index" = each.value
  })

  user_data = templatefile("${path.module}/cloud-init/bootstrap.yaml.tftpl", {
    hostname      = "eliza-cp-${var.environment}-${each.value}"
    deploy_branch = var.deploy_branch
  })

  # Keep server alive across refactors: changing labels or user_data
  # shouldn't recreate the box, only update in place where possible.
  lifecycle {
    ignore_changes = [
      user_data, # bootstrap runs once at first boot
      image,     # updating image rebuilds — explicit `terraform taint` to opt in
    ]
  }
}

resource "cloudflare_dns_record" "control_plane" {
  for_each = hcloud_server.control_plane

  zone_id = var.cloudflare_zone_id
  name    = "${var.control_plane_hostname_prefix}-${var.environment}-${each.key}.elizacloud.ai"
  type    = "A"
  content = each.value.ipv4_address
  ttl     = 60
  proxied = false
  comment = "eliza control-plane VM ${each.value.name} (managed by terraform/hetzner/control-plane)"
}
