elizaOS · lalalune · May 22, 2026 · May 22, 2026 · May 22, 2026 · May 22, 2026
diff --git a/packages/cloud-infra/cloud/terraform/hetzner/ARCHITECTURE.md b/packages/cloud-infra/cloud/terraform/hetzner/ARCHITECTURE.md
@@ -0,0 +1,124 @@
+# Hetzner Control Plane vs Data Plane
+
+eliza Cloud runs on a two-tier Hetzner Cloud setup. This doc nails down the
+split so we stop treating manually-created VMs as "infrastructure-by-prayer".
+
+## Layers
+
+```
+┌──────────────────────────────────────────────────────────────┐
+│  Tier 1 — Control plane (static, 1-2 VMs, Terraform)        │
+│                                                              │
+│   eliza-1   (Hetzner cpx21, fsn1)             │
+│     ├── eliza-provisioning-worker  (systemd, queue consumer)│
+│     ├── eliza-agent-router         (systemd, HTTP routing)  │
+│     ├── headscale                  (VPN mesh)               │
+│     ├── cloudflared tunnel         (public ingress)         │
+│     ├── nginx                      (reverse proxy)          │
+│     └── (optional: grafana/prometheus)                      │
+│                                                              │
+│   Lifecycle: long-lived. Replaced on demand, not autoscaled.│
+│   Cost: ~€5/mo per VM (cpx21).                              │
+└──────────────────────────────────────────────────────────────┘
+                              │ enqueue / SSH
+                              ▼
+┌──────────────────────────────────────────────────────────────┐
+│  Tier 2 — Data plane (elastic, N cores, runtime autoscale)  │
+│                                                              │
+│   eliza-core-<hex>   (Hetzner cpx32, fsn1)                  │
+│     ├── Docker daemon                                       │
+│     └── eliza-sandbox containers × N                        │
+│                                                              │
+│   Lifecycle: created/drained by node-autoscaler.ts based on │
+│   real demand. Server limit: ~25 (Hetzner default).         │
+│   Cost: elastic (~€11/mo per running cpx32).                │
+└──────────────────────────────────────────────────────────────┘
+```
+
+## Why two tiers
+
+| Concern              | Control plane          | Data plane                |
+|----------------------|------------------------|---------------------------|
+| **Provisioning**     | Terraform (one-shot)   | Runtime API (node-autoscaler.ts) |
+| **Lifecycle**        | Persistent             | Ephemeral                 |
+| **State**            | Has local state (headscale DB, cloudflared creds) | Stateless |
+| **Failure mode**     | Page someone           | Replace automatically     |
+| **Cost predictability** | Fixed monthly       | Elastic                   |
+| **What lives here**  | Orchestrator, routing, monitoring | Just Docker + agents |
+
+The split prevents the "control plane melts with the data plane during a
+traffic spike" failure mode. Pulling sandboxes off the data plane is the
+autoscaler's job; the orchestrator that issues drain commands must stay up
+to coordinate it.
+
+## Code ↔ infrastructure mapping
+
+| Component | Code | Infra |
+|---|---|---|
+| Control plane VM | [`packages/scripts/cloud/admin/daemons/provisioning-worker.ts`](../../../../scripts/cloud/admin/daemons/provisioning-worker.ts) | [Terraform: `control-plane/`](./control-plane/) |
+| Agent router | [`packages/scripts/cloud/admin/daemons/agent-router.ts`](../../../../scripts/cloud/admin/daemons/agent-router.ts) | systemd unit on control-plane VM |
+| Data plane autoscaler | [`packages/cloud-shared/src/lib/services/containers/node-autoscaler.ts`](../../../../cloud-shared/src/lib/services/containers/node-autoscaler.ts) | Hetzner Cloud API at runtime |
+| Sandbox provisioning | [`packages/cloud-shared/src/lib/services/docker-sandbox-provider.ts`](../../../../cloud-shared/src/lib/services/docker-sandbox-provider.ts) | SSH from control plane to data plane |
+
+## Naming convention
+
+| Layer | Prefix | Example | Where it's set |
+|---|---|---|---|
+| Control plane VM | `eliza-<n>` | `eliza-1` | Terraform `hcloud_server.control_plane` |
+| Data plane node (NEW) | `eliza-core-<hex>` | `eliza-core-38ea87b1` | [`generateNodeId()`](../../../../cloud-shared/src/lib/services/containers/node-autoscaler.ts) |
+| Data plane node (LEGACY) | `milady-core-<n>` | `milady-core-1` | DEPRECATED — see [Legacy migration](#legacy-milady-core-migration) |
+
+## Legacy `milady-core-*` migration
+
+Pre-2026-05 the data plane was 6 manually-created `milady-core-*` VMs (Hetzner
+cpx32 in fsn1) inserted by-hand into `docker_nodes` with `capacity = 100`.
+By 2026-05-22:
+
+- All 6 cores were `status: offline` (SSH health-check failing for weeks)
+- Several user sandboxes still ran on the underlying Docker daemons
+- The cloud autoscaler couldn't account for them
+
+Migration 0132 (`0132_legacy_milady_cores_disable.sql`) flips them to
+`enabled = false` + fixes `capacity = 8`. This:
+
+1. Removes them from autoscaler capacity decisions
+2. Stops the health-check noise
+3. Lets the autoscaler spin up replacement `eliza-core-<hex>` nodes on demand
+
+Existing sandboxes keep running until next restart. On user-triggered
+restart / recreate, the daemon provisions them on a fresh autoscaled core.
+
+Once `SELECT SUM(allocated_count) FROM docker_nodes WHERE node_id LIKE 'milady-core-%'`
+is `0`, ops can:
+
+```bash
+# 1. Delete the Hetzner Cloud servers (one-time, via Hetzner console or):
+hcloud server delete milady-core-1
+hcloud server delete milady-core-2
+# ... etc.
+
+# 2. Drop the DB rows:
+DELETE FROM docker_nodes WHERE node_id LIKE 'milady-core-%';
+```
+
+## Followups (not in this initial PR)
+
+- [ ] Terraform module for headscale state (preauth keys, ACLs)
+- [ ] Terraform module for the cloudflared tunnel (currently created by-hand)
+- [ ] Terraform-apply GitHub workflow (`infra/**` path filter)
+- [ ] Move the 4 remaining cron paths off the orphan
+      `container-control-plane` service onto the daemon-queue pattern
+      (`pool-replenish`, `pool-health-check`, `pool-image-rollout`,
+      `deployment-monitor`). Once done, retire the
+      `packages/cloud-services/container-control-plane/` package entirely.
+- [ ] Raise Hetzner Cloud server limit (open ticket) to enable autoscale
+      past the default cap of ~10 servers per account.
+
+## Operator runbook
+
+See [`control-plane/README.md`](./control-plane/README.md)
+for the step-by-step:
+
+- Bootstrap a brand-new control-plane VM
+- Import the existing production VM into Terraform
+- Verify state, plan, apply
diff --git a/packages/cloud-infra/cloud/terraform/hetzner/control-plane/.gitignore b/packages/cloud-infra/cloud/terraform/hetzner/control-plane/.gitignore
@@ -0,0 +1,11 @@
+.terraform/
+.terraform.lock.hcl
+*.tfstate
+*.tfstate.backup
+crash.log
+crash.*.log
+
+# Live tfvars contain SSH public keys + zone IDs — keep them out of git.
+# Only `.tfvars.example` lives in the repo.
+*.tfvars
+!*.tfvars.example
diff --git a/packages/cloud-infra/cloud/terraform/hetzner/control-plane/README.md b/packages/cloud-infra/cloud/terraform/hetzner/control-plane/README.md
@@ -0,0 +1,88 @@
+# hetzner/control-plane — Terraform for the eliza Cloud control-plane VMs
+
+This Terraform module manages the **persistent** Hetzner Cloud VM(s) that host
+the elizaOS Cloud control-plane:
+
+- `eliza-provisioning-worker` — pulls jobs from the `jobs` table and SSHs
+  into sandbox cores
+- `eliza-agent-router` — subdomain HTTP routing
+- `cloudflared` — secure tunnel for `sandboxes.waifu.fun`
+- `headscale` — VPN mesh for cross-core agent traffic
+
+The **data plane** (the sandbox cores themselves) is **not** managed here —
+those are provisioned and drained at runtime by
+[`node-autoscaler.ts`](../../../../../cloud-shared/src/lib/services/containers/node-autoscaler.ts)
+which talks to the Hetzner Cloud API directly. See
+[`../ARCHITECTURE.md`](../ARCHITECTURE.md) for the full split.
+
+## Prerequisites
+
+1. **Hetzner Cloud project** with API token (`HCLOUD_TOKEN`).
+2. **Cloudflare account** with API token + DNS edit on `elizacloud.ai`
+   (`CLOUDFLARE_API_TOKEN`).
+3. **Cloudflare R2 bucket** `eliza-terraform-state` for remote state. Generate
+   an R2 API token, edit `backend-staging.hcl` / `backend-production.hcl`
+   with your CF account ID, then export the R2 token as
+   `AWS_ACCESS_KEY_ID` + `AWS_SECRET_ACCESS_KEY` before `terraform init`.
+4. **Terraform >= 1.5.0** locally.
+
+## Bootstrap a brand-new control-plane VM (staging)
+
+```bash
+cd packages/cloud-infra/cloud/terraform/hetzner/control-plane
+
+# 1. Pull providers + connect remote state.
+terraform init -backend-config=backend-staging.hcl
+
+# 2. Copy + fill tfvars.
+cp tfvars/staging.tfvars.example tfvars/staging.tfvars
+$EDITOR tfvars/staging.tfvars
+
+# 3. Plan + apply.
+export HCLOUD_TOKEN=...
+export CLOUDFLARE_API_TOKEN=...
+terraform plan -var-file=tfvars/staging.tfvars
+terraform apply -var-file=tfvars/staging.tfvars
+
+# 4. Output gives you the VM IP. Copy the cloud env file into place:
+scp packages/cloud-shared/.env.local root@<vm-ip>:/opt/eliza/cloud/.env.local
+
+# 5. Trigger first deploy from GitHub Actions
+#    (workflow: deploy-eliza-provisioning-worker.yml, manual dispatch).
+```
+
+## Adopt the existing production VM into Terraform
+
+The current prod manager VM (`89.167.63.246`, historical hostname
+`milady`) was created by hand in May 2026. To bring it under Terraform
+without recreating it, look up the Hetzner Cloud server ID
+(`hcloud server list`), then `terraform import 'hcloud_server.control_plane["1"]' <id>`
+plus a `terraform import` for each existing `hcloud_ssh_key`. The first
+plan after import shows the in-place rename `milady → eliza-1`, the new
+labels, and the Cloudflare DNS record creation; `user_data` and `image`
+diffs are suppressed by `lifecycle { ignore_changes }`. One-shot — never
+re-run.
+
+## What this module does NOT manage (yet)
+
+- Headscale state (preauth keys, ACLs) — manual via `headscale` CLI.
+- Cloudflared tunnels — config lives at `/root/.cloudflared/` on the VM and
+  is created via `cloudflared tunnel create` one-shot.
+- The systemd units — installed by `deploy-eliza-provisioning-worker.yml`
+  on every push.
+- The actual eliza Cloud sandbox cores (data plane) — runtime autoscale.
+
+These are tracked as TODOs in
+[`../ARCHITECTURE.md`](../ARCHITECTURE.md#followups).
+
+## Cost
+
+| Component                    | Resource    | Monthly (€) |
+|------------------------------|-------------|-------------|
+| 1× cpx21 (3 vCPU / 4 GB)     | control VM  | ~5          |
+| 1× IPv4 + IPv6               | floating IP | included    |
+| Cloudflare R2 state          | < 100 KB    | 0           |
+| **Total per environment**    |             | **~5**      |
+
+A 2nd control-plane VM (HA, currently unused) doubles the line. The
+**data-plane autoscale** cost is separate and elastic.
diff --git a/packages/cloud-infra/cloud/terraform/hetzner/control-plane/backend-production.hcl b/packages/cloud-infra/cloud/terraform/hetzner/control-plane/backend-production.hcl
@@ -0,0 +1,9 @@
+bucket                      = "eliza-terraform-state"
+key                         = "hetzner/control-plane/production.tfstate"
+region                      = "auto"
+endpoints                   = { s3 = "https://23cf6feaeaa541f6a0675053c33da768.r2.cloudflarestorage.com" }
+skip_credentials_validation = true
+skip_metadata_api_check     = true
+skip_region_validation      = true
+skip_requesting_account_id  = true
+use_path_style              = true
diff --git a/packages/cloud-infra/cloud/terraform/hetzner/control-plane/backend-staging.hcl b/packages/cloud-infra/cloud/terraform/hetzner/control-plane/backend-staging.hcl
@@ -0,0 +1,20 @@
+# Cloudflare R2 backend (S3-compatible) for terraform state.
+#
+# Set up once per environment:
+#   1. Create R2 bucket `eliza-terraform-state` in the elizaOS CF account.
+#   2. Generate R2 API token with read/write on that bucket.
+#   3. Export AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY pointing at the
+#      R2 token before `terraform init`.
+#
+# Usage:
+#   terraform init -backend-config=backend-staging.hcl
+
+bucket                      = "eliza-terraform-state"
+key                         = "hetzner/control-plane/staging.tfstate"
+region                      = "auto"
+endpoints                   = { s3 = "https://23cf6feaeaa541f6a0675053c33da768.r2.cloudflarestorage.com" }
+skip_credentials_validation = true
+skip_metadata_api_check     = true
+skip_region_validation      = true
+skip_requesting_account_id  = true
+use_path_style              = true
diff --git a/packages/cloud-infra/cloud/terraform/hetzner/control-plane/cloud-init/bootstrap.yaml.tftpl b/packages/cloud-infra/cloud/terraform/hetzner/control-plane/cloud-init/bootstrap.yaml.tftpl
@@ -0,0 +1,85 @@
+#cloud-config
+# Cloud-init bootstrap for eliza control-plane VMs.
+#
+# At first boot:
+#   1. Install system deps (docker, nginx, node, jq, postgresql-client, ssh)
+#   2. Add a `deploy` user that the GitHub Actions workflow
+#      `.github/workflows/deploy-eliza-provisioning-worker.yml` SSHes into
+#   3. Clone the monorepo on `${deploy_branch}` to /opt/eliza
+#   4. Install systemd units for `eliza-provisioning-worker` and
+#      `eliza-agent-router`
+#
+# Secrets (DATABASE_URL, KV_REST_API_*, STEWARD_*, HCLOUD_TOKEN, etc.) are NOT
+# baked in here. The operator copies `cloud/.env.local` to `/opt/eliza/cloud/.env.local`
+# in a follow-up step (one-shot, out-of-band) using `scp` or the existing
+# `packages/scripts/cloud/admin/bootstrap-provisioning-worker-host.mjs` script.
+
+hostname: ${hostname}
+manage_etc_hosts: true
+
+users:
+  - name: deploy
+    groups: sudo, docker
+    shell: /bin/bash
+    sudo: ALL=(ALL) NOPASSWD:ALL
+    lock_passwd: true
+    # Mirror operator SSH keys onto the deploy user. Hetzner only injects
+    # `hcloud_ssh_key` entries into root by default, but the
+    # `.github/workflows/deploy-eliza-provisioning-worker.yml` workflow SSHes
+    # in as `deploy`. Without this list the first auto-deploy would fail.
+    ssh_authorized_keys:
+%{ for key in operator_ssh_keys ~}
+      - ${key}
+%{ endfor ~}
+
+package_update: true
+package_upgrade: false
+
+apt:
+  sources:
+    docker:
+      source: "deb [arch=amd64 signed-by=$KEY_FILE] https://download.docker.com/linux/ubuntu $RELEASE stable"
+      keyid: 9DC858229FC7DD38854AE2D88D81803C0EBFCD88
+
+packages:
+  - curl
+  - docker-ce
+  - docker-ce-cli
+  - containerd.io
+  - git
+  - jq
+  - nginx
+  - postgresql-client
+  - rsync
+  - unzip
+
+write_files:
+  - path: /etc/profile.d/eliza-paths.sh
+    permissions: "0644"
+    content: |
+      export PATH="/home/deploy/.bun/bin:$PATH"
+      export ELIZA_DEPLOY_BRANCH="${deploy_branch}"
+
+runcmd:
+  # Docker is installed via the `apt` block above using the official
+  # Docker apt repo (GPG-verified keyring), not `curl get.docker.com | sh`.
+  - systemctl enable --now docker
+
+  # Bun runtime for the deploy user. We download a pinned release tarball
+  # AND its SHASUMS256 manifest from the same GitHub release, then verify
+  # the binary against the published hash before extracting. This avoids
+  # the `curl bun.sh/install | bash` supply-chain pattern flagged in the
+  # PR review — we still trust GitHub's HTTPS, but the manifest commits
+  # the publisher to a specific hash that an attacker can't forge mid-flight.
+  - install -d -o deploy -g deploy /home/deploy/.bun /home/deploy/.bun/bin
+  - su - deploy -c 'curl -fsSL -o /tmp/bun.zip https://github.com/oven-sh/bun/releases/download/bun-v1.3.13/bun-linux-x64.zip'
+  - su - deploy -c 'curl -fsSL -o /tmp/bun-SHASUMS256.txt https://github.com/oven-sh/bun/releases/download/bun-v1.3.13/SHASUMS256.txt'
+  - su - deploy -c 'cd /tmp && grep "bun-linux-x64.zip" bun-SHASUMS256.txt | sha256sum -c -'
+  - su - deploy -c 'cd /tmp && unzip -q bun.zip && install -m 0755 bun-linux-x64/bun /home/deploy/.bun/bin/bun && rm -rf /tmp/bun.zip /tmp/bun-linux-x64 /tmp/bun-SHASUMS256.txt'
+
+  # Repo checkout — the GitHub deploy workflow then takes over for code updates.
+  - mkdir -p /opt/eliza && chown deploy:deploy /opt/eliza
+  - su - deploy -c 'git clone --depth 200 --branch ${deploy_branch} https://github.com/elizaOS/eliza.git /opt/eliza'
+
+  # Final marker for the bootstrap-warn check that runs after this.
+  - echo "eliza-control-plane bootstrap finished $(date -u +%Y-%m-%dT%H:%M:%SZ)" > /var/log/eliza-bootstrap.log