elizaOS · lalalune · May 22, 2026
diff --git a/packages/cloud-infra/cloud/terraform/hetzner/ARCHITECTURE.md b/packages/cloud-infra/cloud/terraform/hetzner/ARCHITECTURE.md
@@ -0,0 +1,124 @@
+# Hetzner Control Plane vs Data Plane
+
+eliza Cloud runs on a two-tier Hetzner Cloud setup. This doc nails down the
+split so we stop treating manually-created VMs as "infrastructure-by-prayer".
+
+## Layers
+
+```
+┌──────────────────────────────────────────────────────────────┐
+│  Tier 1 — Control plane (static, 1-2 VMs, Terraform)        │
+│                                                              │
+│   eliza-cp-production-1   (Hetzner cpx21, fsn1)             │
+│     ├── eliza-provisioning-worker  (systemd, queue consumer)│
+│     ├── eliza-agent-router         (systemd, HTTP routing)  │
+│     ├── headscale                  (VPN mesh)               │
+│     ├── cloudflared tunnel         (public ingress)         │
+│     ├── nginx                      (reverse proxy)          │
+│     └── (optional: grafana/prometheus)                      │
+│                                                              │
+│   Lifecycle: long-lived. Replaced on demand, not autoscaled.│
+│   Cost: ~€5/mo per VM (cpx21).                              │
+└──────────────────────────────────────────────────────────────┘
+                              │ enqueue / SSH
+                              ▼
+┌──────────────────────────────────────────────────────────────┐
+│  Tier 2 — Data plane (elastic, N cores, runtime autoscale)  │
+│                                                              │
+│   eliza-core-<hex>   (Hetzner cpx32, fsn1)                  │
+│     ├── Docker daemon                                       │
+│     └── eliza-sandbox containers × N                        │
+│                                                              │
+│   Lifecycle: created/drained by node-autoscaler.ts based on │
+│   real demand. Server limit: ~25 (Hetzner default).         │
+│   Cost: elastic (~€11/mo per running cpx32).                │
+└──────────────────────────────────────────────────────────────┘
+```
+
+## Why two tiers
+
+| Concern              | Control plane          | Data plane                |
+|----------------------|------------------------|---------------------------|
+| **Provisioning**     | Terraform (one-shot)   | Runtime API (node-autoscaler.ts) |
+| **Lifecycle**        | Persistent             | Ephemeral                 |
+| **State**            | Has local state (headscale DB, cloudflared creds) | Stateless |
+| **Failure mode**     | Page someone           | Replace automatically     |
+| **Cost predictability** | Fixed monthly       | Elastic                   |
+| **What lives here**  | Orchestrator, routing, monitoring | Just Docker + agents |
+
+The split prevents the "control plane melts with the data plane during a
+traffic spike" failure mode. Pulling sandboxes off the data plane is the
+autoscaler's job; the orchestrator that issues drain commands must stay up
+to coordinate it.
+
+## Code ↔ infrastructure mapping
+
+| Component | Code | Infra |
+|---|---|---|
+| Control plane VM | [`packages/scripts/cloud/admin/daemons/provisioning-worker.ts`](../../../../scripts/cloud/admin/daemons/provisioning-worker.ts) | [Terraform: `control-plane/`](./control-plane/) |
+| Agent router | [`packages/scripts/cloud/admin/daemons/agent-router.ts`](../../../../scripts/cloud/admin/daemons/agent-router.ts) | systemd unit on control-plane VM |
+| Data plane autoscaler | [`packages/cloud-shared/src/lib/services/containers/node-autoscaler.ts`](../../../../cloud-shared/src/lib/services/containers/node-autoscaler.ts) | Hetzner Cloud API at runtime |
+| Sandbox provisioning | [`packages/cloud-shared/src/lib/services/docker-sandbox-provider.ts`](../../../../cloud-shared/src/lib/services/docker-sandbox-provider.ts) | SSH from control plane to data plane |
+
+## Naming convention
+
+| Layer | Prefix | Example | Where it's set |
+|---|---|---|---|
+| Control plane VM | `eliza-cp-<env>-<n>` | `eliza-cp-production-1` | Terraform `var.environment` |
+| Data plane node (NEW) | `eliza-core-<hex>` | `eliza-core-38ea87b1` | [`generateNodeId()`](../../../../cloud-shared/src/lib/services/containers/node-autoscaler.ts) |
+| Data plane node (LEGACY) | `milady-core-<n>` | `milady-core-1` | DEPRECATED; see [Legacy migration](#legacy-milady-core--migration) |
+
+## Legacy `milady-core-*` migration
+
+Pre-2026-05 the data plane was 6 manually-created `milady-core-*` VMs (Hetzner
+cpx32 in fsn1) inserted by-hand into `docker_nodes` with `capacity = 100`.
+By 2026-05-22:
+
+- All 6 cores were `status: offline` (SSH health-check failing for weeks)
+- Several user sandboxes still ran on the underlying Docker daemons
+- The cloud autoscaler couldn't account for them
+
+Migration 0132 (`0132_legacy_milady_cores_disable.sql`) sets them to
+`enabled = false` and corrects `capacity` from 100 to 8. This:
+
+1. Removes them from autoscaler capacity decisions
+2. Stops the health-check noise
+3. Lets the autoscaler spin up replacement `eliza-core-<hex>` nodes on demand
+
+Existing sandboxes keep running until next restart. On user-triggered
+restart / recreate, the daemon provisions them on a fresh autoscaled core.
+
+Once `SELECT SUM(allocated_count) FROM docker_nodes WHERE node_id LIKE 'milady-core-%'`
+is `0`, ops can:
+
+```bash
+# 1. Delete the Hetzner Cloud servers (one-time, via Hetzner console or):
+hcloud server delete milady-core-1
+hcloud server delete milady-core-2
+# ... etc.
+
+# 2. Drop the DB rows (SQL, so run it via psql; assumes DATABASE_URL points at the cloud DB):
+psql "$DATABASE_URL" -c "DELETE FROM docker_nodes WHERE node_id LIKE 'milady-core-%';"
+```
+
+## Followups (not in this initial PR)
+
+- [ ] Terraform module for headscale state (preauth keys, ACLs)
+- [ ] Terraform module for the cloudflared tunnel (currently created by-hand)
+- [ ] Terraform-apply GitHub workflow (`infra/**` path filter)
+- [ ] Move the 4 remaining cron paths off the orphan
+      `container-control-plane` service onto the daemon-queue pattern
+      (`pool-replenish`, `pool-health-check`, `pool-image-rollout`,
+      `deployment-monitor`). Once done, retire the
+      `packages/cloud-services/container-control-plane/` package entirely.
+- [ ] Raise Hetzner Cloud server limit (open ticket) to enable autoscale
+      past the account's default server cap.
+
+## Operator runbook
+
+See [`control-plane/README.md`](./control-plane/README.md)
+for the step-by-step:
+
+- Bootstrap a brand-new control-plane VM
+- Import the existing production VM into Terraform
+- Verify state, plan, apply
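
The naming table and the legacy drain check in ARCHITECTURE.md can be sketched in TypeScript as below. This is an illustration under stated assumptions, not code from `node-autoscaler.ts`: `classifyNode`, `legacyAllocated`, and the `DockerNodeRow` shape are hypothetical names, and the character set allowed in `<env>` is a guess.

```typescript
// Sketch only: these helpers are hypothetical and do not exist in the repo.

type Tier = "control-plane" | "data-plane" | "legacy-data-plane" | "unknown";

// Patterns follow the naming table: eliza-cp-<env>-<n>, eliza-core-<hex>,
// milady-core-<n>. The allowed characters for <env> are an assumption.
const CONTROL_PLANE = /^eliza-cp-[a-z0-9]+-\d+$/;
const DATA_PLANE = /^eliza-core-[0-9a-f]+$/;
const LEGACY = /^milady-core-\d+$/;

function classifyNode(name: string): Tier {
  if (CONTROL_PLANE.test(name)) return "control-plane";
  if (DATA_PLANE.test(name)) return "data-plane";
  if (LEGACY.test(name)) return "legacy-data-plane";
  return "unknown";
}

// Row shape assumed from the two docker_nodes columns the document names.
interface DockerNodeRow {
  node_id: string;
  allocated_count: number;
}

// In-memory equivalent of:
//   SELECT SUM(allocated_count) FROM docker_nodes
//   WHERE node_id LIKE 'milady-core-%'
function legacyAllocated(rows: DockerNodeRow[]): number {
  return rows
    .filter((row) => row.node_id.startsWith("milady-core-"))
    .reduce((sum, row) => sum + row.allocated_count, 0);
}
```

A result of `0` from `legacyAllocated` corresponds to the point at which the document says the legacy servers and rows can be deleted.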
diff --git a/packages/cloud-infra/cloud/terraform/hetzner/control-plane/.gitignore b/packages/cloud-infra/cloud/terraform/hetzner/control-plane/.gitignore
@@ -0,0 +1,11 @@
+.terraform/
+.terraform.lock.hcl
+*.tfstate
+*.tfstate.backup
+crash.log
+crash.*.log
+
+# Live tfvars contain SSH public keys + zone IDs — keep them out of git.
+# Only `.tfvars.example` lives in the repo.
+*.tfvars
+!*.tfvars.example
diff --git a/packages/cloud-infra/cloud/terraform/hetzner/control-plane/README.md b/packages/cloud-infra/cloud/terraform/hetzner/control-plane/README.md
@@ -0,0 +1,97 @@
+# hetzner/control-plane — Terraform for the eliza Cloud control-plane VMs
+
+This Terraform module manages the **persistent** Hetzner Cloud VM(s) that host
+the elizaOS Cloud control-plane:
+
+- `eliza-provisioning-worker` — pulls jobs from the `jobs` table and SSHs
+  into sandbox cores
+- `eliza-agent-router` — subdomain HTTP routing
+- `cloudflared` — secure tunnel for `sandboxes.waifu.fun`
+- `headscale` — VPN mesh for cross-core agent traffic
+
+The **data plane** (the sandbox cores themselves) is **not** managed here —
+those are provisioned and drained at runtime by
+[`node-autoscaler.ts`](../../../../../cloud-shared/src/lib/services/containers/node-autoscaler.ts)
+which talks to the Hetzner Cloud API directly. See
+[`../ARCHITECTURE.md`](../ARCHITECTURE.md) for the full split.
+
+## Prerequisites
+
+1. **Hetzner Cloud project** with API token (`HCLOUD_TOKEN`).
+2. **Cloudflare account** with API token + DNS edit on `elizacloud.ai`
+   (`CLOUDFLARE_API_TOKEN`).
+3. **Cloudflare R2 bucket** `eliza-terraform-state` for remote state. Generate
+   an R2 API token, edit `backend-staging.hcl` / `backend-production.hcl`
+   with your CF account ID, then export the R2 token as
+   `AWS_ACCESS_KEY_ID` + `AWS_SECRET_ACCESS_KEY` before `terraform init`.
+4. **Terraform >= 1.6.0** locally (the backend files use `endpoints` and `use_path_style`, which 1.5 rejects).
+
+## Bootstrap a brand-new control-plane VM (staging)
+
+```bash
+cd packages/cloud-infra/cloud/terraform/hetzner/control-plane
+
+# 1. Pull providers + connect remote state.
+terraform init -backend-config=backend-staging.hcl
+
+# 2. Copy + fill tfvars.
+cp tfvars/staging.tfvars.example tfvars/staging.tfvars
+$EDITOR tfvars/staging.tfvars
+
+# 3. Plan + apply.
+export HCLOUD_TOKEN=...
+export CLOUDFLARE_API_TOKEN=...
+terraform plan -var-file=tfvars/staging.tfvars
+terraform apply -var-file=tfvars/staging.tfvars
+
+# 4. Output gives you the VM IP. Copy the cloud env file into place (source path is repo-root relative):
+scp "$(git rev-parse --show-toplevel)/packages/cloud-shared/.env.local" root@<vm-ip>:/opt/eliza/cloud/.env.local
+
+# 5. Trigger first deploy from GitHub Actions
+#    (workflow: deploy-eliza-provisioning-worker.yml, manual dispatch).
+```
+
+## Adopt the existing production VM into Terraform
+
+The current prod manager VM (`89.167.63.246`, hostname `milady`) was
+created by-hand via `packages/scripts/cloud/admin/bootstrap-provisioning-worker-host.mjs`
+on May 7th. To bring it under Terraform without recreating it:
+
+```bash
+# 1. Look up the Hetzner Cloud server ID:
+hcloud server list  # find the one with IP 89.167.63.246
+
+# 2. Import the existing resource into state:
+terraform init -backend-config=backend-production.hcl
+terraform import \
+  -var-file=tfvars/production.tfvars \
+  'hcloud_server.control_plane["1"]' \
+  <SERVER_ID>
+
+# 3. Run `terraform plan` and adjust variables until the diff is empty.
+#    Common drift: labels, ssh_keys (manually added during initial setup).
+```
+
+## What this module does NOT manage (yet)
+
+- Headscale state (preauth keys, ACLs) — manual via `headscale` CLI.
+- Cloudflared tunnels — config lives at `/root/.cloudflared/` on the VM and
+  is created via `cloudflared tunnel create` one-shot.
+- The systemd units — installed by `deploy-eliza-provisioning-worker.yml`
+  on every push.
+- The actual eliza Cloud sandbox cores (data plane) — runtime autoscale.
+
+The headscale and cloudflared items are tracked as followups in
+[`../ARCHITECTURE.md`](../ARCHITECTURE.md#followups-not-in-this-initial-pr).
+
+## Cost
+
+| Component                    | Resource    | Monthly (€) |
+|------------------------------|-------------|-------------|
+| 1× cpx21 (3 vCPU / 4 GB)     | control VM  | ~5          |
+| 1× IPv4 + IPv6               | primary IP  | included    |
+| Cloudflare R2 state          | < 100 KB    | 0           |
+| **Total per environment**    |             | **~5**      |
+
+A 2nd control-plane VM (HA, currently unused) doubles the total. The
+**data-plane autoscale** cost is separate and elastic.
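
The cost table in the README reduces to simple arithmetic, sketched here in TypeScript. The per-VM figures are the approximate ones quoted in this PR (~€5 per cpx21, ~€11 per running cpx32), not live Hetzner prices, and `estimateMonthlyEur` is a hypothetical helper.

```typescript
// Approximate monthly prices as quoted in this PR; treat them as placeholders.
const CONTROL_PLANE_EUR_PER_VM = 5; // ~EUR 5/mo per cpx21
const DATA_PLANE_EUR_PER_NODE = 11; // ~EUR 11/mo per running cpx32

// Hypothetical helper: fixed control-plane cost plus elastic data-plane cost.
function estimateMonthlyEur(controlPlaneVms: number, dataPlaneNodes: number): number {
  if (controlPlaneVms < 0 || dataPlaneNodes < 0) {
    throw new RangeError("VM counts must be non-negative");
  }
  return (
    controlPlaneVms * CONTROL_PLANE_EUR_PER_VM +
    dataPlaneNodes * DATA_PLANE_EUR_PER_NODE
  );
}
```

With one control-plane VM and no running cores the estimate matches the table's per-environment total. Each running core adds the data-plane figure on top.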
diff --git a/packages/cloud-infra/cloud/terraform/hetzner/control-plane/backend-production.hcl b/packages/cloud-infra/cloud/terraform/hetzner/control-plane/backend-production.hcl
@@ -0,0 +1,9 @@
+bucket                      = "eliza-terraform-state"
+key                         = "hetzner/control-plane/production.tfstate"
+region                      = "auto"
+endpoints                   = { s3 = "https://23cf6feaeaa541f6a0675053c33da768.r2.cloudflarestorage.com" }
+skip_credentials_validation = true
+skip_metadata_api_check     = true
+skip_region_validation      = true
+skip_requesting_account_id  = true
+use_path_style              = true
diff --git a/packages/cloud-infra/cloud/terraform/hetzner/control-plane/backend-staging.hcl b/packages/cloud-infra/cloud/terraform/hetzner/control-plane/backend-staging.hcl
@@ -0,0 +1,20 @@
+# Cloudflare R2 backend (S3-compatible) for terraform state.
+#
+# Set up once per environment:
+#   1. Create R2 bucket `eliza-terraform-state` in the elizaOS CF account.
+#   2. Generate R2 API token with read/write on that bucket.
+#   3. Export AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY pointing at the
+#      R2 token before `terraform init`.
+#
+# Usage:
+#   terraform init -backend-config=backend-staging.hcl
+
+bucket                      = "eliza-terraform-state"
+key                         = "hetzner/control-plane/staging.tfstate"
+region                      = "auto"
+endpoints                   = { s3 = "https://23cf6feaeaa541f6a0675053c33da768.r2.cloudflarestorage.com" }
+skip_credentials_validation = true
+skip_metadata_api_check     = true
+skip_region_validation      = true
+skip_requesting_account_id  = true
+use_path_style              = true
diff --git a/packages/cloud-infra/cloud/terraform/hetzner/control-plane/cloud-init/bootstrap.yaml.tftpl b/packages/cloud-infra/cloud/terraform/hetzner/control-plane/cloud-init/bootstrap.yaml.tftpl
@@ -0,0 +1,60 @@
+#cloud-config
+# Cloud-init bootstrap for eliza control-plane VMs.
+#
+# At first boot:
+#   1. Install system deps (docker, nginx, bun, git, jq, postgresql-client)
+#   2. Add a `deploy` user that the GitHub Actions workflow
+#      `.github/workflows/deploy-eliza-provisioning-worker.yml` SSHes into
+#   3. Clone the monorepo on `${deploy_branch}` to /opt/eliza
+#   4. Leave the systemd units for `eliza-provisioning-worker` and
+#      `eliza-agent-router` to that deploy workflow (not installed here)
+#
+# Secrets (DATABASE_URL, KV_REST_API_*, STEWARD_*, HCLOUD_TOKEN, etc.) are NOT
+# baked in here. The operator copies `cloud/.env.local` to `/opt/eliza/cloud/.env.local`
+# in a follow-up step (one-shot, out-of-band) using `scp` or the existing
+# `packages/scripts/cloud/admin/bootstrap-provisioning-worker-host.mjs` script.
+
+hostname: ${hostname}
+manage_etc_hosts: true
+
+users:
+  - name: deploy
+    groups: sudo, docker
+    shell: /bin/bash
+    sudo: ALL=(ALL) NOPASSWD:ALL
+    lock_passwd: true
+
+package_update: true
+package_upgrade: false
+
+packages:
+  - curl
+  - git
+  - jq
+  - nginx
+  - postgresql-client
+  - rsync
+  - unzip
+
+write_files:
+  - path: /etc/profile.d/eliza-paths.sh
+    permissions: "0644"
+    content: |
+      export PATH="/home/deploy/.bun/bin:$PATH"
+      export ELIZA_DEPLOY_BRANCH="${deploy_branch}"
+
+runcmd:
+  # Docker (official convenience installer — pin sha if you want it
+  # auditable; we keep it simple for the bootstrap).
+  - curl -fsSL https://get.docker.com | sh
+  - systemctl enable --now docker
+
+  # Bun runtime for the deploy user (the daemons run under bun/tsx).
+  - su - deploy -c 'curl -fsSL https://bun.sh/install | bash -s "bun-v1.3.13"'
+
+  # Repo checkout — the GitHub deploy workflow then takes over for code updates.
+  - mkdir -p /opt/eliza && chown deploy:deploy /opt/eliza
+  - su - deploy -c 'git clone --depth 200 --branch ${deploy_branch} https://github.com/elizaOS/eliza.git /opt/eliza'
+
+  # Final marker for the bootstrap-warn check that runs after this.
+  - echo "eliza-control-plane bootstrap finished $(date -u +%Y-%m-%dT%H:%M:%SZ)" > /var/log/eliza-bootstrap.log
diff --git a/packages/cloud-infra/cloud/terraform/hetzner/control-plane/main.tf b/packages/cloud-infra/cloud/terraform/hetzner/control-plane/main.tf
@@ -0,0 +1,57 @@
+locals {
+  # Tags applied to every Hetzner Cloud resource managed here. Mirrors the
+  # data-plane convention (`managed-by: eliza-cloud`) used by the runtime
+  # autoscaler so a single search in the Hetzner Console reveals everything.
+  common_labels = {
+    "managed-by"  = "eliza-cloud"
+    "tier"        = "control-plane"
+    "environment" = var.environment
+  }
+}
+
+resource "hcloud_ssh_key" "operators" {
+  for_each = { for idx, key in var.ssh_public_keys : idx => key }
+
+  name       = "eliza-cp-${var.environment}-op-${each.key}"
+  public_key = each.value
+  labels     = local.common_labels
+}
+
+resource "hcloud_server" "control_plane" {
+  for_each = toset([for i in range(var.control_plane_count) : tostring(i + 1)])
+
+  name        = "eliza-cp-${var.environment}-${each.value}"
+  location    = var.hcloud_location
+  server_type = var.hcloud_server_type
+  image       = var.hcloud_image
+  ssh_keys    = [for k in hcloud_ssh_key.operators : k.id]
+  labels = merge(local.common_labels, {
+    "control-plane-index" = each.value
+  })
+
+  user_data = templatefile("${path.module}/cloud-init/bootstrap.yaml.tftpl", {
+    hostname      = "eliza-cp-${var.environment}-${each.value}"
+    deploy_branch = var.deploy_branch
+  })
+
+  # Keep the server alive across refactors: user_data, image, and ssh_keys
+  # all force replacement in the hcloud provider, so ignore drift on them.
+  lifecycle {
+    ignore_changes = [
+      user_data, image, # rebuild on purpose via `terraform apply -replace=...`
+      ssh_keys,         # injected only at creation; a new operator key would recreate the box
+    ]
+  }
+}
+
+resource "cloudflare_dns_record" "control_plane" {
+  for_each = hcloud_server.control_plane
+
+  zone_id = var.cloudflare_zone_id
+  name    = "${var.control_plane_hostname_prefix}-${var.environment}-${each.key}.elizacloud.ai"
+  type    = "A"
+  content = each.value.ipv4_address
+  ttl     = 60
+  proxied = false
+  comment = "eliza control-plane VM ${each.value.name} (managed by terraform/hetzner/control-plane)"
+}
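
The `for_each` naming in `main.tf` can be mirrored in TypeScript to preview which server names, DNS names, and import addresses a given `control_plane_count` yields. This is a sketch: `controlPlaneEntries` is a hypothetical helper, and the value of `var.control_plane_hostname_prefix` is not shown in this diff, so it is passed in as a parameter.

```typescript
interface ControlPlaneEntry {
  key: string;           // for_each key: "1", "2", ...
  serverName: string;    // hcloud_server name
  dnsName: string;       // cloudflare_dns_record name
  importAddress: string; // address to pass to `terraform import`
}

// Mirrors toset([for i in range(count) : tostring(i + 1)]) and the name
// templates in main.tf. hostnamePrefix stands in for
// var.control_plane_hostname_prefix, whose real value is not in this diff.
function controlPlaneEntries(
  environment: string,
  count: number,
  hostnamePrefix: string,
): ControlPlaneEntry[] {
  const entries: ControlPlaneEntry[] = [];
  for (let i = 0; i < count; i++) {
    const key = String(i + 1);
    entries.push({
      key,
      serverName: `eliza-cp-${environment}-${key}`,
      dnsName: `${hostnamePrefix}-${environment}-${key}.elizacloud.ai`,
      importAddress: `hcloud_server.control_plane["${key}"]`,
    });
  }
  return entries;
}
```

For `environment = "production"` and a count of 1, the first entry's `importAddress` is the same address the README passes to `terraform import`.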