124 changes: 124 additions & 0 deletions packages/cloud-infra/cloud/terraform/hetzner/ARCHITECTURE.md
@@ -0,0 +1,124 @@
# Hetzner Control Plane vs Data Plane

eliza Cloud runs on a two-tier Hetzner Cloud setup. This doc nails down the
split so we stop treating manually-created VMs as "infrastructure-by-prayer".

## Layers

```
┌──────────────────────────────────────────────────────────────┐
│  Tier 1: Control plane (static, 1-2 VMs, Terraform)          │
│                                                              │
│  eliza-1 (Hetzner cpx21, fsn1)                               │
│    ├── eliza-provisioning-worker  (systemd, queue consumer)  │
│    ├── eliza-agent-router         (systemd, HTTP routing)    │
│    ├── headscale                  (VPN mesh)                 │
│    ├── cloudflared tunnel         (public ingress)           │
│    ├── nginx                      (reverse proxy)            │
│    └── (optional: grafana/prometheus)                        │
│                                                              │
│  Lifecycle: long-lived. Replaced on demand, not autoscaled.  │
│  Cost: ~€5/mo per VM (cpx21).                                │
└──────────────────────────────────────────────────────────────┘
                │ enqueue / SSH
┌──────────────────────────────────────────────────────────────┐
│  Tier 2: Data plane (elastic, N cores, runtime autoscale)    │
│                                                              │
│  eliza-core-<hex> (Hetzner cpx32, fsn1)                      │
│    ├── Docker daemon                                         │
│    └── eliza-sandbox containers × N                          │
│                                                              │
│  Lifecycle: created/drained by node-autoscaler.ts based on   │
│  real demand. Server limit: ~25 (Hetzner default).           │
│  Cost: elastic (~€11/mo per running cpx32).                  │
└──────────────────────────────────────────────────────────────┘
```

## Why two tiers

| Concern | Control plane | Data plane |
|----------------------|------------------------|---------------------------|
| **Provisioning** | Terraform (one-shot) | Runtime API (node-autoscaler.ts) |
| **Lifecycle** | Persistent | Ephemeral |
| **State** | Has local state (headscale DB, cloudflared creds) | Stateless |
| **Failure mode** | Page someone | Replace automatically |
| **Cost predictability** | Fixed monthly | Elastic |
| **What lives here** | Orchestrator, routing, monitoring | Just Docker + agents |

The split prevents the "control plane melts with the data plane during a
traffic spike" failure mode. Pulling sandboxes off the data plane is the
autoscaler's job; the orchestrator that issues drain commands must stay up
to coordinate it.

## Code ↔ infrastructure mapping

| Component | Code | Infra |
|---|---|---|
| Control plane VM | [`packages/scripts/cloud/admin/daemons/provisioning-worker.ts`](../../../../scripts/cloud/admin/daemons/provisioning-worker.ts) | [Terraform: `control-plane/`](./control-plane/) |
| Agent router | [`packages/scripts/cloud/admin/daemons/agent-router.ts`](../../../../scripts/cloud/admin/daemons/agent-router.ts) | systemd unit on control-plane VM |
| Data plane autoscaler | [`packages/cloud-shared/src/lib/services/containers/node-autoscaler.ts`](../../../../cloud-shared/src/lib/services/containers/node-autoscaler.ts) | Hetzner Cloud API at runtime |
| Sandbox provisioning | [`packages/cloud-shared/src/lib/services/docker-sandbox-provider.ts`](../../../../cloud-shared/src/lib/services/docker-sandbox-provider.ts) | SSH from control plane to data plane |

## Naming convention

| Layer | Prefix | Example | Where it's set |
|---|---|---|---|
| Control plane VM | `eliza-<n>` | `eliza-1` | Terraform `hcloud_server.control_plane` |
| Data plane node (NEW) | `eliza-core-<hex>` | `eliza-core-38ea87b1` | [`generateNodeId()`](../../../../cloud-shared/src/lib/services/containers/node-autoscaler.ts) |
| Data plane node (LEGACY) | `milady-core-<n>` | `milady-core-1` | DEPRECATED; see [Legacy migration](#legacy-milady-core--migration) |
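
The prefixes in the table are enough to tell the layers apart from a server name alone. The following is a minimal sketch, not the real code: it assumes the `<hex>` suffix is 8 lowercase hex characters (inferred only from the `eliza-core-38ea87b1` example), and `classifyNode` is a hypothetical helper. The actual `generateNodeId()` in `node-autoscaler.ts` may generate IDs differently.

```typescript
type NodeLayer = "control-plane" | "data-plane" | "legacy-data-plane" | "unknown";

// Assumption: the suffix is a random 32-bit value rendered as 8 hex characters.
function generateNodeId(): string {
  const n = Math.floor(Math.random() * 0x100000000);
  // Adding 2^32 and dropping the leading "1" zero-pads the value to 8 digits.
  const hex = (0x100000000 + n).toString(16).slice(1);
  return `eliza-core-${hex}`;
}

// Hypothetical helper: map a server name to its layer using the table's prefixes.
function classifyNode(name: string): NodeLayer {
  if (/^eliza-core-[0-9a-f]+$/.test(name)) return "data-plane";
  if (/^milady-core-\d+$/.test(name)) return "legacy-data-plane";
  if (/^eliza-\d+$/.test(name)) return "control-plane";
  return "unknown";
}
```

A helper like this could be used in cleanup scripts to make sure a bulk operation never touches a control-plane VM by accident.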

## Legacy `milady-core-*` migration

Before May 2026 the data plane consisted of six manually created `milady-core-*`
VMs (Hetzner cpx32 in fsn1), inserted by hand into `docker_nodes` with
`capacity = 100`. By 2026-05-22:

- All 6 cores were `status: offline` (SSH health-check failing for weeks)
- Several user sandboxes still ran on the underlying Docker daemons
- The cloud autoscaler couldn't account for them

Migration 0132 (`0132_legacy_milady_cores_disable.sql`) sets `enabled = false`
on these rows and corrects `capacity` to `8`. This:

1. Removes them from autoscaler capacity decisions
2. Stops the health-check noise
3. Lets the autoscaler spin up replacement `eliza-core-<hex>` nodes on demand

Existing sandboxes keep running until next restart. On user-triggered
restart / recreate, the daemon provisions them on a fresh autoscaled core.
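
To illustrate why disabling the legacy rows changes autoscaler behaviour, here is a hedged sketch of a capacity calculation that skips disabled nodes. The field names mirror the `docker_nodes` columns mentioned in this doc (`enabled`, `capacity`, `allocated_count`); the `minFree` threshold and both function names are hypothetical, and the real logic in `node-autoscaler.ts` may weigh other signals.

```typescript
interface DockerNode {
  nodeId: string;
  enabled: boolean;
  capacity: number;       // max sandboxes the node may host
  allocatedCount: number; // sandboxes currently placed on it
}

// Free sandbox slots the autoscaler can count on; disabled nodes contribute nothing.
function freeSlots(nodes: DockerNode[]): number {
  return nodes
    .filter((n) => n.enabled)
    .reduce((sum, n) => sum + Math.max(0, n.capacity - n.allocatedCount), 0);
}

// Hypothetical policy: ask for a new core when fewer than `minFree` slots remain.
function needsNewCore(nodes: DockerNode[], minFree: number): boolean {
  return freeSlots(nodes) < minFree;
}
```

With the legacy rows at `enabled = true` and `capacity = 100`, a calculation like this would report hundreds of phantom free slots on offline machines, so no replacement core would ever be requested.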

Once `SELECT SUM(allocated_count) FROM docker_nodes WHERE node_id LIKE 'milady-core-%'`
is `0`, ops can:

```bash
# 1. Delete the Hetzner Cloud servers (one-time, via Hetzner console or):
hcloud server delete milady-core-1
hcloud server delete milady-core-2
# ... etc.

# 2. Drop the DB rows (SQL, run through psql; assumes DATABASE_URL points at the cloud DB):
psql "$DATABASE_URL" -c "DELETE FROM docker_nodes WHERE node_id LIKE 'milady-core-%';"
```
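
The precondition for the cleanup above can also be expressed as a small guard. This is a sketch under assumptions: `NodeRow` is a hypothetical shape for rows read from `docker_nodes`, and it treats `allocated_count` as never negative, so "the sum is 0" and "no legacy row has a positive count" are the same condition.

```typescript
interface NodeRow {
  nodeId: string;
  allocatedCount: number;
}

// Legacy nodes that still host at least one sandbox.
function legacyNodesStillInUse(rows: NodeRow[]): NodeRow[] {
  return rows.filter(
    (r) => r.nodeId.startsWith("milady-core-") && r.allocatedCount > 0,
  );
}

// True only when every milady-core-* row has drained to zero.
function safeToDeleteLegacy(rows: NodeRow[]): boolean {
  return legacyNodesStillInUse(rows).length === 0;
}
```

Returning the offending rows, rather than only a boolean, lets an operator see which cores are still holding sandboxes before retrying.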

## Followups (not in this initial PR)

- [ ] Terraform module for headscale state (preauth keys, ACLs)
- [ ] Terraform module for the cloudflared tunnel (currently created by hand)
- [ ] Terraform-apply GitHub workflow (`infra/**` path filter)
- [ ] Move the 4 remaining cron paths off the orphan
  `container-control-plane` service onto the daemon-queue pattern
  (`pool-replenish`, `pool-health-check`, `pool-image-rollout`,
  `deployment-monitor`). Once done, retire the
  `packages/cloud-services/container-control-plane/` package entirely.
- [ ] Raise Hetzner Cloud server limit (open ticket) to enable autoscale
  past the account's default server cap (~25 per the Layers diagram; confirm
  the exact value in the Hetzner console).

## Operator runbook

See [`control-plane/README.md`](./control-plane/README.md)
for the step-by-step:

- Bootstrap a brand-new control-plane VM
- Import the existing production VM into Terraform
- Verify state, plan, apply
@@ -0,0 +1,11 @@
.terraform/
.terraform.lock.hcl
*.tfstate
*.tfstate.backup
crash.log
crash.*.log

# Live tfvars contain SSH public keys + zone IDs — keep them out of git.
# Only `.tfvars.example` lives in the repo.
*.tfvars
!*.tfvars.example
@@ -0,0 +1,88 @@
# hetzner/control-plane — Terraform for the eliza Cloud control-plane VMs

This Terraform module manages the **persistent** Hetzner Cloud VM(s) that host
the elizaOS Cloud control-plane:

- `eliza-provisioning-worker` — pulls jobs from the `jobs` table and SSHs
into sandbox cores
- `eliza-agent-router` — subdomain HTTP routing
- `cloudflared` — secure tunnel for `sandboxes.waifu.fun`
- `headscale` — VPN mesh for cross-core agent traffic

The **data plane** (the sandbox cores themselves) is **not** managed here —
those are provisioned and drained at runtime by
[`node-autoscaler.ts`](../../../../../cloud-shared/src/lib/services/containers/node-autoscaler.ts)
which talks to the Hetzner Cloud API directly. See
[`../ARCHITECTURE.md`](../ARCHITECTURE.md) for the full split.

## Prerequisites

1. **Hetzner Cloud project** with API token (`HCLOUD_TOKEN`).
2. **Cloudflare account** with an API token that has DNS edit rights on
   `elizacloud.ai` (`CLOUDFLARE_API_TOKEN`).
3. **Cloudflare R2 bucket** `eliza-terraform-state` for remote state. Generate
   an R2 API token, set your CF account ID in `backend-staging.hcl` /
   `backend-production.hcl`, then export the token's access key ID and secret
   as `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` before `terraform init`.
4. **Terraform >= 1.5.0** locally.

## Bootstrap a brand-new control-plane VM (staging)

```bash
cd packages/cloud-infra/cloud/terraform/hetzner/control-plane

# 1. Pull providers + connect remote state.
terraform init -backend-config=backend-staging.hcl

# 2. Copy + fill tfvars.
cp tfvars/staging.tfvars.example tfvars/staging.tfvars
$EDITOR tfvars/staging.tfvars

# 3. Plan + apply.
export HCLOUD_TOKEN=...
export CLOUDFLARE_API_TOKEN=...
terraform plan -var-file=tfvars/staging.tfvars
terraform apply -var-file=tfvars/staging.tfvars

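# 4. Output gives you the VM IP. Copy the cloud env file into place
#    (the source path is relative to the repo root, not this directory):
scp "$(git rev-parse --show-toplevel)/packages/cloud-shared/.env.local" root@<vm-ip>:/opt/eliza/cloud/.env.local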

# 5. Trigger first deploy from GitHub Actions
# (workflow: deploy-eliza-provisioning-worker.yml, manual dispatch).
```

## Adopt the existing production VM into Terraform

The current prod manager VM (`89.167.63.246`, historical hostname
`milady`) was created by hand in May 2026. To bring it under Terraform
without recreating it:

1. Look up the Hetzner Cloud server ID with `hcloud server list`.
2. Run `terraform import 'hcloud_server.control_plane["1"]' <id>`.
3. Run a `terraform import` for each existing `hcloud_ssh_key`.

The first plan after import shows the in-place rename `milady → eliza-1`, the
new labels, and the Cloudflare DNS record creation; `user_data` and `image`
diffs are suppressed by `lifecycle { ignore_changes }`. The import is a
one-time operation: do not re-run it once the resources are in state.
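
Because the import must run exactly once, it can help to generate the full command list up front and review it before executing anything. The sketch below only builds argument arrays and runs nothing. The server address `hcloud_server.control_plane["1"]` comes from the text above; the SSH key address `hcloud_ssh_key.operator["<name>"]` is an assumption about how the module declares its keys and must be checked against the actual `.tf` files.

```typescript
// Builds the `terraform import` invocations as argv arrays for review.
// `sshKeys` maps a key name in the module to its Hetzner Cloud SSH key ID.
function buildImportCommands(
  serverId: string,
  sshKeys: Record<string, string>,
): string[][] {
  const cmds: string[][] = [
    ["terraform", "import", 'hcloud_server.control_plane["1"]', serverId],
  ];
  // Sort names so the generated list is stable between runs.
  for (const name of Object.keys(sshKeys).sort()) {
    // Assumption: keys are declared as hcloud_ssh_key.operator["<name>"].
    cmds.push([
      "terraform",
      "import",
      `hcloud_ssh_key.operator["${name}"]`,
      sshKeys[name],
    ]);
  }
  return cmds;
}
```

Keeping the arguments as arrays rather than one shell string sidesteps the quoting problems that resource addresses containing `["1"]` otherwise cause.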

## What this module does NOT manage (yet)

- Headscale state (preauth keys, ACLs) — manual via `headscale` CLI.
- Cloudflared tunnels — config lives at `/root/.cloudflared/` on the VM and
is created via `cloudflared tunnel create` one-shot.
- The systemd units — installed by `deploy-eliza-provisioning-worker.yml`
on every push.
- The actual eliza Cloud sandbox cores (data plane) — runtime autoscale.

These are tracked as TODOs in
[`../ARCHITECTURE.md`](../ARCHITECTURE.md#followups-not-in-this-initial-pr).

## Cost

| Component | Resource | Monthly (€) |
|------------------------------|-------------|-------------|
| 1× cpx21 (3 vCPU / 4 GB) | control VM | ~5 |
| 1× IPv4 + IPv6 | floating IP | included |
| Cloudflare R2 state | < 100 KB | 0 |
| **Total per environment** | | **~5** |

A second control-plane VM (for HA, currently unused) doubles the total. The
**data-plane autoscale** cost is separate and elastic.
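
For a rough combined estimate, the fixed and elastic parts can be added together. This is back-of-the-envelope arithmetic only, using the approximate prices quoted in these docs (~€5/mo per cpx21, ~€11/mo per running cpx32). The function name is hypothetical, and `avgRunningCores` is the average number of data-plane cores running over the month, which the autoscaler determines at runtime.

```typescript
const CONTROL_VM_EUR = 5; // ~€5/mo per cpx21 control-plane VM (approximate)
const DATA_CORE_EUR = 11; // ~€11/mo per running cpx32 core (approximate)

// Estimated monthly spend: fixed control plane plus elastic data plane.
function estimateMonthlyEur(controlVms: number, avgRunningCores: number): number {
  return controlVms * CONTROL_VM_EUR + avgRunningCores * DATA_CORE_EUR;
}
```

For example, one control-plane VM with an average of four running cores comes to roughly 1 × 5 + 4 × 11 = €49 per month.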
@@ -0,0 +1,9 @@
bucket = "eliza-terraform-state"
key = "hetzner/control-plane/production.tfstate"
region = "auto"
endpoints = { s3 = "https://23cf6feaeaa541f6a0675053c33da768.r2.cloudflarestorage.com" }
skip_credentials_validation = true
skip_metadata_api_check = true
skip_region_validation = true
skip_requesting_account_id = true
use_path_style = true
@@ -0,0 +1,20 @@
# Cloudflare R2 backend (S3-compatible) for terraform state.
#
# Set up once per environment:
# 1. Create R2 bucket `eliza-terraform-state` in the elizaOS CF account.
# 2. Generate R2 API token with read/write on that bucket.
# 3. Export AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY pointing at the
# R2 token before `terraform init`.
#
# Usage:
# terraform init -backend-config=backend-staging.hcl

bucket = "eliza-terraform-state"
key = "hetzner/control-plane/staging.tfstate"
region = "auto"
endpoints = { s3 = "https://23cf6feaeaa541f6a0675053c33da768.r2.cloudflarestorage.com" }
skip_credentials_validation = true
skip_metadata_api_check = true
skip_region_validation = true
skip_requesting_account_id = true
use_path_style = true
@@ -0,0 +1,85 @@
#cloud-config
# Cloud-init bootstrap for eliza control-plane VMs.
#
# At first boot:
# 1. Install system deps (curl, docker, git, jq, nginx, postgresql-client,
# rsync, unzip) plus a pinned, checksum-verified Bun runtime
# 2. Add a `deploy` user that the GitHub Actions workflow
# `.github/workflows/deploy-eliza-provisioning-worker.yml` SSHes into
# 3. Clone the monorepo on `${deploy_branch}` to /opt/eliza
# 4. Leave the systemd units for `eliza-provisioning-worker` and
# `eliza-agent-router` to that deploy workflow (not installed by this file)
#
# Secrets (DATABASE_URL, KV_REST_API_*, STEWARD_*, HCLOUD_TOKEN, etc.) are NOT
# baked in here. The operator copies `cloud/.env.local` to `/opt/eliza/cloud/.env.local`
# in a follow-up step (one-shot, out-of-band) using `scp` or the existing
# `packages/scripts/cloud/admin/bootstrap-provisioning-worker-host.mjs` script.

hostname: ${hostname}
manage_etc_hosts: true

users:
- name: deploy
groups: sudo, docker
shell: /bin/bash
sudo: ALL=(ALL) NOPASSWD:ALL
lock_passwd: true
# Mirror operator SSH keys onto the deploy user. Hetzner only injects
# `hcloud_ssh_key` entries into root by default, but the
# `.github/workflows/deploy-eliza-provisioning-worker.yml` workflow SSHes
# in as `deploy`. Without this list the first auto-deploy would fail.
ssh_authorized_keys:
%{ for key in operator_ssh_keys ~}
- ${key}
%{ endfor ~}

Comment on lines +20 to +34
P1 deploy user has no SSH authorized keys — GitHub Actions deploy workflow will be unable to connect

Hetzner's SSH key injection only populates root's ~/.ssh/authorized_keys. The deploy user is created with lock_passwd: true and no ssh_authorized_keys entry, making it unreachable via SSH. The README's deploy step triggers deploy-eliza-provisioning-worker.yml which presumably SSHes into this user — that will fail until keys are injected out-of-band.

package_update: true
package_upgrade: false

apt:
sources:
docker:
source: "deb [arch=amd64 signed-by=$KEY_FILE] https://download.docker.com/linux/ubuntu $RELEASE stable"
keyid: 9DC858229FC7DD38854AE2D88D81803C0EBFCD88

packages:
- curl
- docker-ce
- docker-ce-cli
- containerd.io
- git
- jq
- nginx
- postgresql-client
- rsync
- unzip

write_files:
- path: /etc/profile.d/eliza-paths.sh
permissions: "0644"
content: |
export PATH="/home/deploy/.bun/bin:$PATH"
export ELIZA_DEPLOY_BRANCH="${deploy_branch}"

runcmd:
# Docker is installed via the `apt` block above using the official
# Docker apt repo (GPG-verified keyring), not `curl get.docker.com | sh`.
- systemctl enable --now docker

# Bun runtime for the deploy user. We download a pinned release tarball
# AND its SHASUMS256 manifest from the same GitHub release, then verify
# the binary against the published hash before extracting. This avoids
# the `curl bun.sh/install | bash` supply-chain pattern flagged in the
# PR review — we still trust GitHub's HTTPS, but the manifest commits
# the publisher to a specific hash that an attacker can't forge mid-flight.
- install -d -o deploy -g deploy /home/deploy/.bun /home/deploy/.bun/bin
# The archive keeps its published name so `sha256sum -c` can find the file
# listed in the manifest; extraction is chained with `&&` so it only runs
# when the checksum matches.
- su - deploy -c 'curl -fsSL -o /tmp/bun-linux-x64.zip https://github.com/oven-sh/bun/releases/download/bun-v1.3.13/bun-linux-x64.zip'
- su - deploy -c 'curl -fsSL -o /tmp/bun-SHASUMS256.txt https://github.com/oven-sh/bun/releases/download/bun-v1.3.13/SHASUMS256.txt'
- su - deploy -c 'cd /tmp && grep "bun-linux-x64.zip" bun-SHASUMS256.txt | sha256sum -c - && unzip -q bun-linux-x64.zip && install -m 0755 bun-linux-x64/bun /home/deploy/.bun/bin/bun'
- su - deploy -c 'rm -rf /tmp/bun-linux-x64.zip /tmp/bun-linux-x64 /tmp/bun-SHASUMS256.txt'

# Repo checkout — the GitHub deploy workflow then takes over for code updates.
- mkdir -p /opt/eliza && chown deploy:deploy /opt/eliza
- su - deploy -c 'git clone --depth 200 --branch ${deploy_branch} https://github.com/elizaOS/eliza.git /opt/eliza'

# Final marker for the bootstrap-warn check that runs after this.
- echo "eliza-control-plane bootstrap finished $(date -u +%Y-%m-%dT%H:%M:%SZ)" > /var/log/eliza-bootstrap.log