124 changes: 124 additions & 0 deletions packages/cloud-infra/cloud/terraform/hetzner/ARCHITECTURE.md
@@ -0,0 +1,124 @@
# Hetzner Control Plane vs Data Plane

eliza Cloud runs on a two-tier Hetzner Cloud setup. This doc nails down the
split so we stop treating manually created VMs as "infrastructure-by-prayer".

## Layers

```
┌──────────────────────────────────────────────────────────────┐
│  Tier 1: Control plane (static, 1-2 VMs, Terraform)          │
│                                                              │
│  eliza-cp-production-1 (Hetzner cpx21, fsn1)                 │
│    ├── eliza-provisioning-worker (systemd, queue consumer)   │
│    ├── eliza-agent-router (systemd, HTTP routing)            │
│    ├── headscale (VPN mesh)                                  │
│    ├── cloudflared tunnel (public ingress)                   │
│    ├── nginx (reverse proxy)                                 │
│    └── (optional: grafana/prometheus)                        │
│                                                              │
│  Lifecycle: long-lived. Replaced on demand, not autoscaled.  │
│  Cost: ~€5/mo per VM (cpx21).                                │
└──────────────────────────────────────────────────────────────┘
                               │ enqueue / SSH
┌──────────────────────────────────────────────────────────────┐
│  Tier 2: Data plane (elastic, N cores, runtime autoscale)    │
│                                                              │
│  eliza-core-<hex> (Hetzner cpx32, fsn1)                      │
│    ├── Docker daemon                                         │
│    └── eliza-sandbox containers × N                          │
│                                                              │
│  Lifecycle: created/drained by node-autoscaler.ts based on   │
│  real demand. Server limit: ~25 (Hetzner default).           │
│  Cost: elastic (~€11/mo per running cpx32).                  │
└──────────────────────────────────────────────────────────────┘
```

## Why two tiers

| Concern | Control plane | Data plane |
|----------------------|------------------------|---------------------------|
| **Provisioning** | Terraform (one-shot) | Runtime API (node-autoscaler.ts) |
| **Lifecycle** | Persistent | Ephemeral |
| **State** | Has local state (headscale DB, cloudflared creds) | Stateless |
| **Failure mode** | Page someone | Replace automatically |
| **Cost predictability** | Fixed monthly | Elastic |
| **What lives here** | Orchestrator, routing, monitoring | Just Docker + agents |

The split prevents the "control plane melts with the data plane during a
traffic spike" failure mode. Pulling sandboxes off the data plane is the
autoscaler's job; the orchestrator that issues drain commands must stay up
to coordinate it.

## Code ↔ infrastructure mapping

| Component | Code | Infra |
|---|---|---|
| Control plane VM | [`packages/scripts/cloud/admin/daemons/provisioning-worker.ts`](../../../../scripts/cloud/admin/daemons/provisioning-worker.ts) | [Terraform: `control-plane/`](./control-plane/) |
| Agent router | [`packages/scripts/cloud/admin/daemons/agent-router.ts`](../../../../scripts/cloud/admin/daemons/agent-router.ts) | systemd unit on control-plane VM |
| Data plane autoscaler | [`packages/cloud-shared/src/lib/services/containers/node-autoscaler.ts`](../../../../cloud-shared/src/lib/services/containers/node-autoscaler.ts) | Hetzner Cloud API at runtime |
| Sandbox provisioning | [`packages/cloud-shared/src/lib/services/docker-sandbox-provider.ts`](../../../../cloud-shared/src/lib/services/docker-sandbox-provider.ts) | SSH from control plane to data plane |

## Naming convention

| Layer | Prefix | Example | Where it's set |
|---|---|---|---|
| Control plane VM | `eliza-cp-<env>-<n>` | `eliza-cp-production-1` | Terraform `var.environment` |
| Data plane node (NEW) | `eliza-core-<hex>` | `eliza-core-38ea87b1` | [`generateNodeId()`](../../../../cloud-shared/src/lib/services/containers/node-autoscaler.ts) |
| Data plane node (LEGACY) | `milady-core-<n>` | `milady-core-1` | DEPRECATED; see [Legacy migration](#legacy-milady-core--migration) |
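
Because the prefixes in the table above are distinct, a server name can be mapped back to its layer mechanically. The following is a minimal sketch, not code from this PR: `classify_node` is a hypothetical helper, and the patterns assume names follow the table exactly.

```bash
# Hypothetical helper (not in the repo): map a Hetzner server name to its
# layer using the prefixes from the naming-convention table.
classify_node() {
  case "$1" in
    eliza-cp-*-[0-9]*) echo "control-plane" ;;
    eliza-core-*)      echo "data-plane" ;;
    milady-core-*)     echo "data-plane-legacy" ;;
    *)                 echo "unknown" ;;
  esac
}

classify_node eliza-cp-production-1   # control-plane
classify_node eliza-core-38ea87b1     # data-plane
classify_node milady-core-1           # data-plane-legacy
```

A check like this could be useful when auditing `hcloud server list` output for servers that belong to neither tier.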

## Legacy `milady-core-*` migration

Before 2026-05 the data plane was 6 manually created `milady-core-*` VMs (Hetzner
cpx32 in fsn1) inserted by hand into `docker_nodes` with `capacity = 100`.
By 2026-05-22:

- All 6 cores were `status: offline` (SSH health-check failing for weeks)
- Several user sandboxes still ran on the underlying Docker daemons
- The cloud autoscaler couldn't account for them

Migration 0132 (`0132_legacy_milady_cores_disable.sql`) sets them to
`enabled = false` and corrects `capacity` to `8`. This:

1. Removes them from autoscaler capacity decisions
2. Stops the health-check noise
3. Lets the autoscaler spin up replacement `eliza-core-<hex>` nodes on demand

Existing sandboxes keep running until next restart. On user-triggered
restart / recreate, the daemon provisions them on a fresh autoscaled core.

Once `SELECT SUM(allocated_count) FROM docker_nodes WHERE node_id LIKE 'milady-core-%'`
is `0`, ops can:

```bash
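# 1. Delete the Hetzner Cloud servers (one-time, via the Hetzner console or the CLI):
hcloud server delete milady-core-1
hcloud server delete milady-core-2
# ... etc.

# 2. Drop the DB rows (this is SQL, so run it through psql; assumes
#    DATABASE_URL is exported in the current shell):
psql "$DATABASE_URL" -c "DELETE FROM docker_nodes WHERE node_id LIKE 'milady-core-%';"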
```
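
The precondition for the teardown (the `SUM(allocated_count)` query returning `0`) can be turned into an explicit guard so the delete steps are never run early. This is a sketch under assumptions: `safe_to_delete` is a hypothetical helper, and the commented usage assumes `DATABASE_URL` points at the cloud database.

```bash
# Hypothetical guard: succeed only when the allocated-sandbox sum is exactly 0.
# An empty value (failed query, or NULL because no rows matched) is treated as
# "not safe" so that an error never green-lights deletion.
safe_to_delete() {
  [ "$1" = "0" ]
}

# Assumed usage (-A: unaligned output, -t: tuples only):
# count=$(psql "$DATABASE_URL" -At -c \
#   "SELECT SUM(allocated_count) FROM docker_nodes WHERE node_id LIKE 'milady-core-%'")
# if safe_to_delete "$count"; then echo "ok to delete"; else echo "sandboxes still allocated"; fi
```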

## Followups (not in this initial PR)

- [ ] Terraform module for headscale state (preauth keys, ACLs)
- [ ] Terraform module for the cloudflared tunnel (currently created by hand)
- [ ] Terraform-apply GitHub workflow (`infra/**` path filter)
- [ ] Move the 4 remaining cron paths off the orphan
`container-control-plane` service onto the daemon-queue pattern
(`pool-replenish`, `pool-health-check`, `pool-image-rollout`,
`deployment-monitor`). Once done, retire the
`packages/cloud-services/container-control-plane/` package entirely.
- [ ] Raise Hetzner Cloud server limit (open ticket) to enable autoscale
past the default per-account server cap.

## Operator runbook

See [`control-plane/README.md`](./control-plane/README.md)
for the step-by-step:

- Bootstrap a brand-new control-plane VM
- Import the existing production VM into Terraform
- Verify state, plan, apply
@@ -0,0 +1,11 @@
.terraform/
.terraform.lock.hcl
*.tfstate
*.tfstate.backup
crash.log
crash.*.log

# Live tfvars contain SSH public keys + zone IDs — keep them out of git.
# Only `.tfvars.example` lives in the repo.
*.tfvars
!*.tfvars.example
@@ -0,0 +1,97 @@
# hetzner/control-plane — Terraform for the eliza Cloud control-plane VMs

This Terraform module manages the **persistent** Hetzner Cloud VM(s) that host
the eliza Cloud control plane:

- `eliza-provisioning-worker` — pulls jobs from the `jobs` table and SSHs
into sandbox cores
- `eliza-agent-router` — subdomain HTTP routing
- `cloudflared` — secure tunnel for `sandboxes.waifu.fun`
- `headscale` — VPN mesh for cross-core agent traffic

The **data plane** (the sandbox cores themselves) is **not** managed here —
those are provisioned and drained at runtime by
[`node-autoscaler.ts`](../../../../../cloud-shared/src/lib/services/containers/node-autoscaler.ts)
which talks to the Hetzner Cloud API directly. See
[`../ARCHITECTURE.md`](../ARCHITECTURE.md) for the full split.

## Prerequisites

1. **Hetzner Cloud project** with API token (`HCLOUD_TOKEN`).
2. **Cloudflare account** with API token + DNS edit on `elizacloud.ai`
(`CLOUDFLARE_API_TOKEN`).
3. **Cloudflare R2 bucket** `eliza-terraform-state` for remote state. Generate
an R2 API token, edit `backend-staging.hcl` / `backend-production.hcl`
with your CF account ID, then export the R2 token as
`AWS_ACCESS_KEY_ID` + `AWS_SECRET_ACCESS_KEY` before `terraform init`.
4. **Terraform >= 1.5.0** locally.
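
The prerequisites above boil down to a handful of environment variables that must be set before Terraform runs. A small preflight check, sketched here as an illustration rather than something shipped in this PR, can catch a missing one early. `missing_env` is a hypothetical helper name.

```bash
# Hypothetical preflight helper: print the names of unset or empty variables
# and return non-zero if any are missing.
missing_env() {
  missing=""
  for name in "$@"; do
    # Indirect lookup of the variable called $name; empty if unset.
    eval "value=\${$name:-}"
    [ -n "$value" ] || missing="$missing $name"
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
}

# Assumed usage before `terraform init` / `terraform plan`:
# missing_env HCLOUD_TOKEN CLOUDFLARE_API_TOKEN AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY
```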

## Bootstrap a brand-new control-plane VM (staging)

```bash
cd packages/cloud-infra/cloud/terraform/hetzner/control-plane

# 1. Pull providers + connect remote state.
terraform init -backend-config=backend-staging.hcl

# 2. Copy + fill tfvars.
cp tfvars/staging.tfvars.example tfvars/staging.tfvars
$EDITOR tfvars/staging.tfvars

# 3. Plan + apply.
export HCLOUD_TOKEN=...
export CLOUDFLARE_API_TOKEN=...
terraform plan -var-file=tfvars/staging.tfvars
terraform apply -var-file=tfvars/staging.tfvars

# 4. Output gives you the VM IP. From the repo root (not this module
#    directory), copy the cloud env file into place:
scp packages/cloud-shared/.env.local root@<vm-ip>:/opt/eliza/cloud/.env.local

# 5. Trigger first deploy from GitHub Actions
# (workflow: deploy-eliza-provisioning-worker.yml, manual dispatch).
```

## Adopt the existing production VM into Terraform

The current prod manager VM (`89.167.63.246`, hostname `milady`) was
created by hand via `packages/scripts/cloud/admin/bootstrap-provisioning-worker-host.mjs`
on May 7th. To bring it under Terraform without recreating it:

```bash
# 1. Look up the Hetzner Cloud server ID:
hcloud server list # find the one with IP 89.167.63.246

# 2. Import the existing resource into state:
terraform init -backend-config=backend-production.hcl
terraform import \
-var-file=tfvars/production.tfvars \
'hcloud_server.control_plane["1"]' \
<SERVER_ID>

# 3. Run `terraform plan` and adjust variables until the diff is empty.
# Common drift: labels, ssh_keys (manually added during initial setup).
```
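
Whether "the diff is empty" in step 3 can be checked by exit status rather than by eye: `terraform plan -detailed-exitcode` exits 0 when there are no changes, 1 on error, and 2 when changes are present. The wrapper below is only a sketch; `describe_plan_exit` is a hypothetical name.

```bash
# Hypothetical helper: interpret the exit status of
# `terraform plan -detailed-exitcode` (0 = no changes, 1 = error, 2 = changes).
describe_plan_exit() {
  case "$1" in
    0) echo "clean: state matches the real VM" ;;
    2) echo "drift: adjust variables and re-run plan" ;;
    *) echo "error: plan failed" ;;
  esac
}

# Assumed usage after the import:
# terraform plan -detailed-exitcode -var-file=tfvars/production.tfvars >/dev/null
# describe_plan_exit "$?"
```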

## What this module does NOT manage (yet)

- Headscale state (preauth keys, ACLs) — manual via `headscale` CLI.
- Cloudflared tunnels — config lives at `/root/.cloudflared/` on the VM and
is created via `cloudflared tunnel create` one-shot.
- The systemd units — installed by `deploy-eliza-provisioning-worker.yml`
on every push.
- The actual eliza Cloud sandbox cores (data plane) — runtime autoscale.

These are tracked as TODOs in
[`../ARCHITECTURE.md`](../ARCHITECTURE.md#followups-not-in-this-initial-pr).

## Cost

| Component | Resource | Monthly (€) |
|------------------------------|-------------|-------------|
| 1× cpx21 (3 vCPU / 4 GB) | control VM | ~5 |
| 1× IPv4 + IPv6 | primary IP | included |
| Cloudflare R2 state | < 100 KB | 0 |
| **Total per environment** | | **~5** |

A 2nd control-plane VM (HA, currently unused) doubles the line. The
**data-plane autoscale** cost is separate and elastic.
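
For a rough combined figure, the fixed and elastic parts can be added up. The sketch below uses only the approximate prices quoted in this PR (~€5 per cpx21 control VM, ~€11 per running cpx32 core); `estimate_monthly_eur` is a hypothetical helper and the result is an estimate, not a quote.

```bash
# Hypothetical estimator: whole euros per month, from the approximate
# per-VM figures in this PR (5 per control VM, 11 per running data-plane core).
estimate_monthly_eur() {
  control_vms=$1
  data_cores=$2
  echo $(( control_vms * 5 + data_cores * 11 ))
}

estimate_monthly_eur 1 4   # 49
```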
@@ -0,0 +1,9 @@
bucket = "eliza-terraform-state"
key = "hetzner/control-plane/production.tfstate"
region = "auto"
endpoints = { s3 = "https://23cf6feaeaa541f6a0675053c33da768.r2.cloudflarestorage.com" }
skip_credentials_validation = true
skip_metadata_api_check = true
skip_region_validation = true
skip_requesting_account_id = true
use_path_style = true
@@ -0,0 +1,20 @@
# Cloudflare R2 backend (S3-compatible) for terraform state.
#
# Set up once per environment:
# 1. Create R2 bucket `eliza-terraform-state` in the elizaOS CF account.
# 2. Generate R2 API token with read/write on that bucket.
# 3. Export AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY pointing at the
# R2 token before `terraform init`.
#
# Usage:
# terraform init -backend-config=backend-staging.hcl

bucket = "eliza-terraform-state"
key = "hetzner/control-plane/staging.tfstate"
region = "auto"
endpoints = { s3 = "https://23cf6feaeaa541f6a0675053c33da768.r2.cloudflarestorage.com" }
skip_credentials_validation = true
skip_metadata_api_check = true
skip_region_validation = true
skip_requesting_account_id = true
use_path_style = true
@@ -0,0 +1,60 @@
#cloud-config
# Cloud-init bootstrap for eliza control-plane VMs.
#
# At first boot:
# 1. Install system deps (docker, nginx, bun, git, jq, postgresql-client, rsync)
# 2. Add a `deploy` user that the GitHub Actions workflow
# `.github/workflows/deploy-eliza-provisioning-worker.yml` SSHes into
# 3. Clone the monorepo on `${deploy_branch}` to /opt/eliza
#
# The systemd units for `eliza-provisioning-worker` and `eliza-agent-router`
# are not installed here; the deploy workflow installs them on every push.
#
# Secrets (DATABASE_URL, KV_REST_API_*, STEWARD_*, HCLOUD_TOKEN, etc.) are NOT
# baked in here. The operator copies `cloud/.env.local` to `/opt/eliza/cloud/.env.local`
# in a follow-up step (one-shot, out-of-band) using `scp` or the existing
# `packages/scripts/cloud/admin/bootstrap-provisioning-worker-host.mjs` script.

hostname: ${hostname}
manage_etc_hosts: true

users:
  - name: deploy
    groups: sudo, docker
    shell: /bin/bash
    sudo: ALL=(ALL) NOPASSWD:ALL
    lock_passwd: true

Comment on lines +20 to +34

P1 deploy user has no SSH authorized keys — GitHub Actions deploy workflow will be unable to connect

Hetzner's SSH key injection only populates root's ~/.ssh/authorized_keys. The deploy user is created with lock_passwd: true and no ssh_authorized_keys entry, making it unreachable via SSH. The README's deploy step triggers deploy-eliza-provisioning-worker.yml which presumably SSHes into this user — that will fail until keys are injected out-of-band.

package_update: true
package_upgrade: false

packages:
  - curl
  - git
  - jq
  - nginx
  - postgresql-client
  - rsync
  - unzip

write_files:
  - path: /etc/profile.d/eliza-paths.sh
    permissions: "0644"
    content: |
      export PATH="/home/deploy/.bun/bin:$PATH"
      export ELIZA_DEPLOY_BRANCH="${deploy_branch}"

runcmd:
  # Docker (official convenience installer; pin sha if you want it
  # auditable; we keep it simple for the bootstrap).
  - curl -fsSL https://get.docker.com | sh
  - systemctl enable --now docker

  # Bun runtime for the deploy user (the daemons run under bun/tsx).
  - su - deploy -c 'curl -fsSL https://bun.sh/install | bash -s "bun-v1.3.13"'

P2 security Unverified curl | sh installs on the control-plane VM

Both Docker and Bun installs pipe remote scripts into a shell without checksum verification. This VM holds DATABASE_URL, HCLOUD_TOKEN, Headscale state, and the cloudflared tunnel — a higher-value target than a data-plane node. A supply-chain or MITM attack at bootstrap time would silently compromise the entire control plane.
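
One way to address this, sketched under assumptions rather than proposed as the exact fix: download each installer to a file, compare its SHA-256 against a value pinned in the template, and only then execute it. `run_if_checksum_matches` is a hypothetical helper, and the pinned hash is deliberately left as a placeholder because it has to come from a reviewed copy of the script.

```bash
# Hypothetical helper: run a downloaded installer only if its SHA-256 matches
# a pinned value. Requires sha256sum (GNU coreutils).
run_if_checksum_matches() {
  file=$1
  expected=$2
  actual=$(sha256sum "$file" | cut -d' ' -f1)
  if [ "$actual" != "$expected" ]; then
    echo "checksum mismatch for $file" >&2
    return 1
  fi
  sh "$file"
}

# Assumed usage in runcmd (placeholder hash, not a real value):
# curl -fsSL https://get.docker.com -o /tmp/get-docker.sh
# run_if_checksum_matches /tmp/get-docker.sh "<pinned-sha256>"
```

The trade-off is that the pinned hash must be bumped whenever the upstream installer changes, which is the point: the change becomes a reviewed commit.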


  # Repo checkout; the GitHub deploy workflow then takes over for code updates.
  - mkdir -p /opt/eliza && chown deploy:deploy /opt/eliza
  - su - deploy -c 'git clone --depth 200 --branch ${deploy_branch} https://github.com/elizaOS/eliza.git /opt/eliza'

  # Final marker for the bootstrap-warn check that runs after this.
  - echo "eliza-control-plane bootstrap finished $(date -u +%Y-%m-%dT%H:%M:%SZ)" > /var/log/eliza-bootstrap.log
57 changes: 57 additions & 0 deletions packages/cloud-infra/cloud/terraform/hetzner/control-plane/main.tf
@@ -0,0 +1,57 @@
locals {
  # Tags applied to every Hetzner Cloud resource managed here. Mirrors the
  # data-plane convention (`managed-by: eliza-cloud`) used by the runtime
  # autoscaler so a single search in the Hetzner Console reveals everything.
  common_labels = {
    "managed-by"  = "eliza-cloud"
    "tier"        = "control-plane"
    "environment" = var.environment
  }
}

resource "hcloud_ssh_key" "operators" {
  for_each = { for idx, key in var.ssh_public_keys : idx => key }

P2 Positional list indexing causes unnecessary key churn on reorder/insert

{ for idx, key in var.ssh_public_keys : idx => key } maps list position to the Hetzner SSH key resource address. Inserting a key before the last position shifts every subsequent key's each.key, causing Terraform to plan renames or destroy+recreates of downstream SSH key objects.


  name       = "eliza-cp-${var.environment}-op-${each.key}"
  public_key = each.value
  labels     = local.common_labels
}

resource "hcloud_server" "control_plane" {
  for_each = toset([for i in range(var.control_plane_count) : tostring(i + 1)])

  name        = "eliza-cp-${var.environment}-${each.value}"
  location    = var.hcloud_location
  server_type = var.hcloud_server_type
  image       = var.hcloud_image
  ssh_keys    = [for k in hcloud_ssh_key.operators : k.id]
  labels = merge(local.common_labels, {
    "control-plane-index" = each.value
  })

  user_data = templatefile("${path.module}/cloud-init/bootstrap.yaml.tftpl", {
    hostname      = "eliza-cp-${var.environment}-${each.value}"
    deploy_branch = var.deploy_branch
  })

  # Keep server alive across refactors: changing labels or user_data
  # shouldn't recreate the box, only update in place where possible.
  lifecycle {
    ignore_changes = [
      user_data, # bootstrap runs once at first boot
      image,     # updating image rebuilds; explicit `terraform taint` to opt in
    ]
  }
}

resource "cloudflare_dns_record" "control_plane" {
  for_each = hcloud_server.control_plane

  zone_id = var.cloudflare_zone_id
  name    = "${var.control_plane_hostname_prefix}-${var.environment}-${each.key}.elizacloud.ai"
  type    = "A"
  content = each.value.ipv4_address
  ttl     = 60
  proxied = false
  comment = "eliza control-plane VM ${each.value.name} (managed by terraform/hetzner/control-plane)"
}