feat(cloud): Hetzner control plane IaC + data plane naming + legacy milady-core deprecation #7890
Changes from 1 commit
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,124 @@ | ||
| # Hetzner Control Plane vs Data Plane | ||
|
|
||
| eliza Cloud runs on a two-tier Hetzner Cloud setup. This doc nails down the | ||
| split so we stop treating manually-created VMs as "infrastructure-by-prayer". | ||
|
|
||
| ## Layers | ||
|
|
||
| ``` | ||
| ┌──────────────────────────────────────────────────────────────┐ | ||
| │ Tier 1 — Control plane (static, 1-2 VMs, Terraform) │ | ||
| │ │ | ||
| │ eliza-cp-production-1 (Hetzner cpx21, fsn1) │ | ||
| │ ├── eliza-provisioning-worker (systemd, queue consumer)│ | ||
| │ ├── eliza-agent-router (systemd, HTTP routing) │ | ||
| │ ├── headscale (VPN mesh) │ | ||
| │ ├── cloudflared tunnel (public ingress) │ | ||
| │ ├── nginx (reverse proxy) │ | ||
| │ └── (optional: grafana/prometheus) │ | ||
| │ │ | ||
| │ Lifecycle: long-lived. Replaced on demand, not autoscaled.│ | ||
| │ Cost: ~€5/mo per VM (cpx21). │ | ||
| └──────────────────────────────────────────────────────────────┘ | ||
| │ enqueue / SSH | ||
| ▼ | ||
| ┌──────────────────────────────────────────────────────────────┐ | ||
| │ Tier 2 — Data plane (elastic, N cores, runtime autoscale) │ | ||
| │ │ | ||
| │ eliza-core-<hex> (Hetzner cpx32, fsn1) │ | ||
| │ ├── Docker daemon │ | ||
| │ └── eliza-sandbox containers × N │ | ||
| │ │ | ||
| │ Lifecycle: created/drained by node-autoscaler.ts based on │ | ||
| │ real demand. Server limit: ~25 (Hetzner default). │ | ||
| │ Cost: elastic (~€11/mo per running cpx32). │ | ||
| └──────────────────────────────────────────────────────────────┘ | ||
| ``` | ||
|
|
||
| ## Why two tiers | ||
|
|
||
| | Concern | Control plane | Data plane | | ||
| |----------------------|------------------------|---------------------------| | ||
| | **Provisioning** | Terraform (one-shot) | Runtime API (node-autoscaler.ts) | | ||
| | **Lifecycle** | Persistent | Ephemeral | | ||
| | **State** | Has local state (headscale DB, cloudflared creds) | Stateless | | ||
| | **Failure mode** | Page someone | Replace automatically | | ||
| | **Cost predictability** | Fixed monthly | Elastic | | ||
| | **What lives here** | Orchestrator, routing, monitoring | Just Docker + agents | | ||
|
|
||
| The split prevents the "control plane melts with the data plane during a | ||
| traffic spike" failure mode. Pulling sandboxes off the data plane is the | ||
| autoscaler's job; the orchestrator that issues drain commands must stay up | ||
| to coordinate it. | ||
|
|
||
| ## Code ↔ infrastructure mapping | ||
|
|
||
| | Component | Code | Infra | | ||
| |---|---|---| | ||
| | Control plane VM | [`packages/scripts/cloud/admin/daemons/provisioning-worker.ts`](../../../../scripts/cloud/admin/daemons/provisioning-worker.ts) | [Terraform: `control-plane/`](./control-plane/) | | ||
| | Agent router | [`packages/scripts/cloud/admin/daemons/agent-router.ts`](../../../../scripts/cloud/admin/daemons/agent-router.ts) | systemd unit on control-plane VM | | ||
| | Data plane autoscaler | [`packages/cloud-shared/src/lib/services/containers/node-autoscaler.ts`](../../../../cloud-shared/src/lib/services/containers/node-autoscaler.ts) | Hetzner Cloud API at runtime | | ||
| | Sandbox provisioning | [`packages/cloud-shared/src/lib/services/docker-sandbox-provider.ts`](../../../../cloud-shared/src/lib/services/docker-sandbox-provider.ts) | SSH from control plane to data plane | | ||
|
|
||
| ## Naming convention | ||
|
|
||
| | Layer | Prefix | Example | Where it's set | | ||
| |---|---|---|---| | ||
| | Control plane VM | `eliza-cp-<env>-<n>` | `eliza-cp-production-1` | Terraform `var.environment` | | ||
| | Data plane node (NEW) | `eliza-core-<hex>` | `eliza-core-38ea87b1` | [`generateNodeId()`](../../../../cloud-shared/src/lib/services/containers/node-autoscaler.ts) | | ||
| | Data plane node (LEGACY) | `milady-core-<n>` | `milady-core-1` | DEPRECATED — see [Legacy migration](#legacy-milady-core-migration) | | ||
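The prefixes in the naming table are enough to tell the tiers apart from a server name alone. A minimal sketch, assuming the three patterns above are exhaustive; `classify_node` is a hypothetical helper and is not part of the repo:

```shell
# Classify a Hetzner server name by tier, based on the naming table.
# classify_node is a hypothetical helper for illustration only.
classify_node() {
  case "$1" in
    eliza-cp-*-[0-9]*) echo "control-plane" ;;      # eliza-cp-<env>-<n>
    eliza-core-*)      echo "data-plane" ;;         # eliza-core-<hex>
    milady-core-*)     echo "legacy-data-plane" ;;  # deprecated
    *)                 echo "unknown" ;;
  esac
}

# Usage sketch (not executed here):
#   hcloud server list -o noheader -o columns=name |
#     while read -r name; do echo "$name $(classify_node "$name")"; done
```

Anything reported as `unknown` would be a server created outside both Terraform and the autoscaler, which is the situation this document is trying to eliminate.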
|
|
||
| ## Legacy `milady-core-*` migration | ||
|
|
||
| Pre-2026-05 the data plane was 6 manually-created `milady-core-*` VMs (Hetzner | ||
| cpx32 in fsn1) inserted by-hand into `docker_nodes` with `capacity = 100`. | ||
| By 2026-05-22: | ||
|
|
||
| - All 6 cores were `status: offline` (SSH health-check failing for weeks) | ||
| - Several user sandboxes still ran on the underlying Docker daemons | ||
| - The cloud autoscaler couldn't account for them | ||
|
|
||
| Migration 0132 (`0132_legacy_milady_cores_disable.sql`) sets them to | ||
| `enabled = false` and corrects `capacity` to `8`. This: | ||
|
|
||
| 1. Removes them from autoscaler capacity decisions | ||
| 2. Stops the health-check noise | ||
| 3. Lets the autoscaler spin up replacement `eliza-core-<hex>` nodes on demand | ||
|
|
||
| Existing sandboxes keep running until next restart. On user-triggered | ||
| restart / recreate, the daemon provisions them on a fresh autoscaled core. | ||
|
|
||
| Once `SELECT SUM(allocated_count) FROM docker_nodes WHERE node_id LIKE 'milady-core-%'` | ||
| is `0`, ops can: | ||
|
|
||
| ```bash | ||
| # 1. Delete the Hetzner Cloud servers (one-time, via Hetzner console or): | ||
| hcloud server delete milady-core-1 | ||
| hcloud server delete milady-core-2 | ||
| # ... etc. | ||
|
|
||
| # 2. Drop the DB rows (SQL, run here through psql against the cloud database): | ||
| psql "$DATABASE_URL" -c "DELETE FROM docker_nodes WHERE node_id LIKE 'milady-core-%';" | ||
| ``` | ||
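The cleanup is only safe once the allocation sum is exactly `0`, so it can help to make that check explicit before deleting anything. A minimal sketch, assuming `DATABASE_URL` is exported and `psql` is installed; `legacy_cleanup_ready` is a hypothetical helper name:

```shell
# legacy_cleanup_ready RAW_SUM
# Succeeds only when the query result is exactly 0. Empty output is treated
# as "not ready", so a failed query never unlocks the delete.
legacy_cleanup_ready() {
  _allocated="$(printf '%s' "$1" | tr -d '[:space:]')"
  [ "$_allocated" = "0" ]
}

# Usage sketch (not executed here). COALESCE covers the case where no
# milady-core rows remain, because SUM over zero rows is NULL:
#   sum="$(psql "$DATABASE_URL" -At -c "SELECT COALESCE(SUM(allocated_count), 0) \
#     FROM docker_nodes WHERE node_id LIKE 'milady-core-%'")"
#   legacy_cleanup_ready "$sum" && hcloud server delete milady-core-1
```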
|
|
||
| ## Followups (not in this initial PR) | ||
|
|
||
| - [ ] Terraform module for headscale state (preauth keys, ACLs) | ||
| - [ ] Terraform module for the cloudflared tunnel (currently created by-hand) | ||
| - [ ] Terraform-apply GitHub workflow (`infra/**` path filter) | ||
| - [ ] Move the 4 remaining cron paths off the orphan | ||
| `container-control-plane` service onto the daemon-queue pattern | ||
| (`pool-replenish`, `pool-health-check`, `pool-image-rollout`, | ||
| `deployment-monitor`). Once done, retire the | ||
| `packages/cloud-services/container-control-plane/` package entirely. | ||
| - [ ] Raise Hetzner Cloud server limit (open ticket) to enable autoscale | ||
| past the default cap of ~10 servers per account. | ||
|
|
||
| ## Operator runbook | ||
|
|
||
| See [`control-plane/README.md`](./control-plane/README.md) | ||
| for the step-by-step: | ||
|
|
||
| - Bootstrap a brand-new control-plane VM | ||
| - Import the existing production VM into Terraform | ||
| - Verify state, plan, apply |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| .terraform/ | ||
| .terraform.lock.hcl | ||
| *.tfstate | ||
| *.tfstate.backup | ||
| crash.log | ||
| crash.*.log | ||
|
|
||
| # Live tfvars contain SSH public keys + zone IDs — keep them out of git. | ||
| # Only `.tfvars.example` lives in the repo. | ||
| *.tfvars | ||
| !*.tfvars.example |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,97 @@ | ||
| # hetzner/control-plane — Terraform for the eliza Cloud control-plane VMs | ||
|
|
||
| This Terraform module manages the **persistent** Hetzner Cloud VM(s) that host | ||
| the elizaOS Cloud control-plane: | ||
|
|
||
| - `eliza-provisioning-worker` — pulls jobs from the `jobs` table and SSHs | ||
| into sandbox cores | ||
| - `eliza-agent-router` — subdomain HTTP routing | ||
| - `cloudflared` — secure tunnel for `sandboxes.waifu.fun` | ||
| - `headscale` — VPN mesh for cross-core agent traffic | ||
|
|
||
| The **data plane** (the sandbox cores themselves) is **not** managed here — | ||
| those are provisioned and drained at runtime by | ||
| [`node-autoscaler.ts`](../../../../../cloud-shared/src/lib/services/containers/node-autoscaler.ts) | ||
| which talks to the Hetzner Cloud API directly. See | ||
| [`../ARCHITECTURE.md`](../ARCHITECTURE.md) for the full split. | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| 1. **Hetzner Cloud project** with API token (`HCLOUD_TOKEN`). | ||
| 2. **Cloudflare account** with API token + DNS edit on `elizacloud.ai` | ||
| (`CLOUDFLARE_API_TOKEN`). | ||
| 3. **Cloudflare R2 bucket** `eliza-terraform-state` for remote state. Generate | ||
| an R2 API token, edit `backend-staging.hcl` / `backend-production.hcl` | ||
| with your CF account ID, then export the R2 token as | ||
| `AWS_ACCESS_KEY_ID` + `AWS_SECRET_ACCESS_KEY` before `terraform init`. | ||
| 4. **Terraform >= 1.5.0** locally. | ||
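Because the R2 token is passed through the AWS-style variables, a missing export tends to surface only as an opaque backend error during `terraform init`. A small pre-flight check, sketched under the assumption that the four variables named in this README are the full set; `require_env` is a hypothetical helper:

```shell
# require_env NAME...
# Returns non-zero and names each variable that is unset or empty.
require_env() {
  _missing=0
  for _name in "$@"; do
    eval "_value=\${$_name:-}"
    if [ -z "$_value" ]; then
      echo "missing required env var: $_name" >&2
      _missing=1
    fi
  done
  return "$_missing"
}

# Usage sketch (not executed here):
#   require_env AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY HCLOUD_TOKEN CLOUDFLARE_API_TOKEN \
#     && terraform init -backend-config=backend-staging.hcl
```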
|
|
||
| ## Bootstrap a brand-new control-plane VM (staging) | ||
|
|
||
| ```bash | ||
| cd packages/cloud-infra/cloud/terraform/hetzner/control-plane | ||
|
|
||
| # 1. Pull providers + connect remote state. | ||
| terraform init -backend-config=backend-staging.hcl | ||
|
|
||
| # 2. Copy + fill tfvars. | ||
| cp tfvars/staging.tfvars.example tfvars/staging.tfvars | ||
| $EDITOR tfvars/staging.tfvars | ||
|
|
||
| # 3. Plan + apply. | ||
| export HCLOUD_TOKEN=... | ||
| export CLOUDFLARE_API_TOKEN=... | ||
| terraform plan -var-file=tfvars/staging.tfvars | ||
| terraform apply -var-file=tfvars/staging.tfvars | ||
|
|
||
| # 4. Output gives you the VM IP. Copy the cloud env file into place: | ||
| scp packages/cloud-shared/.env.local root@<vm-ip>:/opt/eliza/cloud/.env.local | ||
|
|
||
| # 5. Trigger first deploy from GitHub Actions | ||
| # (workflow: deploy-eliza-provisioning-worker.yml, manual dispatch). | ||
| ``` | ||
|
|
||
| ## Adopt the existing production VM into Terraform | ||
|
|
||
| The current prod manager VM (`89.167.63.246`, hostname `milady`) was | ||
| created by-hand via `packages/scripts/cloud/admin/bootstrap-provisioning-worker-host.mjs` | ||
| on May 7th. To bring it under Terraform without recreating it: | ||
|
|
||
| ```bash | ||
| # 1. Look up the Hetzner Cloud server ID: | ||
| hcloud server list # find the one with IP 89.167.63.246 | ||
|
|
||
| # 2. Import the existing resource into state: | ||
| terraform init -backend-config=backend-production.hcl | ||
| terraform import \ | ||
| -var-file=tfvars/production.tfvars \ | ||
| 'hcloud_server.control_plane["1"]' \ | ||
| <SERVER_ID> | ||
|
|
||
| # 3. Run `terraform plan` and adjust variables until the diff is empty. | ||
| # Common drift: labels, ssh_keys (manually added during initial setup). | ||
| ``` | ||
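Step 1 of the import can be scripted rather than read off the table by eye. A minimal sketch that picks the server ID for a given IPv4 out of an `ID IPV4` listing; `server_id_for_ip` is a hypothetical helper, and the `hcloud` column names in the usage line are an assumption to check against your CLI version:

```shell
# server_id_for_ip IPV4
# Reads "ID IPV4" pairs on stdin and prints the ID whose address matches.
# Exits non-zero when no row matches.
server_id_for_ip() {
  awk -v ip="$1" '$2 == ip { print $1; found = 1 } END { exit (found ? 0 : 1) }'
}

# Usage sketch (not executed here):
#   SERVER_ID="$(hcloud server list -o noheader -o columns=id,ipv4 \
#     | server_id_for_ip 89.167.63.246)"
```

Failing loudly on a missing match avoids running `terraform import` with an empty ID.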
|
|
||
| ## What this module does NOT manage (yet) | ||
|
|
||
| - Headscale state (preauth keys, ACLs) — manual via `headscale` CLI. | ||
| - Cloudflared tunnels — config lives at `/root/.cloudflared/` on the VM and | ||
| is created via `cloudflared tunnel create` one-shot. | ||
| - The systemd units — installed by `deploy-eliza-provisioning-worker.yml` | ||
| on every push. | ||
| - The actual eliza Cloud sandbox cores (data plane) — runtime autoscale. | ||
|
|
||
| These are tracked as TODOs in | ||
| [`../ARCHITECTURE.md`](../ARCHITECTURE.md#followups). | ||
|
|
||
| ## Cost | ||
|
|
||
| | Component | Resource | Monthly (€) | | ||
| |------------------------------|-------------|-------------| | ||
| | 1× cpx21 (3 vCPU / 4 GB) | control VM | ~5 | | ||
| | 1× IPv4 + IPv6 | floating IP | included | | ||
| | Cloudflare R2 state | < 100 KB | 0 | | ||
| | **Total per environment** | | **~5** | | ||
|
|
||
| A second control-plane VM (for HA, currently unused) doubles the VM line item. The | ||
| **data-plane autoscale** cost is separate and elastic. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,9 @@ | ||
| bucket = "eliza-terraform-state" | ||
| key = "hetzner/control-plane/production.tfstate" | ||
| region = "auto" | ||
| endpoints = { s3 = "https://23cf6feaeaa541f6a0675053c33da768.r2.cloudflarestorage.com" } | ||
| skip_credentials_validation = true | ||
| skip_metadata_api_check = true | ||
| skip_region_validation = true | ||
| skip_requesting_account_id = true | ||
| use_path_style = true |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,20 @@ | ||
| # Cloudflare R2 backend (S3-compatible) for terraform state. | ||
| # | ||
| # Set up once per environment: | ||
| # 1. Create R2 bucket `eliza-terraform-state` in the elizaOS CF account. | ||
| # 2. Generate R2 API token with read/write on that bucket. | ||
| # 3. Export AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY pointing at the | ||
| # R2 token before `terraform init`. | ||
| # | ||
| # Usage: | ||
| # terraform init -backend-config=backend-staging.hcl | ||
|
|
||
| bucket = "eliza-terraform-state" | ||
| key = "hetzner/control-plane/staging.tfstate" | ||
| region = "auto" | ||
| endpoints = { s3 = "https://23cf6feaeaa541f6a0675053c33da768.r2.cloudflarestorage.com" } | ||
| skip_credentials_validation = true | ||
| skip_metadata_api_check = true | ||
| skip_region_validation = true | ||
| skip_requesting_account_id = true | ||
| use_path_style = true |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,60 @@ | ||
| #cloud-config | ||
| # Cloud-init bootstrap for eliza control-plane VMs. | ||
| # | ||
| # At first boot: | ||
| # 1. Install system deps (docker, nginx, node, jq, postgresql-client, ssh) | ||
| # 2. Add a `deploy` user that the GitHub Actions workflow | ||
| # `.github/workflows/deploy-eliza-provisioning-worker.yml` SSHes into | ||
| # 3. Clone the monorepo on `${deploy_branch}` to /opt/eliza | ||
| # 4. Install systemd units for `eliza-provisioning-worker` and | ||
| # `eliza-agent-router` | ||
| # | ||
| # Secrets (DATABASE_URL, KV_REST_API_*, STEWARD_*, HCLOUD_TOKEN, etc.) are NOT | ||
| # baked in here. The operator copies `cloud/.env.local` to `/opt/eliza/cloud/.env.local` | ||
| # in a follow-up step (one-shot, out-of-band) using `scp` or the existing | ||
| # `packages/scripts/cloud/admin/bootstrap-provisioning-worker-host.mjs` script. | ||
|
|
||
| hostname: ${hostname} | ||
| manage_etc_hosts: true | ||
|
|
||
| users: | ||
| - name: deploy | ||
| groups: sudo, docker | ||
| shell: /bin/bash | ||
| sudo: ALL=(ALL) NOPASSWD:ALL | ||
| lock_passwd: true | ||
|
|
||
| package_update: true | ||
| package_upgrade: false | ||
|
|
||
| packages: | ||
| - curl | ||
| - git | ||
| - jq | ||
| - nginx | ||
| - postgresql-client | ||
| - rsync | ||
| - unzip | ||
|
|
||
| write_files: | ||
| - path: /etc/profile.d/eliza-paths.sh | ||
| permissions: "0644" | ||
| content: | | ||
| export PATH="/home/deploy/.bun/bin:$PATH" | ||
| export ELIZA_DEPLOY_BRANCH="${deploy_branch}" | ||
|
|
||
| runcmd: | ||
| # Docker (official convenience installer — pin sha if you want it | ||
| # auditable; we keep it simple for the bootstrap). | ||
| - curl -fsSL https://get.docker.com | sh | ||
| - systemctl enable --now docker | ||
|
|
||
| # Bun runtime for the deploy user (the daemons run under bun/tsx). | ||
| - su - deploy -c 'curl -fsSL https://bun.sh/install | bash -s "bun-v1.3.13"' | ||
|
Contributor
Both Docker and Bun installs pipe remote scripts into a shell without checksum verification. This VM holds |
||
|
|
||
| # Repo checkout — the GitHub deploy workflow then takes over for code updates. | ||
| - mkdir -p /opt/eliza && chown deploy:deploy /opt/eliza | ||
| - su - deploy -c 'git clone --depth 200 --branch ${deploy_branch} https://github.com/elizaOS/eliza.git /opt/eliza' | ||
|
|
||
| # Final marker for the bootstrap-warn check that runs after this. | ||
| - echo "eliza-control-plane bootstrap finished $(date -u +%Y-%m-%dT%H:%M:%SZ)" > /var/log/eliza-bootstrap.log | ||
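The `curl ... | sh` steps in `runcmd` execute whatever the remote host serves at boot. One way to make them auditable is to download the installer to disk, compare it against a pinned digest, and only then run it. A minimal sketch, assuming `sha256sum` is available on the image; `verify_sha256` is a hypothetical helper and `INSTALLER_SHA256` is a placeholder the operator has to pin, not a real value:

```shell
# verify_sha256 FILE EXPECTED_HEX
# Succeeds only when FILE's SHA-256 digest equals EXPECTED_HEX (lowercase hex).
verify_sha256() {
  _actual="$(sha256sum "$1" | awk '{ print $1 }')"
  [ "$_actual" = "$2" ]
}

# Usage sketch (not executed here). INSTALLER_SHA256 is a placeholder:
#   curl -fsSL https://get.docker.com -o /tmp/get-docker.sh
#   verify_sha256 /tmp/get-docker.sh "$INSTALLER_SHA256" && sh /tmp/get-docker.sh
```

The trade-off is that a pinned digest must be bumped whenever the upstream installer changes, so the bootstrap fails closed instead of silently running new code.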
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,57 @@ | ||
| locals { | ||
| # Tags applied to every Hetzner Cloud resource managed here. Mirrors the | ||
| # data-plane convention (`managed-by: eliza-cloud`) used by the runtime | ||
| # autoscaler so a single search in the Hetzner Console reveals everything. | ||
| common_labels = { | ||
| "managed-by" = "eliza-cloud" | ||
| "tier" = "control-plane" | ||
| "environment" = var.environment | ||
| } | ||
| } | ||
|
|
||
| resource "hcloud_ssh_key" "operators" { | ||
| for_each = { for idx, key in var.ssh_public_keys : idx => key } | ||
|
|
||
|
|
||
| name = "eliza-cp-${var.environment}-op-${each.key}" | ||
| public_key = each.value | ||
| labels = local.common_labels | ||
| } | ||
|
|
||
| resource "hcloud_server" "control_plane" { | ||
| for_each = toset([for i in range(var.control_plane_count) : tostring(i + 1)]) | ||
|
|
||
| name = "eliza-cp-${var.environment}-${each.value}" | ||
| location = var.hcloud_location | ||
| server_type = var.hcloud_server_type | ||
| image = var.hcloud_image | ||
| ssh_keys = [for k in hcloud_ssh_key.operators : k.id] | ||
| labels = merge(local.common_labels, { | ||
| "control-plane-index" = each.value | ||
| }) | ||
|
|
||
| user_data = templatefile("${path.module}/cloud-init/bootstrap.yaml.tftpl", { | ||
| hostname = "eliza-cp-${var.environment}-${each.value}" | ||
| deploy_branch = var.deploy_branch | ||
| }) | ||
|
|
||
| # Keep server alive across refactors: label changes apply in place, and | ||
| # user_data and image changes are ignored below so they never recreate the box. | ||
| lifecycle { | ||
| ignore_changes = [ | ||
| user_data, # bootstrap runs once at first boot | ||
| image, # updating image rebuilds — explicit `terraform taint` to opt in | ||
| ] | ||
| } | ||
| } | ||
|
|
||
| resource "cloudflare_dns_record" "control_plane" { | ||
| for_each = hcloud_server.control_plane | ||
|
|
||
| zone_id = var.cloudflare_zone_id | ||
| name = "${var.control_plane_hostname_prefix}-${var.environment}-${each.key}.elizacloud.ai" | ||
| type = "A" | ||
| content = each.value.ipv4_address | ||
| ttl = 60 | ||
| proxied = false | ||
| comment = "eliza control-plane VM ${each.value.name} (managed by terraform/hetzner/control-plane)" | ||
| } | ||
`deploy` user has no SSH authorized keys, so the GitHub Actions deploy workflow will be unable to connect.

Hetzner's SSH key injection only populates root's `~/.ssh/authorized_keys`. The `deploy` user is created with `lock_passwd: true` and no `ssh_authorized_keys` entry, making it unreachable via SSH. The README's deploy step triggers `deploy-eliza-provisioning-worker.yml`, which presumably SSHes into this user; that will fail until keys are injected out-of-band.
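The declarative fix would be an `ssh_authorized_keys` list on the `deploy` user in the cloud-init `users` block. Until that lands, the out-of-band injection could look roughly like this. It is a sketch only: `seed_authorized_keys` is a hypothetical helper, and reusing root's key file for `deploy` is an assumption about which keys the workflow uses:

```shell
# seed_authorized_keys SRC_FILE DEST_HOME
# Copies an authorized_keys file into DEST_HOME/.ssh with the permissions
# sshd expects (700 on the directory, 600 on the file).
seed_authorized_keys() {
  mkdir -p "$2/.ssh" &&
    chmod 700 "$2/.ssh" &&
    cp "$1" "$2/.ssh/authorized_keys" &&
    chmod 600 "$2/.ssh/authorized_keys"
}

# Usage sketch on the VM, as root (not executed here):
#   seed_authorized_keys /root/.ssh/authorized_keys /home/deploy
#   chown -R deploy:deploy /home/deploy/.ssh
```

The ownership change is kept as a separate step because it needs root, while the copy itself does not.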