feat(cloud): Hetzner control plane IaC + data plane naming + legacy milady-core deprecation #7890
Merged
3 commits
124 changes: 124 additions & 0 deletions
packages/cloud-infra/cloud/terraform/hetzner/ARCHITECTURE.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,124 @@ | ||
| # Hetzner Control Plane vs Data Plane | ||
|
|
||
| eliza Cloud runs on a two-tier Hetzner Cloud setup. This doc nails down the | ||
| split so we stop treating manually-created VMs as "infrastructure-by-prayer". | ||
|
|
||
| ## Layers | ||
|
|
||
| ``` | ||
| ┌──────────────────────────────────────────────────────────────┐ | ||
| │ Tier 1 — Control plane (static, 1-2 VMs, Terraform) │ | ||
| │ │ | ||
| │ eliza-1 (Hetzner cpx21, fsn1) │ | ||
| │ ├── eliza-provisioning-worker (systemd, queue consumer)│ | ||
| │ ├── eliza-agent-router (systemd, HTTP routing) │ | ||
| │ ├── headscale (VPN mesh) │ | ||
| │ ├── cloudflared tunnel (public ingress) │ | ||
| │ ├── nginx (reverse proxy) │ | ||
| │ └── (optional: grafana/prometheus) │ | ||
| │ │ | ||
| │ Lifecycle: long-lived. Replaced on demand, not autoscaled.│ | ||
| │ Cost: ~€5/mo per VM (cpx21). │ | ||
| └──────────────────────────────────────────────────────────────┘ | ||
| │ enqueue / SSH | ||
| ▼ | ||
| ┌──────────────────────────────────────────────────────────────┐ | ||
| │ Tier 2 — Data plane (elastic, N cores, runtime autoscale) │ | ||
| │ │ | ||
| │ eliza-core-<hex> (Hetzner cpx32, fsn1) │ | ||
| │ ├── Docker daemon │ | ||
| │ └── eliza-sandbox containers × N │ | ||
| │ │ | ||
| │ Lifecycle: created/drained by node-autoscaler.ts based on │ | ||
| │ real demand. Server limit: ~25 (Hetzner default). │ | ||
| │ Cost: elastic (~€11/mo per running cpx32). │ | ||
| └──────────────────────────────────────────────────────────────┘ | ||
| ``` | ||
|
|
||
| ## Why two tiers | ||
|
|
||
| | Concern | Control plane | Data plane | | ||
| |----------------------|------------------------|---------------------------| | ||
| | **Provisioning** | Terraform (one-shot) | Runtime API (node-autoscaler.ts) | | ||
| | **Lifecycle** | Persistent | Ephemeral | | ||
| | **State** | Has local state (headscale DB, cloudflared creds) | Stateless | | ||
| | **Failure mode** | Page someone | Replace automatically | | ||
| | **Cost predictability** | Fixed monthly | Elastic | | ||
| | **What lives here** | Orchestrator, routing, monitoring | Just Docker + agents | | ||
|
|
||
| The split prevents the "control plane melts with the data plane during a | ||
| traffic spike" failure mode. Pulling sandboxes off the data plane is the | ||
| autoscaler's job; the orchestrator that issues drain commands must stay up | ||
| to coordinate it. | ||
|
|
||
| ## Code ↔ infrastructure mapping | ||
|
|
||
| | Component | Code | Infra | | ||
| |---|---|---| | ||
| | Control plane VM | [`packages/scripts/cloud/admin/daemons/provisioning-worker.ts`](../../../../scripts/cloud/admin/daemons/provisioning-worker.ts) | [Terraform: `control-plane/`](./control-plane/) | | ||
| | Agent router | [`packages/scripts/cloud/admin/daemons/agent-router.ts`](../../../../scripts/cloud/admin/daemons/agent-router.ts) | systemd unit on control-plane VM | | ||
| | Data plane autoscaler | [`packages/cloud-shared/src/lib/services/containers/node-autoscaler.ts`](../../../../cloud-shared/src/lib/services/containers/node-autoscaler.ts) | Hetzner Cloud API at runtime | | ||
| | Sandbox provisioning | [`packages/cloud-shared/src/lib/services/docker-sandbox-provider.ts`](../../../../cloud-shared/src/lib/services/docker-sandbox-provider.ts) | SSH from control plane to data plane | | ||
|
|
||
| ## Naming convention | ||
|
|
||
| | Layer | Prefix | Example | Where it's set | | ||
| |---|---|---|---| | ||
| | Control plane VM | `eliza-<n>` | `eliza-1` | Terraform `hcloud_server.control_plane` | | ||
| | Data plane node (NEW) | `eliza-core-<hex>` | `eliza-core-38ea87b1` | [`generateNodeId()`](../../../../cloud-shared/src/lib/services/containers/node-autoscaler.ts) | | ||
| | Data plane node (LEGACY) | `milady-core-<n>` | `milady-core-1` | DEPRECATED — see [Legacy migration](#legacy-milady-core-migration) | | ||
|
|
||
| ## Legacy `milady-core-*` migration | ||
|
|
||
| Before May 2026 the data plane was 6 manually created `milady-core-*` VMs (Hetzner | ||
| cpx32 in fsn1) inserted by hand into `docker_nodes` with `capacity = 100`. | ||
| By 2026-05-22: | ||
|
|
||
| - All 6 cores were `status: offline` (SSH health-check failing for weeks) | ||
| - Several user sandboxes still ran on the underlying Docker daemons | ||
| - The cloud autoscaler couldn't account for them | ||
|
|
||
| Migration 0132 (`0132_legacy_milady_cores_disable.sql`) sets them to | ||
| `enabled = false` and corrects `capacity` to `8`. This: | ||
|
|
||
| 1. Removes them from autoscaler capacity decisions | ||
| 2. Stops the health-check noise | ||
| 3. Lets the autoscaler spin up replacement `eliza-core-<hex>` nodes on demand | ||
|
|
||
| Existing sandboxes keep running until next restart. On user-triggered | ||
| restart / recreate, the daemon provisions them on a fresh autoscaled core. | ||
|
|
||
| Once `SELECT SUM(allocated_count) FROM docker_nodes WHERE node_id LIKE 'milady-core-%'` | ||
| is `0`, ops can: | ||
|
|
||
| ```bash | ||
| # 1. Delete the Hetzner Cloud servers (one-time, via Hetzner console or): | ||
| hcloud server delete milady-core-1 | ||
| hcloud server delete milady-core-2 | ||
| # ... etc. | ||
|
|
||
| # 2. Drop the DB rows (this is SQL, so run it through psql rather than the shell): | ||
| psql "$DATABASE_URL" -c "DELETE FROM docker_nodes WHERE node_id LIKE 'milady-core-%';" | ||
| ``` | ||
|
|
||
| ## Followups (not in this initial PR) | ||
|
|
||
| - [ ] Terraform module for headscale state (preauth keys, ACLs) | ||
| - [ ] Terraform module for the cloudflared tunnel (currently created by hand) | ||
| - [ ] Terraform-apply GitHub workflow (`infra/**` path filter) | ||
| - [ ] Move the 4 remaining cron paths off the orphan | ||
| `container-control-plane` service onto the daemon-queue pattern | ||
| (`pool-replenish`, `pool-health-check`, `pool-image-rollout`, | ||
| `deployment-monitor`). Once done, retire the | ||
| `packages/cloud-services/container-control-plane/` package entirely. | ||
| - [ ] Raise Hetzner Cloud server limit (open ticket) to enable autoscale | ||
| past the account's default server cap (see the Tier 2 note under Layers). | ||
|
|
||
| ## Operator runbook | ||
|
|
||
| See [`control-plane/README.md`](./control-plane/README.md) | ||
| for the step-by-step: | ||
|
|
||
| - Bootstrap a brand-new control-plane VM | ||
| - Import the existing production VM into Terraform | ||
| - Verify state, plan, apply |
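The decommission gate in the legacy migration section of `ARCHITECTURE.md` can be sketched as a small check. The TypeScript below mirrors the `SUM(allocated_count)` query over rows already fetched from `docker_nodes`. The `DockerNodeRow` type and both helper functions are hypothetical names for illustration, not code from this PR, and only the `node_id` and `allocated_count` columns named in the doc are assumed.

```typescript
// Hypothetical row shape: only the two columns referenced in the doc's SQL.
interface DockerNodeRow {
  node_id: string;
  allocated_count: number;
}

const LEGACY_PREFIX = "milady-core-";

// Sum allocated_count across legacy rows, mirroring
// SELECT SUM(allocated_count) ... WHERE node_id LIKE 'milady-core-%'.
function legacyAllocatedTotal(rows: DockerNodeRow[]): number {
  return rows
    .filter((row) => row.node_id.startsWith(LEGACY_PREFIX))
    .reduce((total, row) => total + row.allocated_count, 0);
}

// The legacy servers and rows are safe to delete only once nothing is allocated.
function legacyCoresDrained(rows: DockerNodeRow[]): boolean {
  return legacyAllocatedTotal(rows) === 0;
}

// Illustrative data, not real production rows.
const sample: DockerNodeRow[] = [
  { node_id: "milady-core-1", allocated_count: 2 },
  { node_id: "milady-core-2", allocated_count: 0 },
  { node_id: "eliza-core-38ea87b1", allocated_count: 5 },
];
console.log(legacyAllocatedTotal(sample)); // 2
console.log(legacyCoresDrained(sample)); // false
```

One difference from the SQL: `SUM` over zero matching rows returns `NULL` rather than `0`, while this sketch treats an empty set of legacy rows as drained.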
11 changes: 11 additions & 0 deletions
packages/cloud-infra/cloud/terraform/hetzner/control-plane/.gitignore
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| .terraform/ | ||
| .terraform.lock.hcl | ||
| *.tfstate | ||
| *.tfstate.backup | ||
| crash.log | ||
| crash.*.log | ||
|
|
||
| # Live tfvars contain SSH public keys + zone IDs — keep them out of git. | ||
| # Only `.tfvars.example` lives in the repo. | ||
| *.tfvars | ||
| !*.tfvars.example |
88 changes: 88 additions & 0 deletions
packages/cloud-infra/cloud/terraform/hetzner/control-plane/README.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,88 @@ | ||
| # hetzner/control-plane — Terraform for the eliza Cloud control-plane VMs | ||
|
|
||
| This Terraform module manages the **persistent** Hetzner Cloud VM(s) that host | ||
| the eliza Cloud control plane: | ||
|
|
||
| - `eliza-provisioning-worker` — pulls jobs from the `jobs` table and SSHs | ||
| into sandbox cores | ||
| - `eliza-agent-router` — subdomain HTTP routing | ||
| - `cloudflared` — secure tunnel for `sandboxes.waifu.fun` | ||
| - `headscale` — VPN mesh for cross-core agent traffic | ||
|
|
||
| The **data plane** (the sandbox cores themselves) is **not** managed here — | ||
| those are provisioned and drained at runtime by | ||
| [`node-autoscaler.ts`](../../../../../cloud-shared/src/lib/services/containers/node-autoscaler.ts) | ||
| which talks to the Hetzner Cloud API directly. See | ||
| [`../ARCHITECTURE.md`](../ARCHITECTURE.md) for the full split. | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| 1. **Hetzner Cloud project** with API token (`HCLOUD_TOKEN`). | ||
| 2. **Cloudflare account** with API token + DNS edit on `elizacloud.ai` | ||
| (`CLOUDFLARE_API_TOKEN`). | ||
| 3. **Cloudflare R2 bucket** `eliza-terraform-state` for remote state. Generate | ||
| an R2 API token, edit `backend-staging.hcl` / `backend-production.hcl` | ||
| with your CF account ID, then export the R2 token as | ||
| `AWS_ACCESS_KEY_ID` + `AWS_SECRET_ACCESS_KEY` before `terraform init`. | ||
| 4. **Terraform >= 1.5.0** locally. | ||
|
|
||
| ## Bootstrap a brand-new control-plane VM (staging) | ||
|
|
||
| ```bash | ||
| cd packages/cloud-infra/cloud/terraform/hetzner/control-plane | ||
|
|
||
| # 1. Pull providers + connect remote state. | ||
| terraform init -backend-config=backend-staging.hcl | ||
|
|
||
| # 2. Copy + fill tfvars. | ||
| cp tfvars/staging.tfvars.example tfvars/staging.tfvars | ||
| $EDITOR tfvars/staging.tfvars | ||
|
|
||
| # 3. Plan + apply. | ||
| export HCLOUD_TOKEN=... | ||
| export CLOUDFLARE_API_TOKEN=... | ||
| terraform plan -var-file=tfvars/staging.tfvars | ||
| terraform apply -var-file=tfvars/staging.tfvars | ||
|
|
||
| # 4. Output gives you the VM IP. Copy the cloud env file into place: | ||
| scp packages/cloud-shared/.env.local root@<vm-ip>:/opt/eliza/cloud/.env.local | ||
|
|
||
| # 5. Trigger first deploy from GitHub Actions | ||
| # (workflow: deploy-eliza-provisioning-worker.yml, manual dispatch). | ||
| ``` | ||
|
|
||
| ## Adopt the existing production VM into Terraform | ||
|
|
||
| The current prod manager VM (`89.167.63.246`, historical hostname | ||
| `milady`) was created by hand in May 2026. To bring it under Terraform | ||
| without recreating it, look up the Hetzner Cloud server ID | ||
| (`hcloud server list`), then `terraform import 'hcloud_server.control_plane["1"]' <id>` | ||
| plus a `terraform import` for each existing `hcloud_ssh_key`. The first | ||
| plan after import shows the in-place rename `milady → eliza-1`, the new | ||
| labels, and the Cloudflare DNS record creation; `user_data` and `image` | ||
| diffs are suppressed by `lifecycle { ignore_changes }`. One-shot — never | ||
| re-run. | ||
|
|
||
| ## What this module does NOT manage (yet) | ||
|
|
||
| - Headscale state (preauth keys, ACLs) — manual via `headscale` CLI. | ||
| - Cloudflared tunnels — config lives at `/root/.cloudflared/` on the VM and | ||
| is created via `cloudflared tunnel create` one-shot. | ||
| - The systemd units — installed by `deploy-eliza-provisioning-worker.yml` | ||
| on every push. | ||
| - The actual eliza Cloud sandbox cores (data plane) — runtime autoscale. | ||
|
|
||
| These are tracked as TODOs in | ||
| [`../ARCHITECTURE.md`](../ARCHITECTURE.md#followups-not-in-this-initial-pr). | ||
|
|
||
| ## Cost | ||
|
|
||
| | Component | Resource | Monthly (€) | | ||
| |------------------------------|-------------|-------------| | ||
| | 1× cpx21 (3 vCPU / 4 GB) | control VM | ~5 | | ||
| | 1× IPv4 + IPv6 | floating IP | included | | ||
| | Cloudflare R2 state | < 100 KB | 0 | | ||
| | **Total per environment** | | **~5** | | ||
|
|
||
| A 2nd control-plane VM (HA, currently unused) doubles the line. The | ||
| **data-plane autoscale** cost is separate and elastic. |
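As a rough worked example of the cost figures, the sketch below adds the fixed control-plane line to an elastic data-plane estimate. The ~€5 (cpx21) and ~€11 (cpx32) monthly values are the approximate numbers quoted in this PR's docs, not live Hetzner pricing, and `estimateMonthlyCostEur` is a hypothetical helper.

```typescript
// Approximate monthly prices quoted in the PR docs (EUR); not live pricing.
const CONTROL_VM_EUR = 5; // cpx21 control-plane VM
const DATA_NODE_EUR = 11; // cpx32 data-plane node

// Fixed control-plane cost plus elastic data-plane cost for one environment.
function estimateMonthlyCostEur(controlVms: number, dataNodes: number): number {
  if (controlVms < 0 || dataNodes < 0) {
    throw new RangeError("VM counts must be non-negative");
  }
  return controlVms * CONTROL_VM_EUR + dataNodes * DATA_NODE_EUR;
}

console.log(estimateMonthlyCostEur(1, 0)); // 5  (single control VM, no cores running)
console.log(estimateMonthlyCostEur(2, 3)); // 43 (HA control plane + 3 running cores)
```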
9 changes: 9 additions & 0 deletions
packages/cloud-infra/cloud/terraform/hetzner/control-plane/backend-production.hcl
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,9 @@ | ||
| bucket = "eliza-terraform-state" | ||
| key = "hetzner/control-plane/production.tfstate" | ||
| region = "auto" | ||
| endpoints = { s3 = "https://23cf6feaeaa541f6a0675053c33da768.r2.cloudflarestorage.com" } | ||
| skip_credentials_validation = true | ||
| skip_metadata_api_check = true | ||
| skip_region_validation = true | ||
| skip_requesting_account_id = true | ||
| use_path_style = true |
20 changes: 20 additions & 0 deletions
packages/cloud-infra/cloud/terraform/hetzner/control-plane/backend-staging.hcl
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,20 @@ | ||
| # Cloudflare R2 backend (S3-compatible) for terraform state. | ||
| # | ||
| # Set up once per environment: | ||
| # 1. Create R2 bucket `eliza-terraform-state` in the elizaOS CF account. | ||
| # 2. Generate R2 API token with read/write on that bucket. | ||
| # 3. Export AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY pointing at the | ||
| # R2 token before `terraform init`. | ||
| # | ||
| # Usage: | ||
| # terraform init -backend-config=backend-staging.hcl | ||
|
|
||
| bucket = "eliza-terraform-state" | ||
| key = "hetzner/control-plane/staging.tfstate" | ||
| region = "auto" | ||
| endpoints = { s3 = "https://23cf6feaeaa541f6a0675053c33da768.r2.cloudflarestorage.com" } | ||
| skip_credentials_validation = true | ||
| skip_metadata_api_check = true | ||
| skip_region_validation = true | ||
| skip_requesting_account_id = true | ||
| use_path_style = true |
85 changes: 85 additions & 0 deletions
packages/cloud-infra/cloud/terraform/hetzner/control-plane/cloud-init/bootstrap.yaml.tftpl
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,85 @@ | ||
| #cloud-config | ||
| # Cloud-init bootstrap for eliza control-plane VMs. | ||
| # | ||
| # At first boot: | ||
| # 1. Install system deps (docker, nginx, git, jq, postgresql-client, rsync, unzip) and the Bun runtime | ||
| # 2. Add a `deploy` user that the GitHub Actions workflow | ||
| # `.github/workflows/deploy-eliza-provisioning-worker.yml` SSHes into | ||
| # 3. Clone the monorepo on `${deploy_branch}` to /opt/eliza | ||
| # 4. Install systemd units for `eliza-provisioning-worker` and | ||
| # `eliza-agent-router` | ||
| # | ||
| # Secrets (DATABASE_URL, KV_REST_API_*, STEWARD_*, HCLOUD_TOKEN, etc.) are NOT | ||
| # baked in here. The operator copies `cloud/.env.local` to `/opt/eliza/cloud/.env.local` | ||
| # in a follow-up step (one-shot, out-of-band) using `scp` or the existing | ||
| # `packages/scripts/cloud/admin/bootstrap-provisioning-worker-host.mjs` script. | ||
|
|
||
| hostname: ${hostname} | ||
| manage_etc_hosts: true | ||
|
|
||
| users: | ||
| - name: deploy | ||
| groups: sudo, docker | ||
| shell: /bin/bash | ||
| sudo: ALL=(ALL) NOPASSWD:ALL | ||
| lock_passwd: true | ||
| # Mirror operator SSH keys onto the deploy user. Hetzner only injects | ||
| # `hcloud_ssh_key` entries into root by default, but the | ||
| # `.github/workflows/deploy-eliza-provisioning-worker.yml` workflow SSHes | ||
| # in as `deploy`. Without this list the first auto-deploy would fail. | ||
| ssh_authorized_keys: | ||
| %{ for key in operator_ssh_keys ~} | ||
| - ${key} | ||
| %{ endfor ~} | ||
|
|
||
| package_update: true | ||
| package_upgrade: false | ||
|
|
||
| apt: | ||
| sources: | ||
| docker: | ||
| source: "deb [arch=amd64 signed-by=$KEY_FILE] https://download.docker.com/linux/ubuntu $RELEASE stable" | ||
| keyid: 9DC858229FC7DD38854AE2D88D81803C0EBFCD88 | ||
|
|
||
| packages: | ||
| - curl | ||
| - docker-ce | ||
| - docker-ce-cli | ||
| - containerd.io | ||
| - git | ||
| - jq | ||
| - nginx | ||
| - postgresql-client | ||
| - rsync | ||
| - unzip | ||
|
|
||
| write_files: | ||
| - path: /etc/profile.d/eliza-paths.sh | ||
| permissions: "0644" | ||
| content: | | ||
| export PATH="/home/deploy/.bun/bin:$PATH" | ||
| export ELIZA_DEPLOY_BRANCH="${deploy_branch}" | ||
|
|
||
| runcmd: | ||
| # Docker is installed via the `apt` block above using the official | ||
| # Docker apt repo (GPG-verified keyring), not `curl get.docker.com | sh`. | ||
| - systemctl enable --now docker | ||
|
|
||
| # Bun runtime for the deploy user. We download a pinned release tarball | ||
| # AND its SHASUMS256 manifest from the same GitHub release, then verify | ||
| # the binary against the published hash before extracting. This avoids | ||
| # the `curl bun.sh/install | bash` supply-chain pattern flagged in the | ||
| # PR review — we still trust GitHub's HTTPS, but the manifest commits | ||
| # the publisher to a specific hash that an attacker can't forge mid-flight. | ||
| - install -d -o deploy -g deploy /home/deploy/.bun /home/deploy/.bun/bin | ||
| - su - deploy -c 'curl -fsSL -o /tmp/bun-linux-x64.zip https://github.com/oven-sh/bun/releases/download/bun-v1.3.13/bun-linux-x64.zip' | ||
| - su - deploy -c 'curl -fsSL -o /tmp/bun-SHASUMS256.txt https://github.com/oven-sh/bun/releases/download/bun-v1.3.13/SHASUMS256.txt' | ||
| # sha256sum -c looks the file up by the name listed in the manifest, so the download keeps its release name. | ||
| # Extraction is chained with && so it only runs when the hash matches. | ||
| - su - deploy -c 'cd /tmp && grep "bun-linux-x64.zip" bun-SHASUMS256.txt | sha256sum -c - && unzip -q bun-linux-x64.zip && install -m 0755 bun-linux-x64/bun /home/deploy/.bun/bin/bun' | ||
| - su - deploy -c 'rm -rf /tmp/bun-linux-x64.zip /tmp/bun-linux-x64 /tmp/bun-SHASUMS256.txt' | ||
|
|
||
| # Repo checkout — the GitHub deploy workflow then takes over for code updates. | ||
| - mkdir -p /opt/eliza && chown deploy:deploy /opt/eliza | ||
| - su - deploy -c 'git clone --depth 200 --branch ${deploy_branch} https://github.com/elizaOS/eliza.git /opt/eliza' | ||
|
|
||
| # Final marker for the bootstrap-warn check that runs after this. | ||
| - echo "eliza-control-plane bootstrap finished $(date -u +%Y-%m-%dT%H:%M:%SZ)" > /var/log/eliza-bootstrap.log | ||
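The manifest check that `bootstrap.yaml.tftpl` performs with `sha256sum -c` can be sketched in TypeScript to make the logic explicit: find the entry for the file by name, hash the bytes, and compare. This is an illustrative sketch, not code from the PR. `expectedDigest` and `verifyArtifact` are hypothetical helpers, and the manifest is assumed to use the standard `sha256sum` layout (`<hex>  <name>` per line).

```typescript
import { createHash } from "node:crypto";

// Find the expected hex digest for `fileName` in a sha256sum-style manifest.
// Returns undefined when the file is not listed.
function expectedDigest(manifest: string, fileName: string): string | undefined {
  for (const line of manifest.split("\n")) {
    const match = line.trim().match(/^([0-9a-f]{64})\s+\*?(.+)$/i);
    if (!match) continue;
    const [, hex, name] = match;
    if (hex && name === fileName) return hex.toLowerCase();
  }
  return undefined;
}

// True only when the manifest lists the file AND the digest matches.
function verifyArtifact(data: Uint8Array, manifest: string, fileName: string): boolean {
  const expected = expectedDigest(manifest, fileName);
  if (expected === undefined) return false;
  const actual = createHash("sha256").update(data).digest("hex");
  return actual === expected;
}

// In-memory stand-ins for the downloaded zip and manifest.
const payload = new TextEncoder().encode("example payload");
const payloadDigest = createHash("sha256").update(payload).digest("hex");
const exampleManifest = `${payloadDigest}  bun-linux-x64.zip\n`;

console.log(verifyArtifact(payload, exampleManifest, "bun-linux-x64.zip")); // true
console.log(verifyArtifact(payload, exampleManifest, "bun-linux-aarch64.zip")); // false
```

The name lookup matters: a file saved under a different name than the manifest entry will not be found by `sha256sum -c`, which is why the download name and the manifest name have to agree.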
`deploy` user has no SSH authorized keys: GitHub Actions deploy workflow will be unable to connect

Hetzner's SSH key injection only populates root's `~/.ssh/authorized_keys`. The `deploy` user is created with `lock_passwd: true` and no `ssh_authorized_keys` entry, making it unreachable via SSH. The README's deploy step triggers `deploy-eliza-provisioning-worker.yml`, which presumably SSHes into this user; that will fail until keys are injected out-of-band.
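The `%{ for key in operator_ssh_keys ~}` loop in `bootstrap.yaml.tftpl` is what renders operator keys under the `deploy` user. A minimal TypeScript sketch of the same rendering follows, with one addition: it refuses an empty key list so the problem surfaces before a VM boots unreachable. `renderDeployUser` is a hypothetical helper, not part of the PR. The YAML indentation is assumed (the scraped diff does not preserve it), and the key in the usage line is a placeholder.

```typescript
// Render the cloud-init `users:` block for the deploy user.
// Throws when no usable key is supplied, since the account is password-locked.
function renderDeployUser(operatorSshKeys: string[]): string {
  const keys = operatorSshKeys
    .map((key) => key.trim())
    .filter((key) => key.length > 0);
  if (keys.length === 0) {
    throw new Error("deploy user needs at least one SSH public key");
  }
  return [
    "users:",
    "  - name: deploy",
    "    groups: sudo, docker",
    "    shell: /bin/bash",
    "    sudo: ALL=(ALL) NOPASSWD:ALL",
    "    lock_passwd: true",
    "    ssh_authorized_keys:",
    ...keys.map((key) => `      - ${key}`),
  ].join("\n");
}

// Placeholder key for illustration only.
console.log(renderDeployUser(["ssh-ed25519 AAAA-placeholder ops@example"]));
```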