|
| 1 | +# Hetzner Control Plane vs Data Plane |
| 2 | + |
| 3 | +eliza Cloud runs on a two-tier Hetzner Cloud setup. This doc nails down the |
| 4 | +split so we stop treating manually-created VMs as "infrastructure-by-prayer". |
| 5 | + |
| 6 | +## Layers |
| 7 | + |
| 8 | +``` |
| 9 | +┌──────────────────────────────────────────────────────────────┐ |
| 10 | +│ Tier 1 — Control plane (static, 1-2 VMs, Terraform) │ |
| 11 | +│ │ |
| 12 | +│ eliza-1 (Hetzner cpx21, fsn1) │ |
| 13 | +│ ├── eliza-provisioning-worker (systemd, queue consumer)│ |
| 14 | +│ ├── eliza-agent-router (systemd, HTTP routing) │ |
| 15 | +│ ├── headscale (VPN mesh) │ |
| 16 | +│ ├── cloudflared tunnel (public ingress) │ |
| 17 | +│ ├── nginx (reverse proxy) │ |
| 18 | +│ └── (optional: grafana/prometheus) │ |
| 19 | +│ │ |
| 20 | +│ Lifecycle: long-lived. Replaced on demand, not autoscaled.│ |
| 21 | +│ Cost: ~€5/mo per VM (cpx21). │ |
| 22 | +└──────────────────────────────────────────────────────────────┘ |
| 23 | + │ enqueue / SSH |
| 24 | + ▼ |
| 25 | +┌──────────────────────────────────────────────────────────────┐ |
| 26 | +│ Tier 2 — Data plane (elastic, N cores, runtime autoscale) │ |
| 27 | +│ │ |
| 28 | +│ eliza-core-<hex> (Hetzner cpx32, fsn1) │ |
| 29 | +│ ├── Docker daemon │ |
| 30 | +│ └── eliza-sandbox containers × N │ |
| 31 | +│ │ |
| 32 | +│ Lifecycle: created/drained by node-autoscaler.ts based on │ |
| 33 | +│ real demand. Server limit: ~25 (Hetzner default). │ |
| 34 | +│ Cost: elastic (~€11/mo per running cpx32). │ |
| 35 | +└──────────────────────────────────────────────────────────────┘ |
| 36 | +``` |
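
The data-plane lifecycle in the diagram (scale on real demand, bounded by the account server limit) can be sketched as a pure sizing function. This is a minimal illustration, not the logic in `node-autoscaler.ts`: the `ScaleInput` fields, the `decideScale` name, and the assumption that control-plane VMs count against the same server limit are all hypothetical.

```typescript
// Hypothetical input shape; none of these names come from node-autoscaler.ts.
interface ScaleInput {
  pendingSandboxes: number; // sandboxes waiting for a slot
  runningSandboxes: number; // sandboxes already placed
  capacityPerNode: number;  // sandbox slots per data-plane node
  currentNodes: number;     // data-plane nodes that exist right now
  controlPlaneVms: number;  // assumed to count against the same account limit
  serverLimit: number;      // account-wide Hetzner server cap
}

type ScaleDecision =
  | { action: "scale-up"; count: number }
  | { action: "scale-down"; count: number }
  | { action: "hold" };

function decideScale(input: ScaleInput): ScaleDecision {
  const demand = input.pendingSandboxes + input.runningSandboxes;
  // Nodes needed to host all demand, rounded up to whole nodes.
  const wanted = Math.ceil(demand / input.capacityPerNode);
  // Never plan more data-plane nodes than the account limit leaves room for.
  const maxDataNodes = Math.max(0, input.serverLimit - input.controlPlaneVms);
  const target = Math.min(wanted, maxDataNodes);

  if (target > input.currentNodes) {
    return { action: "scale-up", count: target - input.currentNodes };
  }
  if (target < input.currentNodes) {
    return { action: "scale-down", count: input.currentNodes - target };
  }
  return { action: "hold" };
}
```

Keeping the decision free of I/O means it can run on the control plane and be tested without touching the Hetzner API; a real autoscaler would likely add cooldowns and drain sandboxes before removing a node.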
| 37 | + |
| 38 | +## Why two tiers |
| 39 | + |
| 40 | +| Concern | Control plane | Data plane | |
| 41 | +|----------------------|------------------------|---------------------------| |
| 42 | +| **Provisioning** | Terraform (one-shot) | Runtime API (node-autoscaler.ts) | |
| 43 | +| **Lifecycle** | Persistent | Ephemeral | |
| 44 | +| **State** | Has local state (headscale DB, cloudflared creds) | Stateless | |
| 45 | +| **Failure mode** | Page someone | Replace automatically | |
| 46 | +| **Cost predictability** | Fixed monthly | Elastic | |
| 47 | +| **What lives here** | Orchestrator, routing, monitoring | Just Docker + agents | |
| 48 | + |
| 49 | +The split prevents the "control plane melts with the data plane during a |
| 50 | +traffic spike" failure mode. Pulling sandboxes off the data plane is the |
| 51 | +autoscaler's job; the orchestrator that issues drain commands must stay up |
| 52 | +to coordinate it. |
| 53 | + |
| 54 | +## Code ↔ infrastructure mapping |
| 55 | + |
| 56 | +| Component | Code | Infra | |
| 57 | +|---|---|---| |
| 58 | +| Control plane VM | [`packages/scripts/cloud/admin/daemons/provisioning-worker.ts`](../../../../scripts/cloud/admin/daemons/provisioning-worker.ts) | [Terraform: `control-plane/`](./control-plane/) | |
| 59 | +| Agent router | [`packages/scripts/cloud/admin/daemons/agent-router.ts`](../../../../scripts/cloud/admin/daemons/agent-router.ts) | systemd unit on control-plane VM | |
| 60 | +| Data plane autoscaler | [`packages/cloud-shared/src/lib/services/containers/node-autoscaler.ts`](../../../../cloud-shared/src/lib/services/containers/node-autoscaler.ts) | Hetzner Cloud API at runtime | |
| 61 | +| Sandbox provisioning | [`packages/cloud-shared/src/lib/services/docker-sandbox-provider.ts`](../../../../cloud-shared/src/lib/services/docker-sandbox-provider.ts) | SSH from control plane to data plane | |
| 62 | + |
| 63 | +## Naming convention |
| 64 | + |
| 65 | +| Layer | Prefix | Example | Where it's set | |
| 66 | +|---|---|---|---| |
| 67 | +| Control plane VM | `eliza-<n>` | `eliza-1` | Terraform `hcloud_server.control_plane` | |
| 68 | +| Data plane node (NEW) | `eliza-core-<hex>` | `eliza-core-38ea87b1` | [`generateNodeId()`](../../../../cloud-shared/src/lib/services/containers/node-autoscaler.ts) | |
| 69 | +| Data plane node (LEGACY) | `milady-core-<n>` | `milady-core-1` | DEPRECATED — see [Legacy migration](#legacy-milady-core-migration) | |
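
The prefixes in the table are enough to tell the layers apart by name alone. The sketch below shows one way to do that; the regular expressions, `classifyNode`, and `sketchNodeId` are illustrative assumptions and not the real `generateNodeId()` implementation (the 8-character hex suffix is inferred from the example `eliza-core-38ea87b1`).

```typescript
type NodeKind = "control-plane" | "data-plane" | "legacy-data-plane" | "unknown";

// Patterns derived from the naming table; assumed, not taken from the codebase.
const CONTROL_PLANE = /^eliza-\d+$/;
const DATA_PLANE = /^eliza-core-[0-9a-f]+$/;
const LEGACY_DATA_PLANE = /^milady-core-\d+$/;

function classifyNode(name: string): NodeKind {
  if (CONTROL_PLANE.test(name)) return "control-plane";
  if (DATA_PLANE.test(name)) return "data-plane";
  if (LEGACY_DATA_PLANE.test(name)) return "legacy-data-plane";
  return "unknown";
}

// Illustrative ID generator only. Math.random is not a cryptographic source;
// production code would more likely draw the suffix from a crypto-grade RNG.
function sketchNodeId(): string {
  let hex = "";
  for (let i = 0; i < 8; i++) {
    hex += Math.floor(Math.random() * 16).toString(16);
  }
  return `eliza-core-${hex}`;
}
```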
| 70 | + |
| 71 | +## Legacy `milady-core-*` migration |
| 72 | + |
| 73 | +Before 2026-05 the data plane was six manually created `milady-core-*` VMs (Hetzner |
| 74 | +cpx32 in fsn1), inserted by hand into `docker_nodes` with `capacity = 100`. |
| 75 | +By 2026-05-22: |
| 76 | + |
| 77 | +- All 6 cores were `status: offline` (SSH health-check failing for weeks) |
| 78 | +- Several user sandboxes still ran on the underlying Docker daemons |
| 79 | +- The cloud autoscaler couldn't account for them |
| 80 | + |
| 81 | +Migration 0132 (`0132_legacy_milady_cores_disable.sql`) sets them to |
| 82 | +`enabled = false` and corrects `capacity` to `8`. This: |
| 83 | + |
| 84 | +1. Removes them from autoscaler capacity decisions |
| 85 | +2. Stops the health-check noise |
| 86 | +3. Lets the autoscaler spin up replacement `eliza-core-<hex>` nodes on demand |
| 87 | + |
| 88 | +Existing sandboxes keep running until next restart. On user-triggered |
| 89 | +restart / recreate, the daemon provisions them on a fresh autoscaled core. |
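
The effect of `enabled = false` on placement can be sketched as a filter over `docker_nodes` rows. This is a hypothetical illustration: the column names come from this document, but `pickNode`, the `"online"` status value, and the "most free slots wins" rule are assumptions about how placement might work.

```typescript
// Row shape assumed from the column names mentioned in this document.
interface DockerNodeRow {
  node_id: string;
  enabled: boolean;
  status: string; // "offline" appears in this doc; other values are assumed
  capacity: number;
  allocated_count: number;
}

// Returns the node a restarted sandbox would land on, or null when no
// eligible node has room (the point at which a new core would be needed).
function pickNode(rows: DockerNodeRow[]): DockerNodeRow | null {
  const eligible = rows.filter(
    (r) => r.enabled && r.status !== "offline" && r.allocated_count < r.capacity,
  );
  if (eligible.length === 0) return null;
  // Prefer the node with the most free slots.
  return eligible.reduce((best, r) =>
    r.capacity - r.allocated_count > best.capacity - best.allocated_count ? r : best,
  );
}
```

Because disabled legacy rows never pass the filter, a sandbox that restarts cannot be placed back on a `milady-core-*` node, even if that node still reports free capacity.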
| 90 | + |
| 91 | +Once `SELECT SUM(allocated_count) FROM docker_nodes WHERE node_id LIKE 'milady-core-%'` |
| 92 | +returns `0`, ops can: |
| 93 | + |
| 94 | +```bash |
| 95 | +# 1. Delete the Hetzner Cloud servers (one-time, via Hetzner console or): |
| 96 | +hcloud server delete milady-core-1 |
| 97 | +hcloud server delete milady-core-2 |
| 98 | +# ... etc. |
| 99 | + |
| 100 | +# 2. Drop the DB rows (assumes $DATABASE_URL points at the cloud database): |
| 101 | +psql "$DATABASE_URL" -c "DELETE FROM docker_nodes WHERE node_id LIKE 'milady-core-%';" |
| 102 | +``` |
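
Before running the deletion, it can help to know which legacy nodes still block it. The sketch below mirrors the `SUM(allocated_count)` check on rows that have already been fetched; `legacyBlockers` and the row shape are hypothetical and not part of the codebase.

```typescript
// Minimal row shape for the check; column names are taken from this document.
interface LegacyCheckRow {
  node_id: string;
  allocated_count: number;
}

// Lists legacy nodes that still host sandboxes. An empty result corresponds
// to the SUM(allocated_count) query above returning 0.
function legacyBlockers(rows: LegacyCheckRow[]): string[] {
  return rows
    .filter((r) => r.node_id.startsWith("milady-core-") && r.allocated_count > 0)
    .map((r) => r.node_id);
}
```

Returning the node IDs rather than a bare sum shows which specific cores are still waiting on a user-triggered restart.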
| 103 | + |
| 104 | +## Followups (not in this initial PR) |
| 105 | + |
| 106 | +- [ ] Terraform module for headscale state (preauth keys, ACLs) |
| 107 | +- [ ] Terraform module for the cloudflared tunnel (currently created by hand) |
| 108 | +- [ ] Terraform-apply GitHub workflow (`infra/**` path filter) |
| 109 | +- [ ] Move the 4 remaining cron paths off the orphan |
| 110 | + `container-control-plane` service onto the daemon-queue pattern |
| 111 | + (`pool-replenish`, `pool-health-check`, `pool-image-rollout`, |
| 112 | + `deployment-monitor`). Once done, retire the |
| 113 | + `packages/cloud-services/container-control-plane/` package entirely. |
| 114 | +- [ ] Raise Hetzner Cloud server limit (open ticket) to enable autoscale |
| 115 | + past the account's default server cap. |
| 116 | + |
| 117 | +## Operator runbook |
| 118 | + |
| 119 | +See [`control-plane/README.md`](./control-plane/README.md) |
| 120 | +for the step-by-step: |
| 121 | + |
| 122 | +- Bootstrap a brand-new control-plane VM |
| 123 | +- Import the existing production VM into Terraform |
| 124 | +- Verify state, plan, apply |