
feat(multinode): add multi-node cluster support with worker agents #31

Merged: dviejokfs merged 17 commits into main from feat/multinode on Mar 10, 2026
Conversation

@dviejokfs
Contributor

Summary

  • Multi-node cluster architecture: distribute deployments across control plane + worker nodes connected via WireGuard private networking
  • Node scheduler: round-robin replica distribution across local (control plane) and remote worker nodes, with target node filtering support
  • Cross-node service connectivity: env vars are rewritten for remote containers so they reach external services (Postgres, Redis, etc.) on the control plane via private IP + host port instead of Docker container names
  • Multi-node-aware route table: proxy resolves node.private_address:host_port for containers on worker nodes, enabling traffic routing across the cluster

New crates

  • temps-agent — worker node agent with Docker runtime, token-based auth, and deploy/status/stop/logs API
  • temps-wireguard — WireGuard tunnel management for secure node-to-node networking

CLI commands

  • temps agent — start a worker node agent
  • temps join — register a worker node with the control plane (direct or relay mode)
  • temps serve --private-address — set the control plane's private/WireGuard IP for cross-node connectivity

Infrastructure

  • DB migrations: nodes table, node_id columns on deployment_containers and deployment_config
  • External services bind to 0.0.0.0 (was 127.0.0.1) so they're reachable from worker nodes via the private network
  • Remote container deployer via agent HTTP API with health checks and log streaming
  • Node health check job monitors worker heartbeats and marks stale nodes

Web UI

  • Nodes management page under Settings with status, heartbeat, and node details

Test plan

  • cargo check --lib compiles cleanly
  • All pre-commit hooks pass (clippy, fmt, typos, conventional commit)
  • Node scheduler unit tests pass (9 tests: round-robin, local fallback, target filtering, edge cases)
  • Manual: temps serve --private-address <wg-ip> persists the address
  • Manual: temps join --token <token> --control-plane <url> registers a worker node
  • Manual: deploy app with 2 replicas — one on control plane, one on worker
  • Manual: app on worker gets rewritten env vars (e.g. POSTGRES_URL uses private IP)
  • Manual: proxy routes traffic to both local and remote containers via round-robin

dviejokfs added 17 commits March 5, 2026 16:36
Implement distributed deployment across multiple nodes connected via
WireGuard private networking. Includes node scheduler with round-robin
distribution (control plane + workers), remote container deployment via
agent API, cross-node service connectivity with env var rewriting, and
multi-node-aware route table for proxy traffic routing.

Key changes:
- temps-agent crate: worker node agent with Docker runtime and auth
- temps-wireguard crate: WireGuard tunnel management
- Node scheduler: round-robin across local + remote nodes
- Remote deployer: deploy containers to worker nodes via HTTP API
- Cross-node env vars: rewrite service URLs for remote containers
- Route table: resolve node private addresses for proxy routing
- CLI: `temps agent`, `temps join`, `--private-address` flag
- External services bind to 0.0.0.0 for cross-node accessibility
- Migrations: nodes table, node_id columns on containers/configs
- Web UI: nodes management page in settings
… poll

The DB poll fallback in wait_for_route_ready only ran when the queue
receiver timed out. If the queue had a steady stream of unrelated events
(other jobs, other environments), they triggered `continue` on every
iteration, preventing the timeout_at from ever expiring and the poll
from ever running. This caused deployments to wait the full 60s timeout
and then fail/rollback even though the route was already live.

Move the deadline check and DB poll to the top of the loop so they run
on every iteration regardless of queue activity.
…tside lock

Two changes to prevent mark_deployment_complete from blocking subsequent
deployments:

1. Replace pg_advisory_lock (blocks forever) with pg_try_advisory_lock
   in a retry loop with 120s timeout. If a previous deployment is slow
   to complete, the next one will time out with a clear error instead of
   hanging indefinitely.

2. Move cancel_previous_deployments (container teardown) outside the
   advisory lock. Teardown can be slow (10s graceful stop per container,
   multiple containers across old deployments) and doesn't need
   serialization — the route table already points to the new deployment
   by the time teardown runs.
All env var values are now encrypted before being stored in the database
and transparently decrypted on read. Adds an `is_encrypted` boolean column
(default false) for backward compatibility with existing unencrypted rows.

- Migration: adds `is_encrypted` flag to `env_vars` table
- Entity: adds `is_encrypted` field to `env_vars::Model`
- EnvVarService (temps-environments, temps-projects): accepts `Arc<EncryptionService>`;
  encrypts on create/update, decrypts on every read; legacy rows pass through as-is
- WorkflowPlanner: decrypts env var values before injecting into deployments
- Typed error variants: `EncryptionFailed` and `DecryptionFailed` with key/id context
- 8 unit tests covering roundtrip, legacy passthrough, wrong-key error, invalid
  base64, nonce randomness, and service-level decrypt flows
…riting

get_service_effective_address() called get_effective_address() which
returns baremetal addresses (localhost:host_port) when DeploymentMode
is not Docker. But get_runtime_env_vars() always generates env vars
with Docker container names (e.g. postgres-mydb:5432), so the
find-and-replace in build_remote_environment_variables() never matched.

Add get_docker_container_name() and get_docker_internal_port() to the
ExternalService trait and use them in get_service_effective_address()
to always return Docker container names regardless of deployment mode.
PostgreSQL session-level advisory locks (pg_try_advisory_lock /
pg_advisory_unlock) don't work correctly with Sea-ORM's connection
pool: lock and unlock calls can hit different pooled connections,
leaving locks permanently held and blocking all subsequent deployments
for that environment until server restart.

Replace with a process-level tokio::sync::Mutex per environment_id
stored in a static HashMap. This is correct because Temps runs as a
single process, and the mutex is automatically released when the
guard is dropped (even on panic/error paths).
…ctions

The proxy had no upstream connection timeouts and no retry logic. When
Pingora reused a pooled keep-alive connection that the upstream had
already closed, the TCP RST caused an immediate 503 ("Connection reset
by peer") with zero bytes read.

- Set connection_timeout (5s), read_timeout (60s), write_timeout (60s),
  and idle_timeout (60s) on every upstream peer
- Retry once on connection failure via fail_to_connect + set_retry(true),
  which makes Pingora open a fresh connection instead of reusing a stale
  pooled one
The upstream includes content-length in HEAD responses (correct per
RFC 9110), but when proxied over HTTP/2, clients like curl interpret
it as a body-bytes promise and error with "transfer closed with N
bytes remaining". Cloudflare handles this by stripping content-length
on HEAD responses; do the same in upstream_response_filter.
- pingora-core 0.7.0 → 0.8.0 (HTTP Request Smuggling CVEs)
- aws-lc-sys 0.32.3 → 0.38.0 (PKCS7 bypass, timing side-channel)
- jsonwebtoken 9 → 10.3.0 in google-indexing-plugin (type confusion)
- Flask 3.0.0 → 3.1.3 in example app (session cookie fix)
…ode management

- Add alarm service with fire/resolve lifecycle and alarm entity/migration
- Add container health monitor with CPU, memory, restart, and OOM detection
- Implement node scheduler with multi-node deployment support
- Add node CLI commands and agent health check handlers
- Enhance proxy with multi-node upstream routing
- Add node management UI page with status and actions
- Add alarm detector pipeline design doc
…node management

- Fix old containers not being cleaned up on worker nodes during
  redeployment by routing teardown through RemoteNodeDeployer when
  container has a node_id
- Add encryption_service to MarkDeploymentCompleteJob and
  DeploymentService for decrypting node tokens during remote teardown
- Improve node health checks, proxy routing, and CLI node management
…and integration tests

Add container labels for reconciliation, list_containers agent endpoint,
anti-affinity environment settings, fix auth and WebSocket log tests,
and add multinode integration test suite.
The Role enum gained a Demo variant but the test still expected 6 roles.
@dviejokfs dviejokfs merged commit f4af6b4 into main Mar 10, 2026
5 checks passed