
feat(multinode): add multi-node cluster support with worker agents #31

Merged: dviejokfs merged 17 commits into main from feat/multinode on Mar 10, 2026
Conversation

@dviejokfs
Contributor

Summary

  • Multi-node cluster architecture: distribute deployments across control plane + worker nodes connected via WireGuard private networking
  • Node scheduler: round-robin replica distribution across local (control plane) and remote worker nodes, with target node filtering support
  • Cross-node service connectivity: env vars are rewritten for remote containers so they reach external services (Postgres, Redis, etc.) on the control plane via private IP + host port instead of Docker container names
  • Multi-node-aware route table: proxy resolves node.private_address:host_port for containers on worker nodes, enabling traffic routing across the cluster

New crates

  • temps-agent — worker node agent with Docker runtime, token-based auth, and deploy/status/stop/logs API
  • temps-wireguard — WireGuard tunnel management for secure node-to-node networking

CLI commands

  • temps agent — start a worker node agent
  • temps join — register a worker node with the control plane (direct or relay mode)
  • temps serve --private-address — set the control plane's private/WireGuard IP for cross-node connectivity

Infrastructure

  • DB migrations: nodes table, node_id columns on deployment_containers and deployment_config
  • External services bind to 0.0.0.0 (was 127.0.0.1) so they're reachable from worker nodes via the private network
  • Remote container deployer via agent HTTP API with health checks and log streaming
  • Node health check job monitors worker heartbeats and marks stale nodes

Web UI

  • Nodes management page under Settings with status, heartbeat, and node details

Test plan

  • cargo check --lib compiles cleanly
  • All pre-commit hooks pass (clippy, fmt, typos, conventional commit)
  • Node scheduler unit tests pass (9 tests: round-robin, local fallback, target filtering, edge cases)
  • Manual: temps serve --private-address <wg-ip> persists the address
  • Manual: temps join --token <token> --control-plane <url> registers a worker node
  • Manual: deploy app with 2 replicas — one on control plane, one on worker
  • Manual: app on worker gets rewritten env vars (e.g. POSTGRES_URL uses private IP)
  • Manual: proxy routes traffic to both local and remote containers via round-robin

dviejokfs added 17 commits March 5, 2026 16:36
Implement distributed deployment across multiple nodes connected via
WireGuard private networking. Includes node scheduler with round-robin
distribution (control plane + workers), remote container deployment via
agent API, cross-node service connectivity with env var rewriting, and
multi-node-aware route table for proxy traffic routing.

Key changes:
- temps-agent crate: worker node agent with Docker runtime and auth
- temps-wireguard crate: WireGuard tunnel management
- Node scheduler: round-robin across local + remote nodes
- Remote deployer: deploy containers to worker nodes via HTTP API
- Cross-node env vars: rewrite service URLs for remote containers
- Route table: resolve node private addresses for proxy routing
- CLI: `temps agent`, `temps join`, `--private-address` flag
- External services bind to 0.0.0.0 for cross-node accessibility
- Migrations: nodes table, node_id columns on containers/configs
- Web UI: nodes management page in settings
… poll

The DB poll fallback in wait_for_route_ready only ran when the queue
receiver timed out. If the queue had a steady stream of unrelated events
(other jobs, other environments), they triggered `continue` on every
iteration, preventing the timeout_at from ever expiring and the poll
from ever running. This caused deployments to wait the full 60s timeout
and then fail/rollback even though the route was already live.

Move the deadline check and DB poll to the top of the loop so they run
on every iteration regardless of queue activity.
…tside lock

Two changes to prevent mark_deployment_complete from blocking subsequent
deployments:

1. Replace pg_advisory_lock (blocks forever) with pg_try_advisory_lock
   in a retry loop with 120s timeout. If a previous deployment is slow
   to complete, the next one will time out with a clear error instead of
   hanging indefinitely.

2. Move cancel_previous_deployments (container teardown) outside the
   advisory lock. Teardown can be slow (10s graceful stop per container,
   multiple containers across old deployments) and doesn't need
   serialization — the route table already points to the new deployment
   by the time teardown runs.
All env var values are now encrypted before being stored in the database
and transparently decrypted on read. Adds an `is_encrypted` boolean column
(default false) for backward compatibility with existing unencrypted rows.

- Migration: adds `is_encrypted` flag to `env_vars` table
- Entity: adds `is_encrypted` field to `env_vars::Model`
- EnvVarService (temps-environments, temps-projects): accepts `Arc<EncryptionService>`;
  encrypts on create/update, decrypts on every read; legacy rows pass through as-is
- WorkflowPlanner: decrypts env var values before injecting into deployments
- Typed error variants: `EncryptionFailed` and `DecryptionFailed` with key/id context
- 8 unit tests covering roundtrip, legacy passthrough, wrong-key error, invalid
  base64, nonce randomness, and service-level decrypt flows
…riting

get_service_effective_address() called get_effective_address() which
returns baremetal addresses (localhost:host_port) when DeploymentMode
is not Docker. But get_runtime_env_vars() always generates env vars
with Docker container names (e.g. postgres-mydb:5432), so the
find-and-replace in build_remote_environment_variables() never matched.

Add get_docker_container_name() and get_docker_internal_port() to the
ExternalService trait and use them in get_service_effective_address()
to always return Docker container names regardless of deployment mode.
PostgreSQL session-level advisory locks (pg_try_advisory_lock /
pg_advisory_unlock) don't work correctly with Sea-ORM's connection
pool: lock and unlock calls can hit different pooled connections,
leaving locks permanently held and blocking all subsequent deployments
for that environment until server restart.

Replace with a process-level tokio::sync::Mutex per environment_id
stored in a static HashMap. This is correct because Temps runs as a
single process, and the mutex is automatically released when the
guard is dropped (even on panic/error paths).
…ctions

The proxy had no upstream connection timeouts and no retry logic. When
Pingora reused a pooled keep-alive connection that the upstream had
already closed, the TCP RST caused an immediate 503 ("Connection reset
by peer") with zero bytes read.

- Set connection_timeout (5s), read_timeout (60s), write_timeout (60s),
  and idle_timeout (60s) on every upstream peer
- Retry once on connection failure via fail_to_connect + set_retry(true),
  which makes Pingora open a fresh connection instead of reusing a stale
  pooled one
The upstream includes content-length in HEAD responses (correct per
RFC 9110), but when proxied over HTTP/2, clients like curl interpret
it as a body-bytes promise and error with "transfer closed with N
bytes remaining". Cloudflare handles this by stripping content-length
on HEAD responses; do the same in upstream_response_filter.
- pingora-core 0.7.0 → 0.8.0 (HTTP Request Smuggling CVEs)
- aws-lc-sys 0.32.3 → 0.38.0 (PKCS7 bypass, timing side-channel)
- jsonwebtoken 9 → 10.3.0 in google-indexing-plugin (type confusion)
- Flask 3.0.0 → 3.1.3 in example app (session cookie fix)
…ode management

- Add alarm service with fire/resolve lifecycle and alarm entity/migration
- Add container health monitor with CPU, memory, restart, and OOM detection
- Implement node scheduler with multi-node deployment support
- Add node CLI commands and agent health check handlers
- Enhance proxy with multi-node upstream routing
- Add node management UI page with status and actions
- Add alarm detector pipeline design doc
…node management

- Fix old containers not being cleaned up on worker nodes during
  redeployment by routing teardown through RemoteNodeDeployer when
  container has a node_id
- Add encryption_service to MarkDeploymentCompleteJob and
  DeploymentService for decrypting node tokens during remote teardown
- Improve node health checks, proxy routing, and CLI node management
…and integration tests

Add container labels for reconciliation, list_containers agent endpoint,
anti-affinity environment settings, fix auth and WebSocket log tests,
and add multinode integration test suite.
The Role enum gained a Demo variant but the test still expected 6 roles.
@dviejokfs dviejokfs merged commit f4af6b4 into main Mar 10, 2026
5 checks passed