feat(multinode): add multi-node cluster support with worker agents#31
Merged
Conversation
Implement distributed deployment across multiple nodes connected via WireGuard private networking. Includes a node scheduler with round-robin distribution (control plane + workers), remote container deployment via the agent API, cross-node service connectivity with env var rewriting, and a multi-node-aware route table for proxy traffic routing.

Key changes:
- temps-agent crate: worker node agent with Docker runtime and auth
- temps-wireguard crate: WireGuard tunnel management
- Node scheduler: round-robin across local + remote nodes
- Remote deployer: deploy containers to worker nodes via HTTP API
- Cross-node env vars: rewrite service URLs for remote containers
- Route table: resolve node private addresses for proxy routing
- CLI: `temps agent`, `temps join`, `--private-address` flag
- External services bind to 0.0.0.0 for cross-node accessibility
- Migrations: nodes table, node_id columns on containers/configs
- Web UI: nodes management page in settings
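The round-robin distribution described above can be sketched as a small scheduler over the known nodes. The `Node` struct and field names here are illustrative stand-ins, not the actual scheduler types from the PR:

```rust
// Minimal round-robin node selection across control plane + workers.
// `Node` and its fields are hypothetical; the real scheduler also tracks
// health and (per a later commit) anti-affinity settings.
#[derive(Debug, Clone, PartialEq)]
pub struct Node {
    pub id: i64,
    pub private_address: String, // WireGuard/private IP
    pub is_control_plane: bool,
}

pub struct RoundRobinScheduler {
    nodes: Vec<Node>, // control plane first, then workers
    next: usize,
}

impl RoundRobinScheduler {
    pub fn new(nodes: Vec<Node>) -> Self {
        Self { nodes, next: 0 }
    }

    /// Pick the next node in rotation; None if the cluster has no nodes.
    pub fn pick(&mut self) -> Option<&Node> {
        if self.nodes.is_empty() {
            return None;
        }
        let idx = self.next % self.nodes.len();
        self.next = self.next.wrapping_add(1);
        Some(&self.nodes[idx])
    }
}
```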
… poll

The DB poll fallback in wait_for_route_ready only ran when the queue receiver timed out. If the queue carried a steady stream of unrelated events (other jobs, other environments), each one hit `continue`, so the receiver never timed out and the deadline check and poll never ran. Deployments therefore waited out the full 60s timeout and then failed/rolled back even though the route was already live. Move the deadline check and DB poll to the top of the loop so they run on every iteration regardless of queue activity.
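The fixed loop shape can be sketched with plain std primitives. `poll_db` and `recv_event` are hypothetical stand-ins for the real route-table poll and queue receiver; only the ordering (check deadline and poll first, then wait on the queue) mirrors the fix:

```rust
use std::time::{Duration, Instant};

// Deadline check and DB poll run at the top of every iteration, so a
// steady stream of unrelated queue events can no longer starve them.
fn wait_for_route_ready(
    timeout: Duration,
    mut poll_db: impl FnMut() -> bool,                       // true once the route is live
    mut recv_event: impl FnMut(Duration) -> Option<bool>,    // Some(relevant?) or None on timeout
) -> Result<(), &'static str> {
    let deadline = Instant::now() + timeout;
    loop {
        // Poll first, regardless of queue activity.
        if poll_db() {
            return Ok(());
        }
        let now = Instant::now();
        if now >= deadline {
            return Err("timed out waiting for route");
        }
        match recv_event(deadline - now) {
            Some(true) => return Ok(()),  // relevant event: route is ready
            Some(false) => continue,      // unrelated event: loop re-checks above
            None => continue,             // queue timeout: loop re-polls above
        }
    }
}
```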
…tside lock

Two changes to prevent mark_deployment_complete from blocking subsequent deployments:

1. Replace pg_advisory_lock (which blocks forever) with pg_try_advisory_lock in a retry loop with a 120s timeout. If a previous deployment is slow to complete, the next one times out with a clear error instead of hanging indefinitely.
2. Move cancel_previous_deployments (container teardown) outside the advisory lock. Teardown can be slow (10s graceful stop per container, multiple containers across old deployments) and doesn't need serialization: the route table already points to the new deployment by the time teardown runs.
All env var values are now encrypted before being stored in the database and transparently decrypted on read. Adds an `is_encrypted` boolean column (default false) for backward compatibility with existing unencrypted rows.

- Migration: adds `is_encrypted` flag to the `env_vars` table
- Entity: adds `is_encrypted` field to `env_vars::Model`
- EnvVarService (temps-environments, temps-projects): accepts `Arc<EncryptionService>`; encrypts on create/update, decrypts on every read; legacy rows pass through as-is
- WorkflowPlanner: decrypts env var values before injecting into deployments
- Typed error variants: `EncryptionFailed` and `DecryptionFailed` with key/id context
- 8 unit tests covering roundtrip, legacy passthrough, wrong-key error, invalid base64, nonce randomness, and service-level decrypt flows
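The flag handling can be illustrated in isolation. The XOR/hex scheme below is a deliberately toy stand-in for the real EncryptionService (which uses proper authenticated encryption with nonces); only the `is_encrypted` read/write logic mirrors the PR:

```rust
// Illustrative read/write path for the `is_encrypted` column.
pub struct EnvVarRow {
    pub value: String,
    pub is_encrypted: bool,
}

// Toy reversible transform (XOR + hex), NOT real cryptography.
fn toy_encrypt(plaintext: &str, key: u8) -> String {
    plaintext.bytes().map(|b| format!("{:02x}", b ^ key)).collect()
}

fn toy_decrypt(hex: &str, key: u8) -> Result<String, &'static str> {
    if hex.len() % 2 != 0 {
        return Err("truncated ciphertext");
    }
    let mut out = Vec::with_capacity(hex.len() / 2);
    for chunk in hex.as_bytes().chunks(2) {
        let pair = std::str::from_utf8(chunk).map_err(|_| "invalid hex")?;
        let byte = u8::from_str_radix(pair, 16).map_err(|_| "invalid hex")?;
        out.push(byte ^ key);
    }
    String::from_utf8(out).map_err(|_| "invalid utf8")
}

/// Read path: decrypt only rows written after the migration; legacy rows
/// pass through unchanged, which is what the default-false flag buys.
fn read_env_var(row: &EnvVarRow, key: u8) -> Result<String, &'static str> {
    if row.is_encrypted {
        toy_decrypt(&row.value, key)
    } else {
        Ok(row.value.clone())
    }
}

/// Write path: always encrypt and set the flag.
fn write_env_var(plaintext: &str, key: u8) -> EnvVarRow {
    EnvVarRow { value: toy_encrypt(plaintext, key), is_encrypted: true }
}
```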
…riting

get_service_effective_address() called get_effective_address(), which returns baremetal addresses (localhost:host_port) when DeploymentMode is not Docker. But get_runtime_env_vars() always generates env vars with Docker container names (e.g. postgres-mydb:5432), so the find-and-replace in build_remote_environment_variables() never matched. Add get_docker_container_name() and get_docker_internal_port() to the ExternalService trait and use them in get_service_effective_address() so it always returns Docker container names regardless of deployment mode.
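A sketch of the resulting trait shape. The method names follow the commit message; the `Postgres` impl and its naming convention are illustrative assumptions:

```rust
// The trait exposes the Docker container name and internal port directly,
// so the effective address is built from them regardless of DeploymentMode
// and matches what get_runtime_env_vars() emits.
trait ExternalService {
    fn get_docker_container_name(&self) -> String;
    fn get_docker_internal_port(&self) -> u16;

    /// Always a Docker-network address, so the find-and-replace in the
    /// remote env var builder can match and rewrite it.
    fn get_service_effective_address(&self) -> String {
        format!(
            "{}:{}",
            self.get_docker_container_name(),
            self.get_docker_internal_port()
        )
    }
}

// Hypothetical service impl matching the `postgres-mydb:5432` example.
struct Postgres {
    database: String,
}

impl ExternalService for Postgres {
    fn get_docker_container_name(&self) -> String {
        format!("postgres-{}", self.database)
    }
    fn get_docker_internal_port(&self) -> u16 {
        5432
    }
}
```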
PostgreSQL session-level advisory locks (pg_try_advisory_lock / pg_advisory_unlock) don't work correctly with Sea-ORM's connection pool: the lock and unlock calls can hit different pooled connections, leaving locks permanently held and blocking all subsequent deployments for that environment until the server restarts.

Replace them with a process-level tokio::sync::Mutex per environment_id stored in a static HashMap. This is correct because Temps runs as a single process, and the mutex is automatically released when the guard is dropped (even on panic/error paths).
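The lock table pattern can be sketched with std primitives. The commit uses tokio::sync::Mutex for the per-environment entries (so the lock can be held across awaits); std::sync::Mutex is used here only to keep the sketch self-contained, and the function name is hypothetical:

```rust
use std::collections::HashMap;
use std::sync::{Arc, LazyLock, Mutex};

// Process-level lock table: one mutex per environment_id in a static map.
// Correct only because the server runs as a single process.
static ENV_LOCKS: LazyLock<Mutex<HashMap<i64, Arc<Mutex<()>>>>> =
    LazyLock::new(|| Mutex::new(HashMap::new()));

/// Fetch (or lazily create) the lock for an environment. Callers hold the
/// returned mutex for the duration of mark_deployment_complete; the guard
/// releases on drop, including on panic/error paths.
fn lock_for_environment(environment_id: i64) -> Arc<Mutex<()>> {
    ENV_LOCKS
        .lock()
        .expect("lock table poisoned")
        .entry(environment_id)
        .or_insert_with(|| Arc::new(Mutex::new(())))
        .clone()
}
```

Cloning the `Arc` out of the map keeps the outer map lock short-lived: the map is only locked long enough to look up or insert the entry, never while a deployment runs.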
…ctions
The proxy had no upstream connection timeouts and no retry logic. When
Pingora reused a pooled keep-alive connection that the upstream had
already closed, the TCP RST caused an immediate 503 ("Connection reset
by peer") with zero bytes read.
- Set connection_timeout (5s), read_timeout (60s), write_timeout (60s),
and idle_timeout (60s) on every upstream peer
- Retry once on connection failure via fail_to_connect + set_retry(true),
which makes Pingora open a fresh connection instead of reusing a stale
pooled one
The upstream includes content-length in HEAD responses (correct per RFC 9110), but when proxied over HTTP/2, clients like curl interpret it as a body-bytes promise and error with "transfer closed with N bytes remaining". Cloudflare handles this by stripping content-length on HEAD responses; do the same in upstream_response_filter.
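The stripping rule itself is simple and can be shown in isolation. In the real proxy this runs inside Pingora's upstream_response_filter against its response header type; the plain HashMap and function name here are stand-ins:

```rust
use std::collections::HashMap;

// Drop content-length from any response to a HEAD request before it
// reaches an HTTP/2 client, mirroring Cloudflare's behavior. The header
// is valid per RFC 9110, but over HTTP/2 some clients (e.g. curl) treat
// it as a promise of body bytes and fail with "transfer closed".
fn strip_head_content_length(method: &str, headers: &mut HashMap<String, String>) {
    if method.eq_ignore_ascii_case("HEAD") {
        headers.remove("content-length");
    }
}
```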
- pingora-core 0.7.0 → 0.8.0 (HTTP Request Smuggling CVEs)
- aws-lc-sys 0.32.3 → 0.38.0 (PKCS7 bypass, timing side-channel)
- jsonwebtoken 9 → 10.3.0 in google-indexing-plugin (type confusion)
- Flask 3.0.0 → 3.1.3 in example app (session cookie fix)
…ode management

- Add alarm service with fire/resolve lifecycle and alarm entity/migration
- Add container health monitor with CPU, memory, restart, and OOM detection
- Implement node scheduler with multi-node deployment support
- Add node CLI commands and agent health check handlers
- Enhance proxy with multi-node upstream routing
- Add node management UI page with status and actions
- Add alarm detector pipeline design doc
…node management

- Fix old containers not being cleaned up on worker nodes during redeployment by routing teardown through RemoteNodeDeployer when the container has a node_id
- Add encryption_service to MarkDeploymentCompleteJob and DeploymentService for decrypting node tokens during remote teardown
- Improve node health checks, proxy routing, and CLI node management
…and integration tests

Add container labels for reconciliation, a list_containers agent endpoint, and anti-affinity environment settings; fix auth and WebSocket log tests; add a multinode integration test suite.
The Role enum gained a Demo variant but the test still expected 6 roles.
Summary
- Route table resolves `node.private_address:host_port` for containers on worker nodes, enabling traffic routing across the cluster

New crates

- `temps-agent` — worker node agent with Docker runtime, token-based auth, and deploy/status/stop/logs API
- `temps-wireguard` — WireGuard tunnel management for secure node-to-node networking

CLI commands

- `temps agent` — start a worker node agent
- `temps join` — register a worker node with the control plane (direct or relay mode)
- `temps serve --private-address` — set the control plane's private/WireGuard IP for cross-node connectivity

Infrastructure

- Migrations: `nodes` table, `node_id` columns on `deployment_containers` and `deployment_config`
- External services bind to `0.0.0.0` (was `127.0.0.1`) so they're reachable from worker nodes via the private network

Web UI

- Nodes management page in settings

Test plan

- `cargo check --lib` compiles cleanly
- `temps serve --private-address <wg-ip>` persists the address
- `temps join --token <token> --control-plane <url>` registers a worker node
- Cross-node env var rewriting works (`POSTGRES_URL` uses the private IP)