Add idle container support (scale-to-zero) #197
martijnenco wants to merge 1 commit into basecamp:main
Conversation
Pull request overview
Adds scale-to-zero (“idle container”) support to kamal-proxy by introducing a Docker socket client and an idle state machine that can stop containers after inactivity and wake them on demand, integrating the new behavior into service request handling, router reporting, CLI flags, and persisted state.
Changes:
- Introduce `DockerClient` (a Unix-socket HTTP client for Docker) and `IdleController` (an Active/Sleeping/Waking state machine).
- Integrate idle tracking and wake behavior into `Service`, and expose idle state via `Router.ListActiveServices`.
- Add CLI/config plumbing for the Docker socket path and idle deploy flags; add unit tests for the Docker and idle components.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| internal/server/testing.go | Update test server helper to pass Docker socket path into NewRouter. |
| internal/server/service_test.go | Update service creation helper for new NewService(..., docker) signature. |
| internal/server/service.go | Add idle options + integrate IdleController into request path, lifecycle, and JSON persistence. |
| internal/server/router_test.go | Update router tests for new NewRouter(statePath, dockerSocketPath) signature. |
| internal/server/router.go | Add router-level DockerClient, wire it into services and restored state, and expose idle state in listing. |
| internal/server/idle_controller_test.go | New tests covering idle->sleep and wake flow via a fake Docker API over a Unix socket. |
| internal/server/idle_controller.go | New idle state machine that stops/starts containers and waits for health on wake. |
| internal/server/docker_client_test.go | New tests validating Docker stop/start calls over Unix socket. |
| internal/server/docker_client.go | New minimal Docker HTTP client over a Unix socket. |
| internal/server/config.go | Add default Docker socket path constant and config field. |
| internal/cmd/run.go | Add --docker-socket flag and pass it into NewRouter. |
| internal/cmd/deploy.go | Add --idle-timeout / --idle-wake-timeout deploy flags. |
```go
IdleTimeout     time.Duration `json:"idle_timeout"`
IdleWakeTimeout time.Duration `json:"idle_wake_timeout"`
```
The new idle fields have no defaults/validation applied in ServiceOptions.Normalize(). If IdleTimeout > 0 but IdleWakeTimeout <= 0 (e.g., from older persisted state or a caller that doesn’t set the flag), wake requests can time out immediately. Consider normalizing IdleWakeTimeout to DefaultIdleWakeTimeout when idle is enabled (and rejecting negative durations).
```go
s.idleController.docker = s.docker
s.idleController.lb = s.active
s.idleController.serviceName = s.name
s.idleController.IdleTimeout = s.options.IdleTimeout
s.idleController.WakeTimeout = s.options.IdleWakeTimeout
```
initialize() updates IdleController internals (docker, lb, serviceName, timeouts) via direct assignment without taking the idle controller’s lock. Since the idle controller goroutine/request path can read these concurrently, this introduces data races. Prefer an IdleController.UpdateConfig(...) method that holds the lock and updates all mutable fields atomically.
```go
package server

import (
	"context"
```
The context import is unused in this test file, which will cause a compile failure. Remove the import (or use it if intended).
Suggested change: delete the unused `"context"` import line.
```go
ticker := time.NewTicker(10 * time.Second)
defer ticker.Stop()

for {
	select {
	case <-ic.closeChan:
		return
	case <-ticker.C:
```
run() uses a fixed time.NewTicker(10 * time.Second), so IdleTimeout values shorter than 10s can’t be honored (and longer ones can overshoot by up to the tick interval). Consider using a resettable time.Timer based on lastRequestAt/IdleTimeout, or deriving the tick interval from IdleTimeout with a sane min/max.
Suggested change — replace the fixed ticker:

```go
ticker := time.NewTicker(10 * time.Second)
defer ticker.Stop()
for {
	select {
	case <-ic.closeChan:
		return
	case <-ticker.C:
```

with a resettable timer derived from `lastRequestAt`/`IdleTimeout`:

```go
for {
	ic.lock.Lock()
	disabled := ic.disabled
	idleTimeout := ic.IdleTimeout
	lastRequestAt := ic.lastRequestAt
	ic.lock.Unlock()

	// If disabled or no idle timeout is configured, poll infrequently.
	if disabled || idleTimeout <= 0 {
		select {
		case <-ic.closeChan:
			return
		case <-time.After(time.Second):
			continue
		}
	}

	// Compute remaining time until we should consider the service idle.
	remaining := idleTimeout - time.Since(lastRequestAt)

	// If we've already exceeded the idle timeout, check immediately.
	if remaining <= 0 {
		select {
		case <-ic.closeChan:
			return
		default:
		}
		ic.checkIdle()
		continue
	}

	timer := time.NewTimer(remaining)
	select {
	case <-ic.closeChan:
		if !timer.Stop() {
			<-timer.C
		}
		return
	case <-timer.C:
```
```go
if ic.State != IdleStateActive {
	ic.State = IdleStateActive
	if ic.wakeChan != nil {
		close(ic.wakeChan)
		ic.wakeChan = nil
	}
```
UpdateContainers closes ic.wakeChan when moving back to active. The wake goroutine also closes ic.wakeChan, so this can cause a double-close panic under concurrent wake + deploy/update. Consider ensuring only one code path ever closes the channel (or guard channel ownership so it’s closed exactly once).
Suggested change — replace:

```go
if ic.State != IdleStateActive {
	ic.State = IdleStateActive
	if ic.wakeChan != nil {
		close(ic.wakeChan)
		ic.wakeChan = nil
	}
```

with:

```go
if ic.State != IdleStateActive {
	ic.State = IdleStateActive
```
```go
close(ic.wakeChan)
ic.wakeChan = nil
ic.lock.Unlock()
```
The wake goroutine unconditionally does close(ic.wakeChan). If wakeChan is nil (e.g., restored sleeping state) or already closed/cleared by another path (e.g., UpdateContainers), this will panic. Capture/clear the channel under lock and close it only if non-nil and still owned by this wake operation.
Suggested change — replace:

```go
close(ic.wakeChan)
ic.wakeChan = nil
ic.lock.Unlock()
```

with:

```go
wakeChan := ic.wakeChan
ic.wakeChan = nil
ic.lock.Unlock()
if wakeChan != nil {
	close(wakeChan)
}
```
```go
r.withWriteLock(func() error {
	r.services = NewServiceMap()
```
RestoreLastSavedState calls r.withWriteLock(...) without checking/returning its error. If the callback fails, the error will be dropped and the method will still return nil. Capture the returned error and propagate it.
```go
s.active = lb
if s.idleController != nil {
	s.idleController.lb = lb
	s.idleController.UpdateContainers(lb.WriteTargets().Names())
}
```
UpdateLoadBalancer mutates idleController.lb directly without taking the idle controller’s lock. The idle controller reads lb concurrently during wake (ic.lb.WaitUntilHealthy(...)), so this is a data race under concurrent deploy + traffic (and can trip -race). Add a setter on IdleController that updates lb under its lock, or update it inside IdleController.UpdateContainers(...) while holding the lock.
```go
for _, service := range services {
	service.docker = r.dockerClient
	service.initialize(service.options, service.targetOptions)
	r.services.Set(service)
```
RestoreLastSavedState calls service.initialize(...) but ignores the returned error. Since initialize returns an error, this won’t compile, and even if it did, failed initialization (e.g., cert/middleware errors) would be silently swallowed. Capture the error and return it from the withWriteLock callback so state restoration fails loudly when a service can’t be re-initialized.
```go
ic.lastRequestAt = time.Now()
ic.closeChan = make(chan bool)
```
UnmarshalJSON restarts the controller goroutine but doesn’t initialize wakeChan when restoring into a sleeping/waking state. That can lead to wake() eventually closing a nil channel (panic) and/or WaitIfSleeping() not actually waiting. Initialize wakeChan appropriately on restore (or normalize restored state back to active).
Suggested addition:

```go
switch ic.State {
case IdleStateSleeping, IdleStateWaking:
	ic.wakeChan = make(chan bool)
default:
	// For IdleStateActive, wakeChan can remain nil until needed.
}
```
Add automatic idle container management to kamal-proxy: containers are stopped after a configurable period of inactivity and transparently restarted when new requests arrive.
What changed
New deploy command flags: `--idle-timeout` and `--idle-wake-timeout` on `deploy`, plus `--docker-socket` on `run`.
New components: `DockerClient` (a minimal Docker HTTP client over a Unix socket) and `IdleController` (an Active/Sleeping/Waking state machine that stops containers after inactivity and waits for health on wake).
Integration into existing code: `Service` request handling, lifecycle, and JSON persistence; `Router` service listing and state restore; config defaults for the Docker socket path.
Why
Review apps and staging environments often sit idle for hours or days, consuming memory for nothing. This feature lets kamal-proxy automatically reclaim those resources and bring containers back in seconds when someone visits the app -- similar to how Fly.io Machines, Railway, and Render handle scale-to-zero. The implementation keeps the proxy as the single point of control (no sidecar containers) and reuses the existing health check infrastructure for wake-up verification.
Companion kamal PR that depends on this one: basecamp/kamal#1800