Skip to content

Add idle container support (scale-to-zero)#197

Draft
martijnenco wants to merge 1 commit intobasecamp:mainfrom
martijnenco:feature/add-support-for-review-apps
Draft

Add idle container support (scale-to-zero)#197
martijnenco wants to merge 1 commit intobasecamp:mainfrom
martijnenco:feature/add-support-for-review-apps

Conversation

@martijnenco
Copy link
Copy Markdown

@martijnenco martijnenco commented Mar 12, 2026

Add automatic idle container management to kamal-proxy: containers are stopped after a configurable period of inactivity and transparently restarted when new requests arrive.

What changed
New deploy command flags:

  • --idle-timeout -- duration of inactivity before the container is stopped (e.g. 300s). Defaults to 0 (disabled).
  • --idle-wake-timeout -- maximum time to hold an incoming request while the container is starting back up. Defaults to 30s.

New components:

  • DockerClient (docker_client.go) -- lightweight HTTP client that communicates with the Docker daemon over a Unix socket (/var/run/docker.sock) to stop and start containers. No external Docker SDK dependency.
  • IdleController (idle_controller.go) -- per-service state machine (Active -> Sleeping -> Waking -> Active) that tracks request activity, puts containers to sleep after the idle timeout, and wakes them on demand.

Integration into existing code:

  • Service now hosts an IdleController alongside the existing PauseController. Every proxied request updates the idle timer. Health check requests (/up) do not count as activity and do not trigger a wake.
  • Router.ListActiveServices reports idle state (sleeping, waking) so operators can see which services are asleep via kamal-proxy list.
  • The run command accepts a --docker-socket flag (defaulting to /var/run/docker.sock) to configure the socket path.
  • State is persisted in the existing state file so idle services survive proxy restarts.

Why
Review apps and staging environments often sit idle for hours or days, consuming memory for nothing. This feature lets kamal-proxy automatically reclaim those resources and bring containers back in seconds when someone visits the app -- similar to how Fly.io Machines, Railway, and Render handle scale-to-zero. The implementation keeps the proxy as the single point of control (no sidecar containers) and reuses the existing health check infrastructure for wake-up verification.

Dependent to this PR for kamal: basecamp/kamal#1800

Copilot AI review requested due to automatic review settings March 12, 2026 09:36
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds scale-to-zero (“idle container”) support to kamal-proxy by introducing a Docker socket client and an idle state machine that can stop containers after inactivity and wake them on demand, integrating the new behavior into service request handling, router reporting, CLI flags, and persisted state.

Changes:

  • Introduce DockerClient (Unix-socket HTTP client to Docker) and IdleController (Active/Sleeping/Waking state machine).
  • Integrate idle tracking/wake behavior into Service and expose idle state via Router.ListActiveServices.
  • Add CLI/config plumbing for Docker socket path and idle deploy flags; add unit tests for Docker/idle components.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
internal/server/testing.go Update test server helper to pass Docker socket path into NewRouter.
internal/server/service_test.go Update service creation helper for new NewService(..., docker) signature.
internal/server/service.go Add idle options + integrate IdleController into request path, lifecycle, and JSON persistence.
internal/server/router_test.go Update router tests for new NewRouter(statePath, dockerSocketPath) signature.
internal/server/router.go Add router-level DockerClient, wire it into services and restored state, and expose idle state in listing.
internal/server/idle_controller_test.go New tests covering idle->sleep and wake flow via a fake Docker API over a Unix socket.
internal/server/idle_controller.go New idle state machine that stops/starts containers and waits for health on wake.
internal/server/docker_client_test.go New tests validating Docker stop/start calls over Unix socket.
internal/server/docker_client.go New minimal Docker HTTP client over a Unix socket.
internal/server/config.go Add default Docker socket path constant and config field.
internal/cmd/run.go Add --docker-socket flag and pass it into NewRouter.
internal/cmd/deploy.go Add --idle-timeout / --idle-wake-timeout deploy flags.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +96 to +97
IdleTimeout time.Duration `json:"idle_timeout"`
IdleWakeTimeout time.Duration `json:"idle_wake_timeout"`
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new idle fields have no defaults/validation applied in ServiceOptions.Normalize(). If IdleTimeout > 0 but IdleWakeTimeout <= 0 (e.g., from older persisted state or a caller that doesn’t set the flag), wake requests can time out immediately. Consider normalizing IdleWakeTimeout to DefaultIdleWakeTimeout when idle is enabled (and rejecting negative durations).

Copilot uses AI. Check for mistakes.
Comment on lines +387 to +391
s.idleController.docker = s.docker
s.idleController.lb = s.active
s.idleController.serviceName = s.name
s.idleController.IdleTimeout = s.options.IdleTimeout
s.idleController.WakeTimeout = s.options.IdleWakeTimeout
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initialize() updates IdleController internals (docker, lb, serviceName, timeouts) via direct assignment without taking the idle controller’s lock. Since the idle controller goroutine/request path can read these concurrently, this introduces data races. Prefer an IdleController.UpdateConfig(...) method that holds the lock and updates all mutable fields atomically.

Copilot uses AI. Check for mistakes.
package server

import (
"context"
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The context import is unused in this test file, which will cause a compile failure. Remove the import (or use it if intended).

Suggested change
"context"

Copilot uses AI. Check for mistakes.
Comment on lines +169 to +176
ticker := time.NewTicker(10 * time.Second)
defer ticker.Stop()

for {
select {
case <-ic.closeChan:
return
case <-ticker.C:
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

run() uses a fixed time.NewTicker(10 * time.Second), so IdleTimeout values shorter than 10s can’t be honored (and longer ones can overshoot by up to the tick interval). Consider using a resettable time.Timer based on lastRequestAt/IdleTimeout, or deriving the tick interval from IdleTimeout with a sane min/max.

Suggested change
ticker := time.NewTicker(10 * time.Second)
defer ticker.Stop()
for {
select {
case <-ic.closeChan:
return
case <-ticker.C:
for {
ic.lock.Lock()
disabled := ic.disabled
idleTimeout := ic.IdleTimeout
lastRequestAt := ic.lastRequestAt
ic.lock.Unlock()
// If disabled or no idle timeout is configured, poll infrequently.
if disabled || idleTimeout <= 0 {
select {
case <-ic.closeChan:
return
case <-time.After(time.Second):
continue
}
}
// Compute remaining time until we should consider the service idle.
remaining := idleTimeout - time.Since(lastRequestAt)
// If we've already exceeded the idle timeout, check immediately.
if remaining <= 0 {
select {
case <-ic.closeChan:
return
default:
}
ic.checkIdle()
continue
}
timer := time.NewTimer(remaining)
select {
case <-ic.closeChan:
if !timer.Stop() {
<-timer.C
}
return
case <-timer.C:

Copilot uses AI. Check for mistakes.
Comment on lines +141 to +147

if ic.State != IdleStateActive {
ic.State = IdleStateActive
if ic.wakeChan != nil {
close(ic.wakeChan)
ic.wakeChan = nil
}
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UpdateContainers closes ic.wakeChan when moving back to active. The wake goroutine also closes ic.wakeChan, so this can cause a double-close panic under concurrent wake + deploy/update. Consider ensuring only one code path ever closes the channel (or guard channel ownership so it’s closed exactly once).

Suggested change
if ic.State != IdleStateActive {
ic.State = IdleStateActive
if ic.wakeChan != nil {
close(ic.wakeChan)
ic.wakeChan = nil
}
if ic.State != IdleStateActive {
ic.State = IdleStateActive

Copilot uses AI. Check for mistakes.
Comment on lines +248 to +250
close(ic.wakeChan)
ic.wakeChan = nil
ic.lock.Unlock()
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wake goroutine unconditionally does close(ic.wakeChan). If wakeChan is nil (e.g., restored sleeping state) or already closed/cleared by another path (e.g., UpdateContainers), this will panic. Capture/clear the channel under lock and close it only if non-nil and still owned by this wake operation.

Suggested change
close(ic.wakeChan)
ic.wakeChan = nil
ic.lock.Unlock()
wakeChan := ic.wakeChan
ic.wakeChan = nil
ic.lock.Unlock()
if wakeChan != nil {
close(wakeChan)
}

Copilot uses AI. Check for mistakes.
Comment on lines 82 to 83
r.withWriteLock(func() error {
r.services = NewServiceMap()
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RestoreLastSavedState calls r.withWriteLock(...) without checking/returning its error. If the callback fails, the error will be dropped and the method will still return nil. Capture the returned error and propagate it.

Copilot uses AI. Check for mistakes.
Comment on lines 188 to +192
s.active = lb
if s.idleController != nil {
s.idleController.lb = lb
s.idleController.UpdateContainers(lb.WriteTargets().Names())
}
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UpdateLoadBalancer mutates idleController.lb directly without taking the idle controller’s lock. The idle controller reads lb concurrently during wake (ic.lb.WaitUntilHealthy(...)), so this is a data race under concurrent deploy + traffic (and can trip -race). Add a setter on IdleController that updates lb under its lock, or update it inside IdleController.UpdateContainers(...) while holding the lock.

Copilot uses AI. Check for mistakes.
Comment on lines 84 to 87
for _, service := range services {
service.docker = r.dockerClient
service.initialize(service.options, service.targetOptions)
r.services.Set(service)
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RestoreLastSavedState calls service.initialize(...) but ignores the returned error. Since initialize returns an error, this won’t compile, and even if it did, failed initialization (e.g., cert/middleware errors) would be silently swallowed. Capture the error and return it from the withWriteLock callback so state restoration fails loudly when a service can’t be re-initialized.

Copilot uses AI. Check for mistakes.

ic.lastRequestAt = time.Now()
ic.closeChan = make(chan bool)

Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UnmarshalJSON restarts the controller goroutine but doesn’t initialize wakeChan when restoring into a sleeping/waking state. That can lead to wake() eventually closing a nil channel (panic) and/or WaitIfSleeping() not actually waiting. Initialize wakeChan appropriately on restore (or normalize restored state back to active).

Suggested change
switch ic.State {
case IdleStateSleeping, IdleStateWaking:
ic.wakeChan = make(chan bool)
default:
// For IdleStateActive, wakeChan can remain nil until needed.
}

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants