Skip to content

Commit a8f0593

Browse files
devantlerclaude
andauthored
fix: keep ksail-desktop running without Docker and after closing the window (#5354)
* fix(k3d): never os.Exit the host when the Docker runtime is unavailable k3d's embedded cobra commands call logrus Fatal on their own logger (k3dlogger.Log(), not the standard logger) when the Docker daemon is unreachable, and logrus Fatal calls os.Exit(1) — terminating the whole host process. This crashed ksail-desktop on boot and `ksail ui` on its first request (cluster discovery enumerates k3d clusters) whenever Docker Desktop was not running, and would also crash a Docker-down cluster create/update. Neutralize it once in NewProvisioner by permanently replacing k3d's logger ExitFunc with one that panics a sentinel (ErrK3dRuntimeUnavailable), and route every site that runs a k3d command through a single recover seam, runK3dSafely, which converts that panic into a returned error and re-raises any other panic. Because the ExitFunc is process-global, the seam must cover every site: List (discovery), the create/delete/start/stop lifecycle commands, and the add/remove/list agent-node commands used by `cluster update` — including one that runs in its own goroutine, where an unrecovered panic would be uncatchable. silenceK3dLogging now only mutes output (the lifecycle/node paths keep k3d progress visible). Other distributions are unaffected: kind/talos/ vcluster use SDKs, and kwok's embedded kwokctl commands use cobra RunE (returning errors) — k3d is the only distribution embedding Fatal-calling commands. Adds real-runtime regression tests that point DOCKER_HOST at a dead socket and assert List and all seven k3d command sites return an error instead of exiting the process. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(desktop): hide window to tray on close instead of destroying it ksail-desktop is a menu-bar app, but Wails v3's default WindowClosing listener destroys the window when it is closed (red button / Cmd+W). Once destroyed, the tray's "Show KSail" only tries to recreate the window — a no-op on this Wails alpha, so nothing happens — and "Hide KSail" calls hide on the freed native window, crashing the whole process. Those are the two reported tray failures. Register a WindowClosing hook that cancels the close and hides the window instead. Hooks run before listeners and short-circuit them when the event is cancelled, so the destroying listener is skipped and the single window stays alive and reopenable for the app's lifetime. Quitting is unaffected: the tray's "Quit KSail" and Cmd+Q call app.Quit() directly, which never emits WindowClosing and so bypasses the hook. Adds a unit test for the hook's cancel-and-hide behavior (the hook body is split behind a small interface so it can be exercised without a live GUI window). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(k3d): restore os.Stdout after the command goroutine to avoid a data race The List and node-list paths redirect the process-global os.Stdout to a pipe to capture k3d's JSON output, run the k3d command in a goroutine, then restore os.Stdout. k3d's docker client reads os.Stdout (via moby/term.StdStreams) while building its CLI, so the restore on the main goroutine raced that read: io.Copy unblocking on pipe EOF is not a happens-before edge the race detector recognizes, and the restore ran before the errChan receive that actually synchronizes with the goroutine. CI runs the suite with -race and flagged it; the new real-runtime tests are the first to exercise these paths against the real k3d runtime and so exposed a pre-existing race. Move the errChan receive ahead of the os.Stdout restore in both defaultListClustersRaw and listAgentNodes so the restore happens-after the goroutine's stdout access. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 37833bb commit a8f0593

6 files changed

Lines changed: 384 additions & 32 deletions

File tree

desktop/main.go

Lines changed: 38 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -130,6 +130,11 @@ func run() error {
130130
func(event *application.ApplicationEvent) { handleDeepLink(window, event.Context().URL()) },
131131
)
132132

133+
// This is a menu-bar app: closing the window (red button / ⌘W / menu Close) hides it to the tray
134+
// instead of destroying it, so it stays reopenable via the tray's "Show KSail". Must be registered
135+
// before Run(). See hideWindowOnClose.
136+
hideWindowOnClose(window)
137+
133138
// A menu-bar/system-tray icon for quick access: show/hide the window or quit without going through
134139
// the Dock. Must be configured before Run() (which blocks until shutdown).
135140
installSystemTray(app, window)
@@ -170,3 +175,36 @@ func installSystemTray(app *application.App, window *application.WebviewWindow)
170175
trayMenu.Add("Quit KSail").OnClick(func(_ *application.Context) { app.Quit() })
171176
tray.SetMenu(trayMenu)
172177
}
178+
179+
// windowHider is the subset of *application.WebviewWindow the close-to-tray hook needs. Declaring it as
180+
// an interface lets the hook's behavior be unit-tested without a live Wails window (which needs a GUI).
181+
type windowHider interface {
182+
Hide() application.Window
183+
}
184+
185+
// hideWindowOnClose makes closing the window hide it to the menu bar/tray instead of destroying it, so
186+
// the single window stays alive and reopenable from the tray's "Show KSail" for the app's lifetime.
187+
// Quitting is explicit — the tray's "Quit KSail" (app.Quit) and ⌘Q both terminate the app process
188+
// directly, bypassing this hook.
189+
//
190+
// Wails v3's built-in WindowClosing handler is a *listener* that destroys the window: it marks it
191+
// destroyed, closes the native window, and removes it from the app's window registry. Once destroyed,
192+
// the tray's window.Show() can only try to recreate the window (a no-op on this alpha, so "nothing
193+
// happens"), and window.Hide() invokes hide on the freed native window, crashing the whole process —
194+
// exactly the two reported tray failures. A *hook* runs before listeners and, when it cancels the
195+
// event, short-circuits them (see WebviewWindow.HandleWindowEvent), so cancelling here skips the
196+
// destroy and we hide instead. On macOS the native windowShouldClose: already returns false (it never
197+
// closes the NSWindow itself; the Go listener did the destroying), so cancelling fully prevents it.
198+
func hideWindowOnClose(window *application.WebviewWindow) {
199+
window.RegisterHook(events.Common.WindowClosing, hideOnCloseHook(window))
200+
}
201+
202+
// hideOnCloseHook is the WindowClosing hook body: cancel the close (so Wails' destroying listener is
203+
// skipped) then hide the window. Split from hideWindowOnClose so it can be unit-tested against a fake
204+
// windowHider and a real application.WindowEvent.
205+
func hideOnCloseHook(window windowHider) func(*application.WindowEvent) {
206+
return func(event *application.WindowEvent) {
207+
event.Cancel()
208+
window.Hide()
209+
}
210+
}

desktop/window_close_test.go

Lines changed: 41 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,41 @@
1+
package main
2+
3+
import (
4+
"testing"
5+
6+
"github.com/wailsapp/wails/v3/pkg/application"
7+
)
8+
9+
// fakeHider records Hide() calls; it stands in for *application.WebviewWindow so the close-to-tray
10+
// hook can be exercised without a live Wails window (which would need a GUI).
11+
type fakeHider struct{ hideCalls int }
12+
13+
func (f *fakeHider) Hide() application.Window {
14+
f.hideCalls++
15+
16+
return nil
17+
}
18+
19+
// TestHideOnCloseHook_CancelsCloseAndHides verifies the WindowClosing hook cancels the close — which
20+
// is what skips Wails' default window-destroying listener — and hides the window instead. Without the
21+
// cancel, Wails destroys the window on close, after which the tray's Show is a no-op and Hide crashes
22+
// the process (the two reported failures).
23+
func TestHideOnCloseHook_CancelsCloseAndHides(t *testing.T) {
24+
t.Parallel()
25+
26+
hider := &fakeHider{}
27+
event := application.NewWindowEvent()
28+
29+
hideOnCloseHook(hider)(event)
30+
31+
if !event.IsCancelled() {
32+
t.Error("close must be cancelled so Wails does not destroy the window")
33+
}
34+
35+
if hider.hideCalls != 1 {
36+
t.Errorf(
37+
"window must be hidden exactly once (close-to-tray); got %d Hide call(s)",
38+
hider.hideCalls,
39+
)
40+
}
41+
}

pkg/svc/provisioner/cluster/k3d/export_test.go

Lines changed: 29 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5,10 +5,39 @@ import (
55
"time"
66

77
"github.com/devantler-tech/ksail/v7/pkg/runner"
8+
"github.com/devantler-tech/ksail/v7/pkg/svc/provisioner/cluster/clusterupdate"
89
k3kv1beta1 "github.com/rancher/k3k/pkg/apis/k3k.io/v1beta1"
910
corev1 "k8s.io/api/core/v1"
1011
)
1112

13+
// ListAgentNodesForTest exposes listAgentNodes (the pipe+goroutine node-list path)
14+
// so its runtime-down behavior can be exercised.
15+
func (k *Provisioner) ListAgentNodesForTest(ctx context.Context, clusterName string) error {
16+
_, err := k.listAgentNodes(ctx, clusterName)
17+
18+
return err
19+
}
20+
21+
// AddAgentNodesForTest exposes addAgentNodes (a synchronous node-command site) so
22+
// its runtime-down behavior can be exercised.
23+
func (k *Provisioner) AddAgentNodesForTest(
24+
ctx context.Context,
25+
clusterName string,
26+
count int,
27+
) error {
28+
return k.addAgentNodes(ctx, clusterName, count, &clusterupdate.UpdateResult{})
29+
}
30+
31+
// RemoveAgentNodesForTest exposes removeAgentNodes (a synchronous node-command
32+
// site) so its runtime-down behavior can be exercised.
33+
func (k *Provisioner) RemoveAgentNodesForTest(
34+
ctx context.Context,
35+
clusterName string,
36+
count int,
37+
) error {
38+
return k.removeAgentNodes(ctx, clusterName, count, &clusterupdate.UpdateResult{})
39+
}
40+
1241
// K3kReadyTimeoutForTest exposes k3kReadyTimeout for unit testing.
1342
func K3kReadyTimeoutForTest() time.Duration {
1443
return k3kReadyTimeout()

pkg/svc/provisioner/cluster/k3d/provisioner.go

Lines changed: 135 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,7 @@ import (
44
"bytes"
55
"context"
66
"encoding/json"
7+
"errors"
78
"fmt"
89
"io"
910
"os"
@@ -18,10 +19,23 @@ import (
1819
"github.com/devantler-tech/ksail/v7/pkg/svc/provisioner/cluster/clusterupdate"
1920
clustercommand "github.com/k3d-io/k3d/v5/cmd/cluster"
2021
v1alpha5 "github.com/k3d-io/k3d/v5/pkg/config/v1alpha5"
22+
k3dlogger "github.com/k3d-io/k3d/v5/pkg/logger"
2123
"github.com/sirupsen/logrus"
2224
"github.com/spf13/cobra"
2325
)
2426

27+
// ErrK3dRuntimeUnavailable is returned by the k3d provisioner when the k3d
28+
// runtime (the Docker daemon) is unavailable. k3d's embedded cobra commands react
29+
// to an unreachable runtime by calling logrus Fatal on their own logger, which by
30+
// default calls os.Exit(1) and would terminate the whole host process — fatal for
31+
// a long-lived host such as the desktop app or `ksail ui`, which must keep running
32+
// when Docker Desktop is not started. NewProvisioner permanently rewires k3d's
33+
// logger to raise this sentinel via panic instead, and every site that runs a k3d
34+
// command routes through runK3dSafely, which recovers it into this error — so a
35+
// Docker-down list, lifecycle, or node-scaling operation fails gracefully rather
36+
// than crashing.
37+
var ErrK3dRuntimeUnavailable = errors.New("k3d runtime unavailable (is the Docker daemon running?)")
38+
2539
var (
2640
// listMutex protects concurrent access to os.Stdout during List operations.
2741
// This is required because k3d writes directly to os.Stdout before Cobra's output redirection takes effect.
@@ -78,6 +92,21 @@ func NewProvisioner(
7892
TimestampFormat: "2006-01-02T15:04:05Z",
7993
})
8094
logrus.SetLevel(logrus.InfoLevel)
95+
96+
// k3d logs through its OWN logrus instance (k3dlogger.Log(), not the standard
97+
// logger above), and its embedded cobra commands call logrus Fatal on it when
98+
// the runtime (Docker) is unreachable — Create alone has 19 such calls. logrus
99+
// Fatal invokes ExitFunc, which defaults to os.Exit(1) and would terminate the
100+
// whole host process: fatal for a long-lived host such as the desktop app or
101+
// `ksail ui`, where running create/delete/start/stop (or background discovery)
102+
// with Docker stopped must yield an error, not a crash. Permanently replace
103+
// ExitFunc with one that panics ErrK3dRuntimeUnavailable; every site that runs a
104+
// k3d command routes through runK3dSafely, which recovers it into an ordinary
105+
// error. Set once and never restored —
106+
// embedded k3d must never os.Exit the host, and a fixed value avoids races
107+
// between concurrent commands sharing this process-global logger (a long-running
108+
// Create must not race a quick List save/restore of the shared ExitFunc).
109+
k3dlogger.Log().ExitFunc = func(int) { panic(ErrK3dRuntimeUnavailable) }
81110
})
82111

83112
prov := &Provisioner{
@@ -235,25 +264,28 @@ func (k *Provisioner) SetComponentDetector(d *detector.ComponentDetector) {
235264
// JSON output. k3d's PrintClusters writes directly to os.Stdout using
236265
// fmt.Println (not Cobra's cmd.OutOrStdout()), so the output is captured by
237266
// temporarily redirecting os.Stdout.
267+
//
268+
// k3d reacts to an unreachable runtime (Docker down) by calling logrus Fatal,
269+
// which by default calls os.Exit(1) and would kill the whole host process. That
270+
// Fatal is neutralized once in NewProvisioner (k3dlogger ExitFunc → panic
271+
// ErrK3dRuntimeUnavailable); the list goroutine (runListCommandToPipe) recovers
272+
// the panic into a normal error return, and silenceK3dLogging keeps k3d's noise
273+
// off the console while discovery captures the cluster JSON.
238274
func (k *Provisioner) defaultListClustersRaw(ctx context.Context) (string, error) {
239-
// Lock first so that the logrus save/restore and the os.Stdout save/restore
275+
// Lock first so that the logger save/restore and the os.Stdout save/restore
240276
// are both protected by the same critical section. Without this ordering a
241-
// second concurrent caller could read originalLogOutput as io.Discard (the
277+
// second concurrent caller could read the saved output as io.Discard (the
242278
// value set by the first caller) and later restore to io.Discard, leaving
243-
// the global logrus logger permanently muted.
279+
// the global loggers permanently muted.
244280
listMutex.Lock()
245281

246-
// Temporarily redirect logrus to discard output during list
247-
// to prevent log messages from appearing in console.
248-
originalLogOutput := logrus.StandardLogger().Out
249-
250-
logrus.SetOutput(io.Discard)
282+
restoreLogging := silenceK3dLogging()
251283

252284
originalStdout := os.Stdout
253285

254286
pipeReader, pipeWriter, err := os.Pipe()
255287
if err != nil {
256-
logrus.SetOutput(originalLogOutput)
288+
restoreLogging()
257289
listMutex.Unlock()
258290

259291
return "", fmt.Errorf("cluster list: create stdout pipe: %w", err)
@@ -265,30 +297,28 @@ func (k *Provisioner) defaultListClustersRaw(ctx context.Context) (string, error
265297
// while the command is running (otherwise it may block on a full pipe buffer)
266298
errChan := make(chan error, 1)
267299

268-
go func() {
269-
_, runErr := k.runListCommand(ctx)
270-
// Close write end to signal EOF to the reader
271-
_ = pipeWriter.Close()
272-
273-
errChan <- runErr
274-
}()
300+
go k.runListCommandToPipe(ctx, pipeWriter, errChan)
275301

276302
// Read all output from the pipe (this is the JSON from k3d)
277303
var outputBuf bytes.Buffer
278304

279305
_, copyErr := io.Copy(&outputBuf, pipeReader)
280306
_ = pipeReader.Close()
281307

282-
// Restore stdout and logrus, then release the lock.
308+
// Wait for the command goroutine to finish BEFORE restoring os.Stdout. The
309+
// channel receive is the happens-before edge with the goroutine's access to
310+
// os.Stdout — k3d's docker client reads it via moby/term.StdStreams — so
311+
// restoring earlier races that read under the race detector. (The pipe EOF that
312+
// unblocked io.Copy is not a tracked synchronization edge.)
313+
runErr := <-errChan
314+
315+
// Restore stdout and the loggers, then release the lock.
283316
os.Stdout = originalStdout
284317

285-
logrus.SetOutput(originalLogOutput)
318+
restoreLogging()
286319

287320
listMutex.Unlock()
288321

289-
// Wait for command to complete and get any error
290-
runErr := <-errChan
291-
292322
if copyErr != nil {
293323
return "", fmt.Errorf("cluster list: read stdout pipe: %w", copyErr)
294324
}
@@ -300,6 +330,53 @@ func (k *Provisioner) defaultListClustersRaw(ctx context.Context) (string, error
300330
return strings.TrimSpace(outputBuf.String()), nil
301331
}
302332

333+
// runListCommandToPipe runs the k3d cluster-list command as its own goroutine (so
334+
// the caller can drain the pipe concurrently — k3d writes its JSON directly to
335+
// os.Stdout), closing pipeWriter to signal EOF and reporting the command's error
336+
// (or ErrK3dRuntimeUnavailable, via runK3dSafely) on errChan.
337+
func (k *Provisioner) runListCommandToPipe(
338+
ctx context.Context,
339+
pipeWriter *os.File,
340+
errChan chan<- error,
341+
) {
342+
// Always close the write end so the pipe reader (io.Copy) unblocks — whether
343+
// the command returns, errors, or its runtime-down Fatal is recovered.
344+
defer func() { _ = pipeWriter.Close() }()
345+
346+
errChan <- k.runK3dSafely(func() error {
347+
_, runErr := k.runListCommand(ctx)
348+
349+
return runErr
350+
})
351+
}
352+
353+
// silenceK3dLogging mutes both the standard logrus logger and k3d's dedicated
354+
// logger for the duration of a list, so k3d's progress lines and its (already
355+
// neutralized) runtime-down Fatal message do not leak to the console while
356+
// discovery captures the cluster JSON from os.Stdout. It returns a function that
357+
// restores both, which the caller must invoke before releasing listMutex.
358+
//
359+
// The process-exiting Fatal itself is neutralized once and for all in
360+
// NewProvisioner (k3dlogger ExitFunc → panic ErrK3dRuntimeUnavailable), not here;
361+
// this only silences output. k3d logs through its own logrus instance
362+
// (k3dlogger.Log()), not the standard logger, so muting only
363+
// logrus.StandardLogger() would not suppress k3d's output — both are handled.
364+
func silenceK3dLogging() func() {
365+
originalStdOut := logrus.StandardLogger().Out
366+
367+
logrus.SetOutput(io.Discard)
368+
369+
k3dLog := k3dlogger.Log()
370+
originalK3dOut := k3dLog.Out
371+
372+
k3dLog.SetOutput(io.Discard)
373+
374+
return func() {
375+
logrus.SetOutput(originalStdOut)
376+
k3dLog.SetOutput(originalK3dOut)
377+
}
378+
}
379+
303380
// runListCommand executes the k3d cluster list command and returns the output.
304381
func (k *Provisioner) runListCommand(ctx context.Context) (string, error) {
305382
cmd := clustercommand.NewCmdClusterList()
@@ -399,10 +476,46 @@ func (k *Provisioner) runLifecycleCommand(
399476
}
400477
}
401478

402-
_, runErr := k.runner.Run(ctx, cmd, args)
479+
runErr := k.runK3dSafely(func() error {
480+
_, runErr := k.runner.Run(ctx, cmd, args)
481+
482+
return runErr //nolint:wrapcheck // wrapped below with the operation prefix
483+
})
403484
if runErr != nil {
404485
return fmt.Errorf("%s: %w", errorPrefix, runErr)
405486
}
406487

407488
return nil
408489
}
490+
491+
// runK3dSafely runs action and converts the runtime-down panic raised by k3d's
492+
// neutralized Fatal (the ExitFunc set in NewProvisioner) into a returned
493+
// ErrK3dRuntimeUnavailable, so a Docker-down k3d command fails with an error
494+
// instead of terminating the host process (e.g. the desktop app, where these run
495+
// in background goroutines). It is the single recover seam every site that runs an
496+
// embedded k3d command routes through — List, the create/delete/start/stop
497+
// lifecycle commands, and the node create/delete/list commands in update.go — so
498+
// "embedded k3d can never exit the host" holds package-wide rather than as a
499+
// per-call-site checklist. A command may have partially run, so the panic is
500+
// surfaced as an error and execution does not continue past it; any non-sentinel
501+
// panic is a genuine bug and is re-raised.
502+
func (k *Provisioner) runK3dSafely(action func() error) (err error) {
503+
defer func() {
504+
recovered := recover()
505+
if recovered == nil {
506+
return
507+
}
508+
509+
recoveredErr, ok := recovered.(error)
510+
if ok && errors.Is(recoveredErr, ErrK3dRuntimeUnavailable) {
511+
err = ErrK3dRuntimeUnavailable
512+
513+
return
514+
}
515+
516+
// Not our sentinel: a genuine bug. Re-panic so it is not swallowed.
517+
panic(recovered)
518+
}()
519+
520+
return action()
521+
}

0 commit comments

Comments
 (0)