Data Race in Application Shutdown: stop() vs terminate() #246

@wayahead

Description

Bug Report: Data Race in Application Shutdown

Summary

There is a data race between application.stop() and application.terminate() during node shutdown. The race occurs because both methods access shared application state concurrently without proper synchronization.

Environment

  • Ergo Framework Version: v1.999.310 (also reproduced in v320 branch commit 6914fac)
  • Go Version: 1.25.5
  • OS: macOS Darwin 25.2.0 (arm64), also reproducible on Linux
  • Race Detector: go test -race

Minimal Reproduction

package main

import (
	"testing"
	"time"

	"ergo.services/ergo"
	"ergo.services/ergo/act"
	"ergo.services/ergo/gen"
)

// SimpleActor is a minimal actor for testing
type SimpleActor struct {
	act.Actor
}

func (a *SimpleActor) Init(args ...any) error {
	return nil
}

func CreateSimpleActor() gen.ProcessBehavior {
	return &SimpleActor{}
}

// SimpleApp is a minimal application
type SimpleApp struct{}

func (a *SimpleApp) Load(node gen.Node, args ...any) (gen.ApplicationSpec, error) {
	return gen.ApplicationSpec{
		Name: "simple_app",
		Mode: gen.ApplicationModeTransient,
		Group: []gen.ApplicationMemberSpec{
			{
				Name:    "simple_sup",
				Factory: createSimpleSupervisor,
			},
		},
	}, nil
}

func (a *SimpleApp) Start(mode gen.ApplicationMode) {}
func (a *SimpleApp) Terminate(reason error)         {}

func createSimpleSupervisor() gen.ProcessBehavior {
	return &SimpleSupervisor{}
}

type SimpleSupervisor struct {
	act.Supervisor
}

func (s *SimpleSupervisor) Init(args ...any) (act.SupervisorSpec, error) {
	return act.SupervisorSpec{
		Type: act.SupervisorTypeOneForOne,
		Children: []act.SupervisorChildSpec{
			{Name: "worker1", Factory: CreateSimpleActor},
			{Name: "worker2", Factory: CreateSimpleActor},
			{Name: "worker3", Factory: CreateSimpleActor},
		},
	}, nil
}

func TestRaceCondition(t *testing.T) {
	for i := 0; i < 10; i++ {
		node, err := ergo.StartNode("test@localhost", gen.NodeOptions{
			Applications: []gen.ApplicationBehavior{&SimpleApp{}},
			Network: gen.NetworkOptions{
				Mode: gen.NetworkModeDisabled,
			},
		})
		if err != nil {
			t.Fatalf("failed to start node: %v", err)
		}

		// Let the app initialize
		time.Sleep(100 * time.Millisecond)

		// This triggers the race
		node.Stop()
		node.Wait()
	}
}

Run with: go test -race -run TestRaceCondition

Stack Trace

==================
WARNING: DATA RACE
Read at 0x00c0000008a0 by goroutine 262:
  ergo.services/ergo/node.(*application).terminate()
      /Users/kevin/go/pkg/mod/ergo.services/[email protected]/node/application.go:192 +0x638
  ergo.services/ergo/node.(*node).unregisterProcess()
      /Users/kevin/go/pkg/mod/ergo.services/[email protected]/node/node.go:1823 +0x6a8
  ergo.services/ergo/node.(*process).run.func1()
      /Users/kevin/go/pkg/mod/ergo.services/[email protected]/node/process.go:1701 +0x638

Previous write at 0x00c0000008a0 by goroutine 233:
  ergo.services/ergo/node.(*application).stop()
      /Users/kevin/go/pkg/mod/ergo.services/[email protected]/node/application.go:135 +0x178
  ergo.services/ergo/node.(*node).stop.func1()
      /Users/kevin/go/pkg/mod/ergo.services/[email protected]/node/node.go:803 +0x2c0
  ...
==================

Root Cause Analysis

The race occurs because of a check-then-act pattern without proper synchronization:

Flow Leading to Race

  1. node.Stop() is called
  2. node.stop() iterates over applications via sync.Map.Range()
  3. For each application, application.stop() is called
  4. stop() sends TerminateReasonShutdown to all child processes
  5. Child processes terminate and trigger application.terminate() via unregisterProcess()
  6. RACE: stop() is still modifying application state while terminate() reads it

Specific Code Paths

Writer (application.stop - line ~135):

func (a *application) stop(force bool, timeout time.Duration) error {
    // ...
    a.mode = gen.ApplicationModeTemporary  // WRITE
    a.group.Range(func(pid gen.PID, _ bool) bool {
        if force {
            a.node.Kill(pid)
        } else {
            a.node.SendExit(pid, gen.TerminateReasonShutdown)  // Triggers terminate()
        }
        return true
    })
    a.reason = reason  // WRITE
    // ...
}

Reader (application.terminate - line ~192):

func (a *application) terminate(p *process, reason error) {
    // Called from a different goroutine when process dies
    state := atomic.LoadInt32(&a.state)  // READ
    // ...
    mode := a.mode  // READ - races with write in stop()
    // ...
}

Why Atomic Operations Aren't Enough

The code uses atomic.CompareAndSwapInt32 for state transitions, but:

  1. Compound operations aren't atomic: Checking state, then iterating group, then modifying mode/reason spans multiple operations
  2. Multiple fields accessed: state, mode, reason, group are all involved
  3. Cross-goroutine access: terminate() runs in process goroutines while stop() runs in the caller's goroutine (see the standalone sketch after this list)
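
This pattern can be reproduced outside the framework. Below is a minimal, standalone sketch (the type and field names are illustrative, not Ergo's) that the race detector flags for the same reason; run it with go test -race -run TestCheckThenAct:

package sketch

import (
    "sync"
    "sync/atomic"
    "testing"
)

// app mimics the relevant shape of node.application: an atomically managed
// state flag plus a plain field written by stop() and read by terminate()
// from another goroutine.
type app struct {
    state int32 // accessed only via sync/atomic
    mode  int   // plain field: written by stop(), read by terminate()
}

func (a *app) stop() {
    // The CAS itself is atomic, but it neither protects nor orders the
    // plain write to a.mode that follows it.
    if !atomic.CompareAndSwapInt32(&a.state, 0, 1) {
        return
    }
    a.mode = 2 // racy write
}

func (a *app) terminate() int {
    _ = atomic.LoadInt32(&a.state) // atomic read: fine
    return a.mode                  // racy read: no happens-before with stop()
}

func TestCheckThenAct(t *testing.T) {
    a := &app{}
    var wg sync.WaitGroup
    wg.Add(2)
    go func() { defer wg.Done(); a.stop() }()
    go func() { defer wg.Done(); _ = a.terminate() }()
    wg.Wait()
}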

Suggested Fix

Add a mutex to protect compound operations:

type application struct {
    mu       sync.Mutex  // ADD: Protects compound state operations

    spec     gen.ApplicationSpec
    node     *node
    behavior gen.ApplicationBehavior
    group    lib.Map[gen.PID, bool]
    mode     gen.ApplicationMode
    state    int32
    stopped  chan struct{}
    started  int64
    parent   gen.Atom
    reason   error
}

func (a *application) stop(force bool, timeout time.Duration) error {
    a.mu.Lock()  // ADD

    if !atomic.CompareAndSwapInt32(&a.state,
        int32(gen.ApplicationStateRunning),
        int32(gen.ApplicationStateStopping)) {
        a.mu.Unlock()  // ADD
        return gen.ErrApplicationStopping
    }

    a.mode = gen.ApplicationModeTemporary

    var reason error
    if force {
        reason = gen.TerminateReasonKill
    } else {
        reason = gen.TerminateReasonShutdown
    }

    a.group.Range(func(pid gen.PID, _ bool) bool {
        if force {
            a.node.Kill(pid)
        } else {
            a.node.SendExit(pid, gen.TerminateReasonShutdown)
        }
        return true
    })

    a.reason = reason
    a.mu.Unlock()  // ADD

    // Wait for completion...
}

func (a *application) terminate(p *process, reason error) {
    a.mu.Lock()  // ADD
    defer a.mu.Unlock()  // ADD

    // Now safe to access a.mode, a.state, etc.
    // ...
}
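
For comparison, here is the standalone sketch from above with both sides guarded by a single mutex (names remain illustrative); with this change go test -race no longer reports the race:

package sketch

import (
    "sync"
    "sync/atomic"
    "testing"
)

type guardedApp struct {
    mu    sync.Mutex
    state int32
    mode  int
}

func (a *guardedApp) stop() {
    a.mu.Lock()
    defer a.mu.Unlock()
    if !atomic.CompareAndSwapInt32(&a.state, 0, 1) {
        return
    }
    a.mode = 2 // now protected by a.mu
}

func (a *guardedApp) terminate() int {
    a.mu.Lock()
    defer a.mu.Unlock()
    return a.mode // the shared lock establishes the missing happens-before
}

func TestCheckThenActGuarded(t *testing.T) {
    a := &guardedApp{}
    var wg sync.WaitGroup
    wg.Add(2)
    go func() { defer wg.Done(); a.stop() }()
    go func() { defer wg.Done(); _ = a.terminate() }()
    wg.Wait()
}

The essential point is that the writer (stop()) and the reader (terminate()) take the same lock, which is what the suggested fix above does on the application struct.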

Alternative: Use RWMutex for Better Performance

type application struct {
    mu sync.RWMutex  // Writers: stop(), Readers: terminate() status checks
    // ...
}

func (a *application) terminate(p *process, reason error) {
    a.mu.RLock()
    mode := a.mode
    state := atomic.LoadInt32(&a.state)
    a.mu.RUnlock()

    // Use local copies...
}

Impact

  • Severity: Medium - can cause undefined behavior during shutdown
  • Frequency: Occurs on every node shutdown with active applications
  • Workaround: Use node.StopForce() instead of node.Stop(), but this has its own race with spawn()

Additional Race: spawn() vs StopForce()

There's a related race between node.spawn() and node.StopForce():

WARNING: DATA RACE
Write at 0x00c0003ae398 by goroutine 53:
  sync/atomic.StoreInt64()
  ergo.services/ergo/node.(*node).StopForce()
      node/node.go:785 +0x2c

Previous read at 0x00c0003ae398 by goroutine 75:
  ergo.services/ergo/node.(*node).spawn()
      node/node.go:1624 +0xfa8

This occurs when a supervisor tries to restart a child during shutdown. The fix should check for shutdown state before spawning:

func (n *node) spawn(...) (...) {
    if atomic.LoadInt64(&n.creation) == 0 {
        return ..., ErrNodeTerminated
    }
    // ...
}
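
The same pattern in isolation (again with illustrative names): storing a value with sync/atomic in one goroutine while another goroutine reads it with a plain load is still a data race, so the check in spawn() needs atomic.LoadInt64 as sketched above. A minimal reproduction:

package sketch

import (
    "sync"
    "sync/atomic"
    "testing"
)

// fakeNode stands in for node: creation is written atomically by
// stopForce() but read with a plain load by spawnRacy().
type fakeNode struct {
    creation int64
}

func (n *fakeNode) stopForce() {
    atomic.StoreInt64(&n.creation, 0) // mirrors the write in node.StopForce()
}

func (n *fakeNode) spawnRacy() bool {
    return n.creation != 0 // plain read: flagged by the race detector
}

func (n *fakeNode) spawnFixed() bool {
    return atomic.LoadInt64(&n.creation) != 0 // atomic read: no race
}

func TestSpawnVsStopForce(t *testing.T) {
    n := &fakeNode{creation: 1}
    var wg sync.WaitGroup
    wg.Add(2)
    go func() { defer wg.Done(); n.stopForce() }()
    go func() { defer wg.Done(); _ = n.spawnRacy() }() // swap in spawnFixed() to clear the report
    wg.Wait()
}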

Contact

  • Reporter: VPNHoodErgo project
  • Date: 2026-01-06
