Bug Report: Data Race in Application Shutdown
Summary
There is a data race between application.stop() and application.terminate() during node shutdown. The race occurs because both methods access shared application state concurrently without proper synchronization.
Environment
- Ergo Framework Version: v1.999.310 (also reproduced in v320 branch commit 6914fac)
- Go Version: 1.25.5
- OS: macOS Darwin 25.2.0 (arm64), also reproducible on Linux
- Race Detector: go test -race
Minimal Reproduction
package main

import (
    "testing"
    "time"

    "ergo.services/ergo"
    "ergo.services/ergo/act"
    "ergo.services/ergo/gen"
)

// SimpleActor is a minimal actor for testing
type SimpleActor struct {
    act.Actor
}

func (a *SimpleActor) Init(args ...any) error {
    return nil
}

func CreateSimpleActor() gen.ProcessBehavior {
    return &SimpleActor{}
}

// SimpleApp is a minimal application
type SimpleApp struct{}

func (a *SimpleApp) Load(node gen.Node, args ...any) (gen.ApplicationSpec, error) {
    return gen.ApplicationSpec{
        Name: "simple_app",
        Mode: gen.ApplicationModeTransient,
        Group: []gen.ApplicationMemberSpec{
            {
                Name:    "simple_sup",
                Factory: createSimpleSupervisor,
            },
        },
    }, nil
}

func (a *SimpleApp) Start(mode gen.ApplicationMode) {}
func (a *SimpleApp) Terminate(reason error)         {}

func createSimpleSupervisor() gen.ProcessBehavior {
    return &SimpleSupervisor{}
}

type SimpleSupervisor struct {
    act.Supervisor
}

func (s *SimpleSupervisor) Init(args ...any) (act.SupervisorSpec, error) {
    return act.SupervisorSpec{
        Type: act.SupervisorTypeOneForOne,
        Children: []act.SupervisorChildSpec{
            {Name: "worker1", Factory: CreateSimpleActor},
            {Name: "worker2", Factory: CreateSimpleActor},
            {Name: "worker3", Factory: CreateSimpleActor},
        },
    }, nil
}

func TestRaceCondition(t *testing.T) {
    for i := 0; i < 10; i++ {
        node, err := ergo.StartNode("test@localhost", gen.NodeOptions{
            Applications: []gen.ApplicationBehavior{&SimpleApp{}},
            Network: gen.NetworkOptions{
                Mode: gen.NetworkModeDisabled,
            },
        })
        if err != nil {
            t.Fatalf("failed to start node: %v", err)
        }

        // Let the app initialize
        time.Sleep(100 * time.Millisecond)

        // This triggers the race
        node.Stop()
        node.Wait()
    }
}

Run with: go test -race -run TestRaceCondition
Stack Trace
==================
WARNING: DATA RACE
Read at 0x00c0000008a0 by goroutine 262:
ergo.services/ergo/node.(*application).terminate()
/Users/kevin/go/pkg/mod/ergo.services/[email protected]/node/application.go:192 +0x638
ergo.services/ergo/node.(*node).unregisterProcess()
/Users/kevin/go/pkg/mod/ergo.services/[email protected]/node/node.go:1823 +0x6a8
ergo.services/ergo/node.(*process).run.func1()
/Users/kevin/go/pkg/mod/ergo.services/[email protected]/node/process.go:1701 +0x638
Previous write at 0x00c0000008a0 by goroutine 233:
ergo.services/ergo/node.(*application).stop()
/Users/kevin/go/pkg/mod/ergo.services/[email protected]/node/application.go:135 +0x178
ergo.services/ergo/node.(*node).stop.func1()
/Users/kevin/go/pkg/mod/ergo.services/[email protected]/node/node.go:803 +0x2c0
...
==================
Root Cause Analysis
The race occurs because of a check-then-act pattern without proper synchronization:
Flow Leading to Race
1. node.Stop() is called
2. node.stop() iterates over applications via sync.Map.Range()
3. For each application, application.stop() is called
4. stop() sends TerminateReasonShutdown to all child processes
5. Child processes terminate and trigger application.terminate() via unregisterProcess()
6. RACE: stop() is still modifying application state while terminate() reads it
Specific Code Paths
Writer (application.stop - line ~135):
func (a *application) stop(force bool, timeout time.Duration) error {
    // ...
    a.mode = gen.ApplicationModeTemporary // WRITE
    a.group.Range(func(pid gen.PID, _ bool) bool {
        if force {
            a.node.Kill(pid)
        } else {
            a.node.SendExit(pid, gen.TerminateReasonShutdown) // Triggers terminate()
        }
        return true
    })
    a.reason = reason // WRITE
    // ...
}

Reader (application.terminate - line ~192):

func (a *application) terminate(p *process, reason error) {
    // Called from a different goroutine when the process dies
    state := atomic.LoadInt32(&a.state) // READ
    // ...
    mode := a.mode // READ - races with the write in stop()
    // ...
}

Why Atomic Operations Aren't Enough
The code uses atomic.CompareAndSwapInt32 for state transitions, but:
- Compound operations aren't atomic: checking the state, then iterating the group, then modifying mode/reason spans multiple operations
- Multiple fields are accessed: state, mode, reason, and group are all involved
- Cross-goroutine access: terminate() runs in process goroutines while stop() runs in the caller's goroutine
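To illustrate the last two points outside the framework, here is a minimal, self-contained sketch (hypothetical struct and field names, not the framework's actual types): a CAS on one field gives no protection to the plain fields written right after it, and go run -race typically reports the access to mode.

package main

import (
    "sync"
    "sync/atomic"
)

// app mirrors the problematic pattern: state is accessed atomically,
// but mode and reason are plain fields touched from other goroutines.
type app struct {
    state  int32
    mode   int
    reason error
}

func main() {
    a := &app{state: 1}
    var wg sync.WaitGroup
    wg.Add(2)

    // "stop": CAS on state, then unsynchronized writes to mode/reason.
    go func() {
        defer wg.Done()
        if atomic.CompareAndSwapInt32(&a.state, 1, 2) {
            a.mode = 3     // plain write
            a.reason = nil // plain write
        }
    }()

    // "terminate": atomic read of state, then a plain read of mode.
    go func() {
        defer wg.Done()
        _ = atomic.LoadInt32(&a.state)
        _ = a.mode // races with the write above
    }()

    wg.Wait()
}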
Suggested Fix
Add a mutex to protect compound operations:
type application struct {
    mu       sync.Mutex // ADD: protects compound state operations
    spec     gen.ApplicationSpec
    node     *node
    behavior gen.ApplicationBehavior
    group    lib.Map[gen.PID, bool]
    mode     gen.ApplicationMode
    state    int32
    stopped  chan struct{}
    started  int64
    parent   gen.Atom
    reason   error
}

func (a *application) stop(force bool, timeout time.Duration) error {
    a.mu.Lock() // ADD
    if !atomic.CompareAndSwapInt32(&a.state,
        int32(gen.ApplicationStateRunning),
        int32(gen.ApplicationStateStopping)) {
        a.mu.Unlock() // ADD
        return gen.ErrApplicationStopping
    }

    a.mode = gen.ApplicationModeTemporary

    var reason error
    if force {
        reason = gen.TerminateReasonKill
    } else {
        reason = gen.TerminateReasonShutdown
    }

    a.group.Range(func(pid gen.PID, _ bool) bool {
        if force {
            a.node.Kill(pid)
        } else {
            a.node.SendExit(pid, gen.TerminateReasonShutdown)
        }
        return true
    })

    a.reason = reason
    a.mu.Unlock() // ADD

    // Wait for completion...
}

func (a *application) terminate(p *process, reason error) {
    a.mu.Lock()         // ADD
    defer a.mu.Unlock() // ADD
    // Now safe to access a.mode, a.state, etc.
    // ...
}

Alternative: Use RWMutex for Better Performance
type application struct {
    mu sync.RWMutex // writers: stop(); readers: terminate() status checks
    // ...
}

func (a *application) terminate(p *process, reason error) {
    a.mu.RLock()
    mode := a.mode
    state := atomic.LoadInt32(&a.state)
    a.mu.RUnlock()
    // Use local copies...
}

Impact
- Severity: Medium - can cause undefined behavior during shutdown
- Frequency: Occurs on every node shutdown with active applications
- Workaround: Use node.StopForce() instead of node.Stop() (see the sketch below), but this has its own race with spawn()
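A sketch of the workaround, assuming the same node value as in the TestRaceCondition loop above: the graceful stop is replaced by a forced one.

// Workaround sketch: force-stop the node instead of stopping it gracefully.
// This avoids the stop()/terminate() race above, but see the spawn() vs
// StopForce() race described in the next section.
node.StopForce()
node.Wait()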
Additional Race: spawn() vs StopForce()
There's a related race between node.spawn() and node.StopForce():
WARNING: DATA RACE
Write at 0x00c0003ae398 by goroutine 53:
sync/atomic.StoreInt64()
ergo.services/ergo/node.(*node).StopForce()
node/node.go:785 +0x2c
Previous read at 0x00c0003ae398 by goroutine 75:
ergo.services/ergo/node.(*node).spawn()
node/node.go:1624 +0xfa8
This occurs when a supervisor tries to restart a child during shutdown. The fix should check for shutdown state before spawning:
func (n *node) spawn(...) (...) {
    if atomic.LoadInt64(&n.creation) == 0 {
        return ..., ErrNodeTerminated
    }
    // ...
}

References
- Go Race Detector: https://go.dev/doc/articles/race_detector
- Related discussion: Data races in concurrent state machines
Contact
- Reporter: VPNHoodErgo project
- Date: 2026-01-06