Data Race in Application Shutdown: stop() vs terminate() #246

@wayahead

Description

Bug Report: Data Race in Application Shutdown

Summary

There is a data race between application.stop() and application.terminate() during node shutdown. The race occurs because both methods access shared application state concurrently without proper synchronization.

Environment

  • Ergo Framework Version: v1.999.310 (also reproduced in v320 branch commit 6914fac)
  • Go Version: 1.25.5
  • OS: macOS Darwin 25.2.0 (arm64), also reproducible on Linux
  • Race Detector: go test -race

Minimal Reproduction

package main

import (
	"testing"
	"time"

	"ergo.services/ergo"
	"ergo.services/ergo/act"
	"ergo.services/ergo/gen"
)

// SimpleActor is a minimal actor for testing
type SimpleActor struct {
	act.Actor
}

func (a *SimpleActor) Init(args ...any) error {
	return nil
}

func CreateSimpleActor() gen.ProcessBehavior {
	return &SimpleActor{}
}

// SimpleApp is a minimal application
type SimpleApp struct{}

func (a *SimpleApp) Load(node gen.Node, args ...any) (gen.ApplicationSpec, error) {
	return gen.ApplicationSpec{
		Name: "simple_app",
		Mode: gen.ApplicationModeTransient,
		Group: []gen.ApplicationMemberSpec{
			{
				Name:    "simple_sup",
				Factory: createSimpleSupervisor,
			},
		},
	}, nil
}

func (a *SimpleApp) Start(mode gen.ApplicationMode) {}
func (a *SimpleApp) Terminate(reason error)         {}

func createSimpleSupervisor() gen.ProcessBehavior {
	return &SimpleSupervisor{}
}

type SimpleSupervisor struct {
	act.Supervisor
}

func (s *SimpleSupervisor) Init(args ...any) (act.SupervisorSpec, error) {
	return act.SupervisorSpec{
		Type: act.SupervisorTypeOneForOne,
		Children: []act.SupervisorChildSpec{
			{Name: "worker1", Factory: CreateSimpleActor},
			{Name: "worker2", Factory: CreateSimpleActor},
			{Name: "worker3", Factory: CreateSimpleActor},
		},
	}, nil
}

func TestRaceCondition(t *testing.T) {
	for i := 0; i < 10; i++ {
		node, err := ergo.StartNode("test@localhost", gen.NodeOptions{
			Applications: []gen.ApplicationBehavior{&SimpleApp{}},
			Network: gen.NetworkOptions{
				Mode: gen.NetworkModeDisabled,
			},
		})
		if err != nil {
			t.Fatalf("failed to start node: %v", err)
		}

		// Let the app initialize
		time.Sleep(100 * time.Millisecond)

		// This triggers the race
		node.Stop()
		node.Wait()
	}
}

Run with: go test -race -run TestRaceCondition

Stack Trace

==================
WARNING: DATA RACE
Read at 0x00c0000008a0 by goroutine 262:
  ergo.services/ergo/node.(*application).terminate()
      /Users/kevin/go/pkg/mod/ergo.services/[email protected]/node/application.go:192 +0x638
  ergo.services/ergo/node.(*node).unregisterProcess()
      /Users/kevin/go/pkg/mod/ergo.services/[email protected]/node/node.go:1823 +0x6a8
  ergo.services/ergo/node.(*process).run.func1()
      /Users/kevin/go/pkg/mod/ergo.services/[email protected]/node/process.go:1701 +0x638

Previous write at 0x00c0000008a0 by goroutine 233:
  ergo.services/ergo/node.(*application).stop()
      /Users/kevin/go/pkg/mod/ergo.services/[email protected]/node/application.go:135 +0x178
  ergo.services/ergo/node.(*node).stop.func1()
      /Users/kevin/go/pkg/mod/ergo.services/[email protected]/node/node.go:803 +0x2c0
  ...
==================

Root Cause Analysis

The race occurs because of a check-then-act pattern without proper synchronization:

Flow Leading to Race

  1. node.Stop() is called
  2. node.stop() iterates over applications via sync.Map.Range()
  3. For each application, application.stop() is called
  4. stop() sends TerminateReasonShutdown to all child processes
  5. Child processes terminate and trigger application.terminate() via unregisterProcess()
  6. RACE: stop() is still modifying application state while terminate() reads it

Specific Code Paths

Writer (application.stop - line ~135):

func (a *application) stop(force bool, timeout time.Duration) error {
    // ...
    a.mode = gen.ApplicationModeTemporary  // WRITE
    a.group.Range(func(pid gen.PID, _ bool) bool {
        if force {
            a.node.Kill(pid)
        } else {
            a.node.SendExit(pid, gen.TerminateReasonShutdown)  // Triggers terminate()
        }
        return true
    })
    a.reason = reason  // WRITE
    // ...
}

Reader (application.terminate - line ~192):

func (a *application) terminate(p *process, reason error) {
    // Called from a different goroutine when process dies
    state := atomic.LoadInt32(&a.state)  // READ
    // ...
    mode := a.mode  // READ - races with write in stop()
    // ...
}

Why Atomic Operations Aren't Enough

The code uses atomic.CompareAndSwapInt32 for state transitions, but:

  1. Compound operations aren't atomic: Checking state, then iterating group, then modifying mode/reason spans multiple operations
  2. Multiple fields accessed: state, mode, reason, group are all involved
  3. Cross-goroutine access: terminate() runs in process goroutines while stop() runs in the caller's goroutine (see the standalone sketch after this list)
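
This pattern can be reproduced outside the framework. Below is a minimal, standalone sketch (the type and field names are illustrative, not Ergo's) that the race detector flags for the same reason; run it with go test -race -run TestCheckThenAct:

package sketch

import (
    "sync"
    "sync/atomic"
    "testing"
)

// app mimics the relevant shape of node.application: an atomically managed
// state flag plus a plain field written by stop() and read by terminate()
// from another goroutine.
type app struct {
    state int32 // accessed only via sync/atomic
    mode  int   // plain field: written by stop(), read by terminate()
}

func (a *app) stop() {
    // The CAS itself is atomic, but it neither protects nor orders the
    // plain write to a.mode that follows it.
    if !atomic.CompareAndSwapInt32(&a.state, 0, 1) {
        return
    }
    a.mode = 2 // racy write
}

func (a *app) terminate() int {
    _ = atomic.LoadInt32(&a.state) // atomic read: fine
    return a.mode                  // racy read: no happens-before with stop()
}

func TestCheckThenAct(t *testing.T) {
    a := &app{}
    var wg sync.WaitGroup
    wg.Add(2)
    go func() { defer wg.Done(); a.stop() }()
    go func() { defer wg.Done(); _ = a.terminate() }()
    wg.Wait()
}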

Suggested Fix

Add a mutex to protect compound operations:

type application struct {
    mu       sync.Mutex  // ADD: Protects compound state operations

    spec     gen.ApplicationSpec
    node     *node
    behavior gen.ApplicationBehavior
    group    lib.Map[gen.PID, bool]
    mode     gen.ApplicationMode
    state    int32
    stopped  chan struct{}
    started  int64
    parent   gen.Atom
    reason   error
}

func (a *application) stop(force bool, timeout time.Duration) error {
    a.mu.Lock()  // ADD

    if !atomic.CompareAndSwapInt32(&a.state,
        int32(gen.ApplicationStateRunning),
        int32(gen.ApplicationStateStopping)) {
        a.mu.Unlock()  // ADD
        return gen.ErrApplicationStopping
    }

    a.mode = gen.ApplicationModeTemporary

    var reason error
    if force {
        reason = gen.TerminateReasonKill
    } else {
        reason = gen.TerminateReasonShutdown
    }

    a.group.Range(func(pid gen.PID, _ bool) bool {
        if force {
            a.node.Kill(pid)
        } else {
            a.node.SendExit(pid, gen.TerminateReasonShutdown)
        }
        return true
    })

    a.reason = reason
    a.mu.Unlock()  // ADD

    // Wait for completion...
}

func (a *application) terminate(p *process, reason error) {
    a.mu.Lock()  // ADD
    defer a.mu.Unlock()  // ADD

    // Now safe to access a.mode, a.state, etc.
    // ...
}
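
For comparison, here is the standalone sketch from above with both sides guarded by a single mutex (names remain illustrative); with this change go test -race no longer reports the race:

package sketch

import (
    "sync"
    "sync/atomic"
    "testing"
)

type guardedApp struct {
    mu    sync.Mutex
    state int32
    mode  int
}

func (a *guardedApp) stop() {
    a.mu.Lock()
    defer a.mu.Unlock()
    if !atomic.CompareAndSwapInt32(&a.state, 0, 1) {
        return
    }
    a.mode = 2 // now protected by a.mu
}

func (a *guardedApp) terminate() int {
    a.mu.Lock()
    defer a.mu.Unlock()
    return a.mode // the shared lock establishes the missing happens-before
}

func TestCheckThenActGuarded(t *testing.T) {
    a := &guardedApp{}
    var wg sync.WaitGroup
    wg.Add(2)
    go func() { defer wg.Done(); a.stop() }()
    go func() { defer wg.Done(); _ = a.terminate() }()
    wg.Wait()
}

The essential point is that the writer (stop()) and the reader (terminate()) take the same lock, which is what the suggested fix above does on the application struct.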

Alternative: Use RWMutex for Better Performance

type application struct {
    mu sync.RWMutex  // Writers: stop(), Readers: terminate() status checks
    // ...
}

func (a *application) terminate(p *process, reason error) {
    a.mu.RLock()
    mode := a.mode
    state := atomic.LoadInt32(&a.state)
    a.mu.RUnlock()

    // Use local copies...
}

Impact

  • Severity: Medium - can cause undefined behavior during shutdown
  • Frequency: Occurs on every node shutdown with active applications
  • Workaround: Use node.StopForce() instead of node.Stop(), but this has its own race with spawn()

Additional Race: spawn() vs StopForce()

There's a related race between node.spawn() and node.StopForce():

WARNING: DATA RACE
Write at 0x00c0003ae398 by goroutine 53:
  sync/atomic.StoreInt64()
  ergo.services/ergo/node.(*node).StopForce()
      node/node.go:785 +0x2c

Previous read at 0x00c0003ae398 by goroutine 75:
  ergo.services/ergo/node.(*node).spawn()
      node/node.go:1624 +0xfa8

This occurs when a supervisor tries to restart a child during shutdown. The fix should check for shutdown state before spawning:

func (n *node) spawn(...) (...) {
    if atomic.LoadInt64(&n.creation) == 0 {
        return ..., ErrNodeTerminated
    }
    // ...
}
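
The same pattern in isolation (again with illustrative names): storing a value with sync/atomic in one goroutine while another goroutine reads it with a plain load is still a data race, so the check in spawn() needs atomic.LoadInt64 as sketched above. A minimal reproduction:

package sketch

import (
    "sync"
    "sync/atomic"
    "testing"
)

// fakeNode stands in for node: creation is written atomically by
// stopForce() but read with a plain load by spawnRacy().
type fakeNode struct {
    creation int64
}

func (n *fakeNode) stopForce() {
    atomic.StoreInt64(&n.creation, 0) // mirrors the write in node.StopForce()
}

func (n *fakeNode) spawnRacy() bool {
    return n.creation != 0 // plain read: flagged by the race detector
}

func (n *fakeNode) spawnFixed() bool {
    return atomic.LoadInt64(&n.creation) != 0 // atomic read: no race
}

func TestSpawnVsStopForce(t *testing.T) {
    n := &fakeNode{creation: 1}
    var wg sync.WaitGroup
    wg.Add(2)
    go func() { defer wg.Done(); n.stopForce() }()
    go func() { defer wg.Done(); _ = n.spawnRacy() }() // swap in spawnFixed() to clear the report
    wg.Wait()
}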

Contact

  • Reporter: VPNHoodErgo project
  • Date: 2026-01-06
