
Commit 54a7558

DAOS-17427 control: Restart excluded rank after suicide (#16279) (#18422)
When an engine detects, via a CART event, that it has been removed from the system group map, it now notifies its local control plane with a RAS engine_self_terminated event before terminating its own process. On receiving this event, the local control plane restarts the engine so that it can rejoin the system.

The goal of this change is to improve overall system resilience by automatically recovering engines that were excluded because of temporary issues such as network instability. Once an engine rejoins, its rank still needs to be reintegrated into pools as a separate follow-up step.

Rate-limiting prevents restart storms: a configurable minimum delay (default 300 seconds) is enforced between restarts of the same rank.

Two new server config file parameters control the behavior: disable_engine_auto_restart (boolean, default false) disables automatic restarts entirely, and engine_auto_restart_min_delay (integer seconds) sets the minimum time between consecutive restart attempts.

Functional tests for the automatic engine restart feature are included, with cases that verify disabling, rate-limiting and configuration support.

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
1 parent 6651cba commit 54a7558

32 files changed

Lines changed: 2750 additions & 50 deletions

docs/admin/administration.md

Lines changed: 99 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -49,6 +49,8 @@ severity, message, description, and cause.
4949
| engine\_died| STATE\_CHANGE| ERROR| DAOS engine <idx\> exited unexpectedly: <error\> | Indicates engine instance <idx\> exited unexpectedly. <error\> describes the exit state returned from the exited daos\_engine process.| N/A |
5050
| engine\_asserted| STATE\_CHANGE| ERROR| TBD| Indicates engine instance <idx\> threw a runtime assertion, causing a crash. | An unexpected internal state resulted in assert failure. |
5151
| engine\_clock\_drift| INFO\_ONLY | ERROR| clock drift detected| Indicates CART comms layer has detected clock skew between engines.| NTP may not be syncing clocks across DAOS system. |
52+
| engine\_self\_terminated| INFO\_ONLY| NOTICE| excluded rank self terminated detected| Indicates that a DAOS engine rank has performed a self-termination due to having been excluded from the system's group map. The rank is automatically restarted by the control plane with rate-limiting (default: 5 minute minimum delay between restarts per rank) to prevent restart storms. | An engine was found to be in a transient non-functional state and excluded from the group map. The control plane monitors for this event and automatically restarts the affected engine so it can rejoin the system. Restarts are rate-limited per rank using the `engine_auto_restart_min_delay` configuration parameter. |
53+
| engine\_join\_failed| INFO\_ONLY| ERROR | DAOS engine <idx\> (rank <rank\>) was not allowed to join the system | Join operation failed for the given engine instance ID and rank (if assigned). | Reason should be provided in the extended info field of the event data. |
5254
| pool\_corruption\_detected| INFO\_ONLY| ERROR | Data corruption detected| Indicates a corruption in pool data has been detected. The event fields will contain pool and container UUIDs. | A corruption was found by the checksum scrubber. |
5355
| pool\_rebuild\_started| INFO\_ONLY| NOTICE | Pool rebuild started.| Indicates a pool rebuild has started. The event data field contains pool map version and pool operation identifier. | When a pool rank becomes unavailable a rebuild will be triggered. |
5456
| pool\_rebuild\_finished| INFO\_ONLY| NOTICE| Pool rebuild finished.| Indicates a pool rebuild has finished successfully. The event data field includes the pool map version and pool operation identifier. | N/A|
@@ -69,7 +71,6 @@ severity, message, description, and cause.
6971
| device\_plugged| INFO\_ONLY| NOTICE| Detected hot plugged device: <bdev-name\> | Indicates device was physically inserted into host. | NVMe SSD physically added to host. |
7072
| device\_replace| INFO\_ONLY| NOTICE or ERROR| Replaced device: <uuid\> with device: <uuid\> [failed: <rc\>] | Indicates that a faulty device was replaced with a new device and if the operation failed. The old and new device IDs as well as any non-zero return code are specified in the event data. | Device was replaced using DMG nvme replace command. |
7173
| system\_fabric\_provider\_changed| INFO\_ONLY| NOTICE| System fabric provider has changed: <old-provider\> -> <new-provider\>| Indicates that the system-wide fabric provider has been updated. No other specific information is included in event data.| A system-wide fabric provider change has been intentionally applied to all joined ranks.|
72-
| engine\_join\_failed| INFO\_ONLY| ERROR | DAOS engine <idx\> (rank <rank\>) was not allowed to join the system | Join operation failed for the given engine instance ID and rank (if assigned). | Reason should be provided in the extended info field of the event data. |
7374
| device\_link\_speed\_changed| INFO\_ONLY| NOTICE or WARNING| NVMe PCIe device at <pci-address\> port-<idx\>: link speed changed to <transfer-rate\> (max <transfer-rate\>)| Indicates that an NVMe device link speed has changed. The negotiated and maximum device link speeds are indicated in the event message field and the severity is set to warning if the negotiated speed is not at maximum capability (and notice level severity if at maximum). No other specific information is included in the event data.| Either device link speed was previously downgraded and has returned to maximum or link speed has downgraded to a value that is less than its maximum capability.|
7475
| device\_link\_width\_changed| INFO\_ONLY| NOTICE or WARNING| NVMe PCIe device at <pci-address\> port-<idx\>: link width changed to <pcie-link-lanes\> (max <pcie-link-lanes\>)| Indicates that an NVMe device link width has changed. The negotiated and maximum device link widths are indicated in the event message field and the severity is set to warning if the negotiated width is not at maximum capability (and notice level severity if at maximum). No other specific information is included in the event data.| Either device link width was previously downgraded and has returned to maximum or link width has downgraded to a value that is less than its maximum capability.|
7576
| device\_led\_set| INFO\_ONLY| NOTICE| LED on device <device\> set to state <state\>| Indicates that the LED state has been changed on a device. Device identifier and LED state are specified in the event message.| LED control command was issued to change device LED state for visual identification or fault indication.|
@@ -1007,6 +1008,94 @@ specified on the command line:
10071008
If the ranks were excluded from pools (e.g., unclean shutdown), they will need to
10081009
be reintegrated. Please see the pool operation section for more information.
10091010
1011+
### Engine Auto-Restart
1012+
1013+
DAOS automatically restarts engines that self-terminate after being excluded from
1014+
the system. This feature improves system availability by recovering from transient
1015+
failures without administrator intervention.
1016+
1017+
#### How It Works
1018+
1019+
When an engine is excluded (e.g., due to network issues detected by SWIM), the
1020+
engine detects the exclusion and performs a self-termination. The control plane
1021+
monitors for these events and automatically restarts the affected engine after
1022+
clearing the exclusion state, allowing it to rejoin the system.
1023+
1024+
The automatic restart includes rate-limiting to prevent restart storms. By default,
1025+
an engine must wait 5 minutes between automatic restarts.
1026+
1027+
#### Configuration
1028+
1029+
Control auto-restart behavior in `daos_server.yml`:
1030+
1031+
```yaml
1032+
# Disable automatic restart (default: enabled)
1033+
disable_engine_auto_restart: false
1034+
1035+
# Minimum delay between automatic restarts per rank (default: 300 seconds)
1036+
engine_auto_restart_min_delay: 300
1037+
```
1038+
1039+
#### Manual Operations
1040+
1041+
Manual `dmg system stop` and `dmg system start` operations are never affected by
1042+
the rate-limiting mechanism. Administrators can always immediately stop and start
1043+
ranks regardless of recent automatic restart activity.
1044+
1045+
```bash
1046+
# Manual operations always work immediately
1047+
$ dmg system stop --ranks=0,1,2
1048+
$ dmg system start --ranks=0,1,2
1049+
```
1050+
1051+
When you manually stop or start ranks, the restart history for those ranks is
1052+
automatically cleared, ensuring no delays from previous automatic restarts.
1053+
1054+
#### Monitoring
1055+
1056+
The `engine_self_terminated` RAS event is logged when an engine self-terminates
1057+
and triggers an automatic restart:
1058+
1059+
```
1060+
&&& RAS EVENT id: [engine_self_terminated] ... msg: [excluded rank self terminated detected]
1061+
```
1062+
1063+
Use `dmg system query` to check rank status and incarnation numbers. The
1064+
incarnation number increments each time a rank restarts, helping track restart
1065+
events:
1066+
1067+
```bash
1068+
$ dmg system query --ranks=0
1069+
Rank UUID Control Address Fault Domain State Reason Incarnation
1070+
---- ---- --------------- ------------- ----- ------ -----------
1071+
0 12345678-1234-1234-1234-123456789012 10.0.0.1:10001 /node1 Joined 3
1072+
```
1073+
1074+
#### Best Practices
1075+
1076+
- **Leave enabled**: Automatic restart improves availability for transient failures
1077+
- **Adjust timing**: For frequent exclusions, consider increasing `engine_auto_restart_min_delay`
1078+
- **Monitor events**: Watch for repeated `engine_self_terminated` events indicating persistent issues
1079+
- **Manual control**: Use `dmg system stop/start` for maintenance without worrying about delays
1080+
1081+
#### Troubleshooting
1082+
1083+
**Problem**: Rank keeps self-terminating and restarting
1084+
1085+
**Solution**: Investigate root cause:
1086+
1. Check network connectivity (SWIM may be detecting real failures)
1087+
2. Review engine logs for errors
1088+
3. Verify hardware health
1089+
4. Consider disabling auto-restart temporarily for investigation
1090+
1091+
**Problem**: Need immediate restart but recently auto-restarted
1092+
1093+
**Solution**: Use manual operations (not affected by rate-limiting):
1094+
```bash
1095+
$ dmg system stop --ranks=X
1096+
$ dmg system start --ranks=X
1097+
```
1098+
10101099
### Storage Reformat
10111100
10121101
To reformat the system after a controlled shutdown, run the command:
@@ -1052,15 +1141,15 @@ the storage server has not changed the old rank can be "reused" by formatting us
10521141
10531142
An example workflow would be:
10541143
1055-
- `daos_server` is running and PMem NVDIMM fails causing an engine to enter excluded state.
1056-
- `daos_server` is stopped, storage server powered down, faulty PMem NVDIMM is replaced.
1057-
- After powering up storage server, `daos_server scm prepare` command is used to repair PMem.
1058-
- Storage server is rebooted after running `daos_server scm prepare` and command is run again.
1059-
- Now PMem is intact, clear with `wipefs -a /dev/pmemX` where "X" refers to the repaired PMem ID.
1060-
- `daos_server` can be started again. On start-up repaired engine prompts for "SCM format required".
1061-
- Run `dmg storage format --replace` to rejoin with existing rank (if --replace isn't used, a new
1062-
rank will be created).
1063-
- Formatted engine will join using the existing (old) rank which is mapped to the engine's hardware.
1144+
1. `daos_server` is running and PMem NVDIMM fails causing an engine to enter excluded state.
1145+
2. `daos_server` is stopped, storage server powered down, faulty PMem NVDIMM is replaced.
1146+
3. After powering up storage server, `daos_server scm prepare` command is used to repair PMem.
1147+
4. Storage server is rebooted after running `daos_server scm prepare` and command is run again.
1148+
5. Now PMem is intact, clear with `wipefs -a /dev/pmemX` where "X" refers to the repaired PMem ID.
1149+
6. `daos_server` can be started again. On start-up repaired engine prompts for "SCM format required".
1150+
7. Run `dmg storage format --replace` to rejoin with existing rank (if --replace isn't used, a new
1151+
rank will be created).
1152+
8. Formatted engine will join using the existing (old) rank which is mapped to the engine's hardware.
10641153
10651154
!!! note
10661155
`dmg storage format --replace` can be used to replace a rank in `AdminExcluded` state. The

docs/overview/fault.md

Lines changed: 65 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -84,3 +84,68 @@ can now read from the rebuilt object shards.
8484

8585
This rebuild process is executed online while applications continue accessing
8686
and updating objects.
87+
88+
### Engine Self-Termination and Automatic Restart
89+
90+
A DAOS engine may be excluded from the group map, for example because of
91+
inactivity. When an engine becomes aware of its removal from the
92+
group map, it self-terminates to protect data integrity and system stability.
93+
94+
When an engine self-terminates, it raises an `engine_self_terminated` RAS event
95+
(INFO_ONLY, NOTICE severity) containing the rank and incarnation information.
96+
The control plane automatically handles this event by:
97+
98+
1. Detecting the `engine_self_terminated` event through the RAS event system
99+
2. Identifying the engine instance associated with the rank
100+
3. Waiting for the engine process to fully stop
101+
4. Automatically restarting the engine to rejoin the system
102+
103+
This automatic restart mechanism is implemented in all control servers to ensure
104+
local engine recovery happens regardless of management service leadership state.
105+
The restarted engine will rejoin the system with a new incarnation number and
106+
resume normal operations.
107+
108+
This self-healing mechanism allows DAOS to automatically recover system
109+
membership state from transient engine failures without administrator
110+
intervention, improving overall system availability.
111+
112+
#### Rate Limiting
113+
114+
To prevent restart storms and ensure system stability, automatic engine restarts
115+
are rate-limited on a per-rank basis. By default, a minimum delay of 300 seconds
116+
(5 minutes) is enforced between consecutive restart attempts for the same rank.
117+
118+
When an engine self-terminates within the minimum delay period, the control plane
119+
schedules a deferred restart that will automatically trigger when the delay expires.
120+
If multiple self-termination events occur for the same rank during the delay period
121+
(which would be unexpected), only the most recent event triggers a deferred restart.
122+
This ensures the engine is restarted exactly once after the delay, regardless of
123+
how many self-termination events occur.
124+
125+
The rate-limiting interval can be customized by setting the
126+
`engine_auto_restart_min_delay` configuration option (in seconds) in the
127+
daos_server.yml file. For example:
128+
129+
```yaml
130+
engine_auto_restart_min_delay: 600 # 10 minutes between restarts
131+
```
132+
133+
This protection mechanism prevents scenarios where:
134+
- Repeated transient failures cause excessive restart cycling
135+
- A misconfigured engine continuously self-terminates
136+
- Cascading failures overwhelm the control plane with restart requests
137+
138+
#### Disabling Automatic Restart
139+
140+
The automatic restart behavior can be completely disabled by setting the
141+
`disable_engine_auto_restart` configuration option to `true` in the
142+
daos_server.yml file:
143+
144+
```yaml
145+
disable_engine_auto_restart: true
146+
```
147+
148+
When auto-restart is disabled, engines that self-terminate will not be
149+
automatically restarted by the control plane, requiring manual intervention
150+
to restart the affected engine instances. This setting may be useful for
151+
debugging scenarios or when custom external restart management is preferred.

src/control/cmd/dmg/auto_test.go

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -606,6 +606,7 @@ mgmt_svc_replicas:
606606
- hostX:10002
607607
fault_cb: ""
608608
hyperthreads: false
609+
disable_engine_auto_restart: false
609610
`
610611
)
611612

src/control/events/ras.go

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -49,6 +49,7 @@ const (
4949
RASUnknownEvent RASID = C.RAS_UNKNOWN_EVENT
5050
RASEngineFormatRequired RASID = C.RAS_ENGINE_FORMAT_REQUIRED // notice
5151
RASEngineDied RASID = C.RAS_ENGINE_DIED // error
52+
RASEngineSelfTerminated RASID = C.RAS_ENGINE_SELF_TERMINATED // notice
5253
RASPoolRepsUpdate RASID = C.RAS_POOL_REPS_UPDATE // info
5354
RASSwimRankAlive RASID = C.RAS_SWIM_RANK_ALIVE // info
5455
RASSwimRankDead RASID = C.RAS_SWIM_RANK_DEAD // info

src/control/lib/control/event.go

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,6 @@
11
//
22
// (C) Copyright 2021-2024 Intel Corporation.
3+
// (C) Copyright 2026 Hewlett Packard Enterprise Development LP
34
//
45
// SPDX-License-Identifier: BSD-2-Clause-Patent
56
//
@@ -170,7 +171,8 @@ func newEventLogger(logBasic logging.Logger, newSyslogger newSysloggerFn) *Event
170171
}
171172

172173
// NewEventLogger returns an initialized EventLogger capable of writing to the
173-
// supplied logger in addition to syslog.
174+
// supplied logger in addition to syslog. Should only be used in production code,
175+
// use MockEventLogger in unit tests.
174176
func NewEventLogger(log logging.Logger) *EventLogger {
175177
return newEventLogger(log, syslog.NewLogger)
176178
}

src/control/lib/control/mocks.go

Lines changed: 9 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
//
22
// (C) Copyright 2020-2024 Intel Corporation.
3-
// (C) Copyright 2025 Hewlett Packard Enterprise Development LP
3+
// (C) Copyright 2025-2026 Hewlett Packard Enterprise Development LP
44
//
55
// SPDX-License-Identifier: BSD-2-Clause-Patent
66
//
@@ -30,6 +30,7 @@ import (
3030
"github.com/daos-stack/daos/src/control/common/test"
3131
"github.com/daos-stack/daos/src/control/lib/hostlist"
3232
"github.com/daos-stack/daos/src/control/lib/ranklist"
33+
"github.com/daos-stack/daos/src/control/logging"
3334
"github.com/daos-stack/daos/src/control/server/config"
3435
"github.com/daos-stack/daos/src/control/server/engine"
3536
"github.com/daos-stack/daos/src/control/server/storage"
@@ -945,3 +946,10 @@ func MockHostFabricMap(t *testing.T, scans ...*MockFabricScan) HostFabricMap {
945946

946947
return hfm
947948
}
949+
950+
// MockEventLogger returns EventLogger reference that has no syslog handlers registered.
951+
func MockEventLogger(logBasic logging.Logger) *EventLogger {
952+
return &EventLogger{
953+
log: logBasic,
954+
}
955+
}

src/control/server/config/server.go

Lines changed: 20 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -98,7 +98,9 @@ type Server struct {
9898
Path string `yaml:"-"` // path to config file
9999

100100
// Behavior flags
101-
AutoFormat bool `yaml:"-"`
101+
AutoFormat bool `yaml:"-"`
102+
DisableEngineAutoRestart bool `yaml:"disable_engine_auto_restart"`
103+
EngineAutoRestartMinDelay int `yaml:"engine_auto_restart_min_delay,omitempty"`
102104

103105
deprecatedParams `yaml:",inline"`
104106
}
@@ -355,6 +357,18 @@ func (cfg *Server) WithTelemetryPort(port int) *Server {
355357
return cfg
356358
}
357359

360+
// WithDisableEngineAutoRestart enables or disables automatic engine restarts on self-termination.
361+
func (cfg *Server) WithDisableEngineAutoRestart(disabled bool) *Server {
362+
cfg.DisableEngineAutoRestart = disabled
363+
return cfg
364+
}
365+
366+
// WithEngineAutoRestartMinDelay sets minimum time between automatic engine restarts.
367+
func (cfg *Server) WithEngineAutoRestartMinDelay(secs uint) *Server {
368+
cfg.EngineAutoRestartMinDelay = int(secs)
369+
return cfg
370+
}
371+
358372
// DefaultServer creates a new instance of configuration struct
359373
// populated with defaults.
360374
func DefaultServer() *Server {
@@ -830,6 +844,11 @@ func (cfg *Server) Validate(log logging.Logger) (err error) {
830844
return FaultConfigSysRsvdZero
831845
}
832846

847+
if cfg.EngineAutoRestartMinDelay < 0 {
848+
return errors.Errorf("engine_auto_restart_min_delay must be >= 0 (got %d)",
849+
cfg.EngineAutoRestartMinDelay)
850+
}
851+
833852
// A config without engines is valid when initially discovering hardware prior to adding
834853
// per-engine sections with device allocations.
835854
if len(cfg.Engines) == 0 {

src/control/server/config/server_test.go

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -266,7 +266,9 @@ func TestServerConfig_Constructed(t *testing.T) {
266266
WithHyperthreads(true). // hyper-threads disabled by default
267267
WithSystemRamReserved(5).
268268
WithAllowNumaImbalance(true).
269-
WithAllowTHP(true)
269+
WithAllowTHP(true).
270+
WithDisableEngineAutoRestart(true).
271+
WithEngineAutoRestartMinDelay(120)
270272

271273
// add engines explicitly to test functionality applied in WithEngines()
272274
constructed.Engines = []*engine.Config{

src/control/server/ctl_check_test.go

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -116,7 +116,7 @@ func TestServer_ControlService_CheckEngineRepair(t *testing.T) {
116116
t.Fatalf("setup error - wrong type for Engine (%T)", e)
117117
}
118118

119-
setupTestEngine(t, srv, uint32(i), rankNums[i])
119+
setupTestEngine(t, srv, rankNums[i])
120120

121121
drpcCfg := new(mockDrpcClientConfig)
122122
drpcCfg.ConnectError = tc.drpcErr

src/control/server/ctl_ranks_rpc.go

Lines changed: 24 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
//
22
// (C) Copyright 2020-2024 Intel Corporation.
3-
// (C) Copyright 2025 Hewlett Packard Enterprise Development LP
3+
// (C) Copyright 2025-2026 Hewlett Packard Enterprise Development LP
44
//
55
// SPDX-License-Identifier: BSD-2-Clause-Patent
66
//
@@ -153,6 +153,21 @@ func (svc *ControlService) memberStateResults(instances []Engine, tgtState syste
153153
return results, nil
154154
}
155155

156+
// clearRankRestartHistory clears restart history for manually stopped or started ranks on this
157+
// server. This prevents rate-limiting from interfering with manual operations and vice versa.
158+
func clearRankRestartHistory(mgr *engineRestartManager, instances []Engine) {
159+
ranks := make([]ranklist.Rank, 0, len(instances))
160+
for _, ei := range instances {
161+
rank, err := ei.GetRank()
162+
if err == nil {
163+
ranks = append(ranks, rank)
164+
}
165+
}
166+
if len(ranks) > 0 {
167+
mgr.clearRankRestartHistory(ranks)
168+
}
169+
}
170+
156171
// StopRanks implements the method defined for the Management Service.
157172
//
158173
// Stop data-plane instance(s) managed by control-plane identified by unique
@@ -206,6 +221,10 @@ func (svc *ControlService) StopRanks(ctx context.Context, req *ctlpb.RanksReq) (
206221
return nil, err
207222
}
208223

224+
// clear state history for stopped ranks, instances have already been filtered by
225+
// FilterInstancesByRankSet() to match req.GetRanks()
226+
clearRankRestartHistory(svc.restartMgr, instances)
227+
209228
return resp, nil
210229
}
211230

@@ -319,6 +338,10 @@ func (svc *ControlService) StartRanks(ctx context.Context, req *ctlpb.RanksReq)
319338
return nil, err
320339
}
321340

341+
// clear state history for started ranks, instances have already been filtered by
342+
// FilterInstancesByRankSet() to match req.GetRanks()
343+
clearRankRestartHistory(svc.restartMgr, instances)
344+
322345
return resp, nil
323346
}
324347
