DAOS-17427 control: Restart excluded rank after suicide (#16279) (#18422)
When an engine detects that it has been removed from the system group
map by receiving a CART event, it will now notify its local control plane
with a RAS engine_self_terminated event before terminating its own
process. After receiving this self-termination event, the local
control plane will restart the engine so it can rejoin the system.
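
The flow described above can be illustrated with a minimal Go sketch. It is not the DAOS implementation: `RASEvent`, `handleEvents`, and the `restart` callback are hypothetical names introduced here, and only the event ID `engine_self_terminated` comes from this change. The sketch assumes the control plane receives events on a channel and reacts only to self-termination events.

```go
package main

import "fmt"

// RASEvent is a hypothetical, simplified stand-in for a RAS event.
type RASEvent struct {
	ID   string // event ID, e.g. "engine_self_terminated"
	Rank uint32 // rank of the engine that raised the event
}

// handleEvents drains the channel and calls restart for every
// self-termination event. It returns the ranks whose restart succeeded.
func handleEvents(events <-chan RASEvent, restart func(rank uint32) error) []uint32 {
	var restarted []uint32
	for ev := range events {
		if ev.ID != "engine_self_terminated" {
			continue // other events do not trigger an automatic restart
		}
		if err := restart(ev.Rank); err != nil {
			fmt.Printf("restart of rank %d failed: %v\n", ev.Rank, err)
			continue
		}
		restarted = append(restarted, ev.Rank)
	}
	return restarted
}

func main() {
	events := make(chan RASEvent, 2)
	events <- RASEvent{ID: "engine_clock_drift", Rank: 1}
	events <- RASEvent{ID: "engine_self_terminated", Rank: 2}
	close(events)

	restarted := handleEvents(events, func(rank uint32) error {
		fmt.Printf("restarting rank %d\n", rank)
		return nil
	})
	fmt.Println(restarted) // only rank 2 is restarted
}
```

Passing the restart action as a callback keeps the event filtering separate from whatever process management the real control plane performs.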
The goal of this change is to improve overall system resilience by
automatically recovering engines that are excluded because of
temporary issues such as network instability. Once an engine rejoins,
its rank still needs to be reintegrated into pools as a separate
follow-up step.
Rate-limiting prevents restart storms: a configurable minimum delay
(default 300 seconds) between restarts per rank ensures system
stability. Two new server config file parameters control behavior:
disable_engine_auto_restart (boolean, default false) completely
disables automatic restarts, while engine_auto_restart_min_delay
(integer seconds) sets the minimum time between consecutive restart
attempts.
Functional tests for the automatic engine restart feature are included,
with cases that verify disabling, rate-limiting, and configuration support.
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
docs/admin/administration.md (99 additions, 10 deletions)
@@ -49,6 +49,8 @@ severity, message, description, and cause.
 | engine\_died| STATE\_CHANGE| ERROR| DAOS engine <idx\> exited exited unexpectedly: <error\>| Indicates engine instance <idx\> unexpectedly. <error> describes the exit state returned from exited daos\_engine process.| N/A |
 | engine\_asserted| STATE\_CHANGE| ERROR| TBD| Indicates engine instance <idx\> threw a runtime assertion, causing a crash. | An unexpected internal state resulted in assert failure. |
 | engine\_clock\_drift| INFO\_ONLY | ERROR| clock drift detected| Indicates CART comms layer has detected clock skew between engines.| NTP may not be syncing clocks across DAOS system. |
+| engine\_self\_terminated| INFO\_ONLY| NOTICE| excluded rank self terminated detected| Indicates that a DAOS engine rank has performed a self-termination due to having been excluded from the system's group map. The rank is automatically restarted by the control plane with rate-limiting (default: 5 minute minimum delay between restarts per rank) to prevent restart storms. | An engine was found to be in a transient non-functional state and excluded from the group map. The control plane monitors for this event and automatically restarts the affected engine so it can rejoin the system. Restarts are rate-limited per rank using the `engine_auto_restart_min_delay` configuration parameter. |
+| engine\_join\_failed| INFO\_ONLY| ERROR | DAOS engine <idx\> (rank <rank\>) was not allowed to join the system | Join operation failed for the given engine instance ID and rank (if assigned). | Reason should be provided in the extended info field of the event data. |
 | pool\_corruption\_detected| INFO\_ONLY| ERROR | Data corruption detected| Indicates a corruption in pool data has been detected. The event fields will contain pool and container UUIDs. | A corruption was found by the checksum scrubber. |
 | pool\_rebuild\_started| INFO\_ONLY| NOTICE | Pool rebuild started.| Indicates a pool rebuild has started. The event data field contains pool map version and pool operation identifier. | When a pool rank becomes unavailable a rebuild will be triggered. |
 | pool\_rebuild\_finished| INFO\_ONLY| NOTICE| Pool rebuild finished.| Indicates a pool rebuild has finished successfully. The event data field includes the pool map version and pool operation identifier. | N/A|
@@ -69,7 +71,6 @@ severity, message, description, and cause.
 | device\_plugged| INFO\_ONLY| NOTICE| Detected hot plugged device: <bdev-name\>| Indicates device was physically inserted into host. | NVMe SSD physically added to host. |
 | device\_replace| INFO\_ONLY| NOTICE or ERROR| Replaced device: <uuid\> with device: <uuid\>[failed: <rc\>]| Indicates that a faulty device was replaced with a new device and if the operation failed. The old and new device IDs as well as any non-zero return code are specified in the event data. | Device was replaced using DMG nvme replace command. |
 | system\_fabric\_provider\_changed| INFO\_ONLY| NOTICE| System fabric provider has changed: <old-provider\> -> <new-provider\>| Indicates that the system-wide fabric provider has been updated. No other specific information is included in event data.| A system-wide fabric provider change has been intentionally applied to all joined ranks.|
-| engine\_join\_failed| INFO\_ONLY| ERROR | DAOS engine <idx\> (rank <rank\>) was not allowed to join the system | Join operation failed for the given engine instance ID and rank (if assigned). | Reason should be provided in the extended info field of the event data. |
 | device\_link\_speed\_changed| INFO\_ONLY| NOTICE or WARNING| NVMe PCIe device at <pci-address\> port-<idx\>: link speed changed to <transfer-rate\> (max <transfer-rate\>)| Indicates that an NVMe device link speed has changed. The negotiated and maximum device link speeds are indicated in the event message field and the severity is set to warning if the negotiated speed is not at maximum capability (and notice level severity if at maximum). No other specific information is included in the event data.| Either device link speed was previously downgraded and has returned to maximum or link speed has downgraded to a value that is less than its maximum capability.|
 | device\_link\_width\_changed| INFO\_ONLY| NOTICE or WARNING| NVMe PCIe device at <pci-address\> port-<idx\>: link width changed to <pcie-link-lanes\> (max <pcie-link-lanes\>)| Indicates that an NVMe device link width has changed. The negotiated and maximum device link widths are indicated in the event message field and the severity is set to warning if the negotiated width is not at maximum capability (and notice level severity if at maximum). No other specific information is included in the event data.| Either device link width was previously downgraded and has returned to maximum or link width has downgraded to a value that is less than its maximum capability.|
 | device\_led\_set| INFO\_ONLY| NOTICE| LED on device <device\> set to state <state\>| Indicates that the LED state has been changed on a device. Device identifier and LED state are specified in the event message.| LED control command was issued to change device LED state for visual identification or fault indication.|
@@ -1007,6 +1008,94 @@ specified on the command line:
 If the ranks were excluded from pools (e.g., unclean shutdown), they will need to
 be reintegrated. Please see the pool operation section for more information.
 
+### Engine Auto-Restart
+
+DAOS automatically restarts engines that self-terminate after being excluded from
+the system. This feature improves system availability by recovering from transient
+failures without administrator intervention.
+
+#### How It Works
+
+When an engine is excluded (e.g., due to network issues detected by SWIM), the
+engine detects the exclusion and performs a self-termination. The control plane
+monitors for these events and automatically restarts the affected engine after
+clearing the exclusion state, allowing it to rejoin the system.
+
+The automatic restart includes rate-limiting to prevent restart storms. By default,
+an engine must wait 5 minutes between automatic restarts.
+
+#### Configuration
+
+Control auto-restart behavior in `daos_server.yml`:
+
+```yaml
+# Disable automatic restart (default: enabled)
+disable_engine_auto_restart: false
+
+# Minimum delay between automatic restarts per rank (default: 300 seconds)
+engine_auto_restart_min_delay: 300
+```
+
+#### Manual Operations
+
+Manual `dmg system stop` and `dmg system start` operations are never affected by
+the rate-limiting mechanism. Administrators can always immediately stop and start
+ranks regardless of recent automatic restart activity.
+
+```bash
+# Manual operations always work immediately
+$ dmg system stop --ranks=0,1,2
+$ dmg system start --ranks=0,1,2
+```
+
+When you manually stop or start ranks, the restart history for those ranks is
+automatically cleared, ensuring no delays from previous automatic restarts.
+
+#### Monitoring
+
+The `engine_self_terminated` RAS event is logged when an engine self-terminates