
Node stays SchedulingDisabled after failed deployment due to missing uncordon... #4874

@eve-scality

Description

Identified by automated analysis of ARTESCA-16355
Confidence: high

What needs to change

File: salt/metalk8s/orchestrate/deploy_node.sls
Version: 131.0.9
Add an onfail handler to deploy_node.sls that uncordons the node when any step in the orchestration fails, following the same pattern already used in upgraded.sls:

# Add after the 'Uncordon the node' state in deploy_node.sls
Uncordon the node on failure:
  metalk8s_cordon.node_uncordoned:
    - name: {{ node_name }}
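    # onfail: this state runs only when 'Run the highstate' fails; on success it is
    # skipped and the existing 'Uncordon the node' state (which requires the
    # highstate) performs the uncordon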
    - onfail:
      - salt: Run the highstate

Additionally, consider adding a pre-flight check in the ARTESCA installer's deploy_platform function (or in deployed.sls) to uncordon all nodes at the start of a redeploy. This provides defense-in-depth in case the onfail handler doesn't fire correctly:

# Add at the beginning of deployed.sls, before node deployments
Ensure all nodes are uncordoned before deployment:
  module.run:
    - artesca_kubernetes.run_kubectl_cmd:
      - cmd: "uncordon --selector=metalk8s.scality.com/version"
      - output: ""
    - require:
      - artesca: Ensure pods in kube-system namespace on {{ nodes.bootstrap }} are ready
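
If the custom artesca_kubernetes.run_kubectl_cmd execution module is not usable in this context, a plain cmd.run state gives the same effect. The following is a minimal sketch only: the state ID, the assumption that kubectl is on the master's PATH, and the /etc/kubernetes/admin.conf kubeconfig path are all unverified assumptions.

# Alternative sketch using a generic cmd.run state (kubeconfig path is an assumption)
Ensure all nodes are uncordoned before deployment (cmd.run variant):
  cmd.run:
    - name: kubectl uncordon --selector=metalk8s.scality.com/version
    - env:
      - KUBECONFIG: /etc/kubernetes/admin.conf
    - require:
      - artesca: Ensure pods in kube-system namespace on {{ nodes.bootstrap }} are ready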

The primary fix should be in MetalK8s (deploy_node.sls) since that's where the cordon/uncordon lifecycle is managed. The ARTESCA-level fix is a secondary safety net.

Root Cause

The root cause is in MetalK8s's deploy_node.sls Salt orchestration. During node deployment, the orchestration cordons the node (sets unschedulable=True) at line 140 before running the highstate. The uncordon step at line 313 has a hard require dependency on the highstate succeeding. If the highstate or any preceding step fails, the Salt orchestration aborts and the uncordon state is never executed, leaving the node permanently in SchedulingDisabled status.
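
For illustration, here is a heavily simplified sketch of the dependency chain described above. The state IDs are taken from the evidence section below; the module names, arguments, and requisites are assumptions used only to show the shape of the problem, not the actual contents of deploy_node.sls:

# Simplified sketch, not the real file: 'metalk8s_cordon.node_cordoned' and the
# exact arguments are assumptions
Cordon the node:
  metalk8s_cordon.node_cordoned:
    - name: {{ node_name }}

Run the highstate:
  salt.state:
    - tgt: {{ node_name }}
    - highstate: True
    - require:
      - metalk8s_cordon: Cordon the node

Uncordon the node:
  metalk8s_cordon.node_uncordoned:
    - name: {{ node_name }}
    - require:
      - salt: Run the highstate  # hard require: skipped whenever the highstate fails

Because a require requisite only lets the dependent state run when the required state succeeds, a highstate failure leaves 'Uncordon the node' skipped and the node cordoned.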

In the reported scenario: (1) First deploy attempt runs deploy_node for all 3 nodes — each node is cordoned → highstate → uncordoned. (2) The deploy base step then fails at "Create Keycloak 'StorageManager' main role". (3) On redeploy, deploy platform re-runs deploy_node for each node. Each node is cordoned again (idempotent). If node-3's highstate encounters any issue during the re-deployment (possibly caused by partially-applied state from the first attempt), the orchestration fails and the uncordon step is skipped — leaving node-3 stuck in SchedulingDisabled.

This contrasts with the MetalK8s upgrade orchestration (upgraded.sls) which properly includes an onfail handler (lines 40-47) that uncordons all nodes when the upgrade fails. The deploy_node.sls orchestration has no equivalent error-handling mechanism.
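
For comparison, a sketch of what that pattern looks like in spirit; the actual lines 40-47 of upgraded.sls are not reproduced here, and the loop variable, state IDs, and onfail target below are illustrative assumptions only:

# Illustrative only: loop variable and state IDs are assumptions, not the real upgraded.sls
{%- for node in nodes_to_upgrade %}
Uncordon {{ node }} on upgrade failure:
  metalk8s_cordon.node_uncordoned:
    - name: {{ node }}
    - onfail:
      - salt: Upgrade node {{ node }}
{%- endfor %}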

Evidence

salt/metalk8s/orchestrate/deploy_node.sls
├── L140: Cordon the node: ...  — The node is cordoned (SchedulingDisabled) early in the deploy_node orchestration, before the highstate runs. This sets the node's spec.unschedulable=True in the Kubernetes API.
├── L285: Run the highstate: ...  — The highstate runs AFTER the cordon step and is a prerequisite for the uncordon. If this step fails for any reason, the uncordon will never execute.
└── L313: Uncordon the node: ...  — The uncordon step has a hard 'require' on 'Run the highstate'. In Salt, if a required state fails, dependent states are skipped entirely. There is NO onfail handler to uncordon on error — the node stays cordoned forever if the highstate fails.
salt/metalk8s/orchestrate/upgraded.sls
└── L40: # Automatically uncordon nodes on upgrade failure ...  — The UPGRADE orchestration correctly handles failures by using an 'onfail' handler to uncordon all nodes when the upgrade fails. This pattern is missing from deploy_node.sls, confirming it's a known concern that was addressed for upgrades but not for initial deploys.
installer/artesca_installer/salt/artesca/metalk8s/deployed.sls
└── L247: Deploy node {{ node }}: ...  — The ARTESCA installer calls MetalK8s deploy_node orchestration for each remote node. If this orchestration fails, the ARTESCA installer has no cleanup logic to uncordon the node either — there's no error recovery at either the MetalK8s or ARTESCA level.

Upstream Impact

High severity. When a 3-node ARTESCA deployment fails for any reason (including transient Keycloak errors), a node can be left permanently in SchedulingDisabled state. This prevents the user from successfully redeploying ARTESCA because: (1) pods cannot be scheduled on the affected node, (2) the same deploy_node sequence re-cordons the already-cordoned node without fixing the underlying issue. The user has no way to recover without manual intervention (running kubectl uncordon). This affects all multi-node ARTESCA deployments (the 3-node HA configuration is the primary deployment model).
