Identified by automated analysis of ARTESCA-16355
Confidence: high
What needs to change
File: salt/metalk8s/orchestrate/deploy_node.sls
Version: 131.0.9
Add an onfail handler to deploy_node.sls that uncordons the node when any step in the orchestration fails, following the same pattern already used in upgraded.sls:
# Add after the 'Uncordon the node' state in deploy_node.sls
Uncordon the node on failure:
  metalk8s_cordon.node_uncordoned:
    - name: {{ node_name }}
    - onfail:
      - salt: Run the highstate
Additionally, consider adding a pre-flight check in the ARTESCA installer's deploy_platform function (or in deployed.sls) to uncordon all nodes at the start of a redeploy. This provides defense-in-depth in case the onfail handler doesn't fire correctly:
# Add at the beginning of deployed.sls, before node deployments
Ensure all nodes are uncordoned before deployment:
  module.run:
    - artesca_kubernetes.run_kubectl_cmd:
      - cmd: "uncordon --selector=metalk8s.scality.com/version"
      - output: ""
    - require:
      - artesca: Ensure pods in kube-system namespace on {{ nodes.bootstrap }} are ready
The primary fix should be in MetalK8s (deploy_node.sls) since that's where the cordon/uncordon lifecycle is managed. The ARTESCA-level fix is a secondary safety net.
Root Cause
The root cause is in MetalK8s's deploy_node.sls Salt orchestration. During node deployment, the orchestration cordons the node (sets unschedulable=True) at line 140 before running the highstate. The uncordon step at line 313 has a hard require dependency on the highstate succeeding. If the highstate or any preceding step fails, the Salt orchestration aborts and the uncordon state is never executed, leaving the node permanently in SchedulingDisabled status.
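The failing dependency chain can be sketched in SLS form. This is a simplified reconstruction, not the literal file contents: the metalk8s_cordon.node_cordoned name and the salt.state parameters are assumptions extrapolated from the state IDs cited in the evidence.

```yaml
# Simplified sketch of the current deploy_node.sls flow (parameters elided)
Cordon the node:                  # ~L140: sets spec.unschedulable=True
  metalk8s_cordon.node_cordoned:  # state module name assumed
    - name: {{ node_name }}

Run the highstate:                # ~L285: if this fails, the orchestration aborts
  salt.state:                     # exact salt.state options assumed
    - tgt: {{ node_name }}
    - highstate: True
    - require:
      - metalk8s_cordon: Cordon the node

Uncordon the node:                # ~L313: hard 'require' on the highstate
  metalk8s_cordon.node_uncordoned:
    - name: {{ node_name }}
    - require:
      - salt: Run the highstate   # on failure, Salt skips this state entirely
```

Because Salt skips any state whose `require` target fails, and there is no companion state with an `onfail` requisite, no code path exists that can clear `unschedulable` once the highstate errors out.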
In the reported scenario: (1) First deploy attempt runs deploy_node for all 3 nodes — each node is cordoned → highstate → uncordoned. (2) The deploy base step then fails at "Create Keycloak 'StorageManager' main role". (3) On redeploy, deploy platform re-runs deploy_node for each node. Each node is cordoned again (idempotent). If node-3's highstate encounters any issue during the re-deployment (possibly caused by partially-applied state from the first attempt), the orchestration fails and the uncordon step is skipped — leaving node-3 stuck in SchedulingDisabled.
This contrasts with the MetalK8s upgrade orchestration (upgraded.sls) which properly includes an onfail handler (lines 40-47) that uncordons all nodes when the upgrade fails. The deploy_node.sls orchestration has no equivalent error-handling mechanism.
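Based on that description, the upgraded.sls handler presumably looks something like the following sketch. The loop variable, state IDs, and the choice of `onfail_any` are assumptions for illustration, not the literal lines 40-47:

```yaml
# Approximate shape of the upgraded.sls failure handler (see upgraded.sls L40-47)
{%- for node in nodes %}
Uncordon {{ node }} on upgrade failure:
  metalk8s_cordon.node_uncordoned:
    - name: {{ node }}
    - onfail_any:
      - salt: Upgrade {{ node }}   # state ID assumed
{%- endfor %}
```

An `onfail`-style requisite inverts the dependency: the state runs only when the referenced state fails, which is exactly the recovery semantics deploy_node.sls is missing.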
Evidence
salt/metalk8s/orchestrate/deploy_node.sls
├── L140: Cordon the node: ... — The node is cordoned (SchedulingDisabled) early in the deploy_node orchestration, before the highstate runs. This sets the node's spec.unschedulable=True in the Kubernetes API.
├── L285: Run the highstate: ... — The highstate runs AFTER the cordon step and is a prerequisite for the uncordon. If this step fails for any reason, the uncordon will never execute.
└── L313: Uncordon the node: ... — The uncordon step has a hard 'require' on 'Run the highstate'. In Salt, if a required state fails, dependent states are skipped entirely. There is NO onfail handler to uncordon on error — the node stays cordoned forever if the highstate fails.
salt/metalk8s/orchestrate/upgraded.sls
└── L40: # Automatically uncordon nodes on upgrade failure ... — The UPGRADE orchestration correctly handles failures by using an 'onfail' handler to uncordon all nodes when the upgrade fails. This pattern is missing from deploy_node.sls, confirming it's a known concern that was addressed for upgrades but not for initial deploys.
installer/artesca_installer/salt/artesca/metalk8s/deployed.sls
└── L247: Deploy node {{ node }}: ... — The ARTESCA installer calls MetalK8s deploy_node orchestration for each remote node. If this orchestration fails, the ARTESCA installer has no cleanup logic to uncordon the node either — there's no error recovery at either the MetalK8s or ARTESCA level.
Upstream Impact
High severity. When a 3-node ARTESCA deployment fails for any reason (including transient Keycloak errors), a node can be left permanently in SchedulingDisabled state. This prevents the user from successfully redeploying ARTESCA because: (1) pods cannot be scheduled on the affected node, (2) the same deploy_node sequence re-cordons the already-cordoned node without fixing the underlying issue. The user has no way to recover without manual intervention (running kubectl uncordon). This affects all multi-node ARTESCA deployments (the 3-node HA configuration is the primary deployment model).