feat: workflow recovery for Kubernetes backend agents #5930
This PR introduces a workflow recovery mechanism, specifically implemented for the Kubernetes backend. It allows pipelines to resume from their last known state after an agent restart by persisting workflow progress in Kubernetes ConfigMaps.
why
In larger deployments, agents must occasionally be updated or scaled. This currently interrupts all running CI jobs, because agents keep the workflow execution config in memory and lose it on restart. That is especially painful for long-running or critical workflows.
how
With the Kubernetes backend, step execution in running pods can continue independently of the agents, because agents are used to manage the workflow volume, secrets, and step pods. This PR introduces a bookkeeping mechanism that records the workflow's progress in ConfigMaps, allowing an agent to identify which steps are pending, running, or completed. If the executing agent is lost, the workflow becomes available again from the server queue and a new agent can continue its execution. This enables the pipeline to resume correctly after an agent restart or failure.
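To illustrate the bookkeeping idea, here is a minimal stdlib-only sketch, not the code in this PR. The names `WorkflowState`, `StepState`, `PlanResume`, the `"state"` key, and the JSON layout are all hypothetical; the only thing taken from Kubernetes is that a ConfigMap's `data` field is a string-to-string map. The sketch also assumes that a resuming agent skips completed steps, reattaches to running ones, and starts pending ones.

```go
package main

import (
	"encoding/json"
	"errors"
	"sort"
)

// StepState is a hypothetical status for a single pipeline step.
type StepState string

const (
	StepPending   StepState = "pending"
	StepRunning   StepState = "running"
	StepCompleted StepState = "completed"
)

// WorkflowState is a hypothetical progress record for one workflow.
type WorkflowState struct {
	WorkflowID string               `json:"workflow_id"`
	Steps      map[string]StepState `json:"steps"`
}

// ToConfigMapData encodes the state into the map[string]string shape
// used by a ConfigMap's data field. The "state" key is an assumption.
func (s WorkflowState) ToConfigMapData() (map[string]string, error) {
	raw, err := json.Marshal(s)
	if err != nil {
		return nil, err
	}
	return map[string]string{"state": string(raw)}, nil
}

// FromConfigMapData decodes a state previously written by ToConfigMapData.
func FromConfigMapData(data map[string]string) (WorkflowState, error) {
	var s WorkflowState
	raw, ok := data["state"]
	if !ok {
		return s, errors.New("missing \"state\" key")
	}
	err := json.Unmarshal([]byte(raw), &s)
	return s, err
}

// ResumePlan groups step names by what a new agent would do with them.
type ResumePlan struct {
	Skip     []string // already completed
	Reattach []string // pod still running, wait for it
	Start    []string // not started yet
}

// PlanResume derives a plan from the recorded state.
// Names are sorted so the result is deterministic.
func PlanResume(s WorkflowState) ResumePlan {
	var p ResumePlan
	for name, st := range s.Steps {
		switch st {
		case StepCompleted:
			p.Skip = append(p.Skip, name)
		case StepRunning:
			p.Reattach = append(p.Reattach, name)
		default:
			p.Start = append(p.Start, name)
		}
	}
	sort.Strings(p.Skip)
	sort.Strings(p.Reattach)
	sort.Strings(p.Start)
	return p
}
```

In this sketch, an agent picking up a workflow from the queue would read the ConfigMap, call `FromConfigMapData`, and then act on the result of `PlanResume` instead of starting every step from scratch.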
If all agents are offline while workflows are executing and those workflows are cancelled, their state ConfigMaps are left behind. To handle this, a cleanup goroutine runs on a single agent chosen by leader election.
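The cleanup described above could look roughly like the following stdlib-only sketch. It is an illustration under assumptions, not the PR's implementation: `OrphanedStates`, `RunCleanup`, the `isLeader` callback, and the idea of comparing against a set of still-active workflow IDs are all hypothetical, and the actual leader election and ConfigMap deletion are left to the caller.

```go
package main

import (
	"context"
	"sort"
	"time"
)

// OrphanedStates returns the names of state ConfigMaps whose workflow
// is no longer active. stateByWorkflow maps workflow ID to ConfigMap
// name; active holds the IDs still considered live (an assumption
// about how liveness would be determined).
func OrphanedStates(stateByWorkflow map[string]string, active map[string]bool) []string {
	var orphans []string
	for wfID, cmName := range stateByWorkflow {
		if !active[wfID] {
			orphans = append(orphans, cmName)
		}
	}
	sort.Strings(orphans) // deterministic order
	return orphans
}

// RunCleanup calls sweep on every tick, but only while isLeader
// reports true, so at most one agent does the cleanup at a time.
// It returns when ctx is cancelled.
func RunCleanup(ctx context.Context, interval time.Duration, isLeader func() bool, sweep func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if isLeader() {
				sweep()
			}
		}
	}
}
```

An agent would start this with `go RunCleanup(ctx, interval, isLeader, sweep)`, where `sweep` lists the state ConfigMaps, calls `OrphanedStates`, and deletes the result. Gating on `isLeader` inside the loop means an agent that loses leadership stops sweeping without restarting the goroutine.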
what else