Conversation

@hhamalai
Contributor

@hhamalai hhamalai commented Jan 9, 2026

This PR introduces a workflow recovery mechanism, specifically implemented for the Kubernetes backend. It allows pipelines to resume from their last known state after an agent restart by persisting workflow progress in Kubernetes ConfigMaps.

why

For larger deployments the agents must occasionally be updated or scaled, which currently interrupts all running CI jobs: agents keep the workflow execution state only in memory, so it is lost on restart. This is a headache especially for long-running or critical workflows.

how

With the Kubernetes backend, the actual step execution in the running pods can continue independently of the agents, since agents only manage the workflow volume, secrets, and step pods. This PR introduces a bookkeeping mechanism to maintain a record of the workflow's progress. The bookkeeping is done with ConfigMaps, allowing an agent to identify which steps are pending, running, or completed. If the executing agent is lost, the workflow becomes available again on the server queue and a new agent can continue the execution. This enables the pipeline to resume correctly after an agent restart or failure.
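Roughly, the bookkeeping looks like the sketch below. This is a simplified illustration, not the exact code in this PR; the ConfigMap naming scheme, the data layout, and the state strings are assumptions for illustration.

```go
package recovery

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// saveStepState records a step's state ("pending", "running", "completed") in a
// per-workflow ConfigMap so that a new agent can later resume from it.
func saveStepState(ctx context.Context, client kubernetes.Interface, namespace, workflowID, stepName, state string) error {
	name := "wp-workflow-state-" + workflowID // hypothetical naming scheme

	cm, err := client.CoreV1().ConfigMaps(namespace).Get(ctx, name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		// First write for this workflow: create the state ConfigMap.
		cm = &corev1.ConfigMap{
			ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
			Data:       map[string]string{stepName: state},
		}
		_, err = client.CoreV1().ConfigMaps(namespace).Create(ctx, cm, metav1.CreateOptions{})
		return err
	}
	if err != nil {
		return err
	}

	if cm.Data == nil {
		cm.Data = map[string]string{}
	}
	cm.Data[stepName] = state
	_, err = client.CoreV1().ConfigMaps(namespace).Update(ctx, cm, metav1.UpdateOptions{})
	return err
}
```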

If all agents are offline while workflows are executing and those workflows are cancelled, their state ConfigMaps are left behind. That is why there is a cleanup goroutine, running on a single agent chosen via leader election.
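The cleanup could look roughly like this. Again a sketch with assumed names: the lease name, label selector, intervals, and retention rule are illustrative, not necessarily what the PR uses.

```go
package recovery

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runStateCleanup runs a cleanup loop gated by Kubernetes leader election, so
// only one agent at a time removes orphaned workflow-state ConfigMaps.
func runStateCleanup(ctx context.Context, client kubernetes.Interface, namespace, agentID string) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "woodpecker-state-cleanup", Namespace: namespace}, // assumed lease name
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: agentID},
	}

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   30 * time.Second,
		RenewDeadline:   20 * time.Second,
		RetryPeriod:     5 * time.Second,
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				ticker := time.NewTicker(10 * time.Minute)
				defer ticker.Stop()
				for {
					select {
					case <-ctx.Done():
						return
					case <-ticker.C:
						// Hypothetical label marking workflow-state ConfigMaps.
						cms, err := client.CoreV1().ConfigMaps(namespace).List(ctx,
							metav1.ListOptions{LabelSelector: "woodpecker-ci.org/workflow-state=true"})
						if err != nil {
							continue
						}
						for _, cm := range cms.Items {
							// Hypothetical retention rule: only remove state old enough to be
							// assumed abandoned/cancelled.
							if time.Since(cm.CreationTimestamp.Time) < 24*time.Hour {
								continue
							}
							_ = client.CoreV1().ConfigMaps(namespace).Delete(ctx, cm.Name, metav1.DeleteOptions{})
						}
					}
				}
			},
			OnStoppedLeading: func() {},
		},
	})
}
```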

what else

  • the woodpecker helm chart must be updated to grant extra RBAC permissions (list/get persistentvolumeclaims, manage leases for leader election)
  • tried to keep the changes to the non-Kubernetes backends minimal, and without side effects when workflow recovery is not used on the Kubernetes backend
  • recovered workflows might produce duplicate logging visible in the UI (the original agent streams logs until it is deleted; the new agent taking over the workflow streams the same logs from the beginning).
    • under no circumstances should the same step be executed twice (see the sketch after this list).
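For that last point, the recovery path only has to filter out the steps the previous agent already recorded as completed. A trivial sketch (hypothetical helper, not the PR's code):

```go
// stepsToRun returns the steps a resuming agent still needs to execute, given
// the step states read back from the workflow's state ConfigMap. Completed
// steps are skipped, so no step ever runs twice.
func stepsToRun(allSteps []string, persisted map[string]string) []string {
	var remaining []string
	for _, step := range allSteps {
		if persisted[step] == "completed" {
			continue // finished before the previous agent went away
		}
		remaining = append(remaining, step)
	}
	return remaining
}
```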

@qwerty287
Contributor

Hi @hhamalai and thanks for the PR. Sorry that we didn't get back to you over the last few days. I'll try to check this PR later with my limited k8s knowledge.

But as this is quite a big feature, somebody else from @woodpecker-ci/maintainers should take a look as well.

And, just wondering, would something similar be possible with docker too? Because I'd like to avoid splitting features too much between backends. If possible, a feature should be supported on all backends.

@qwerty287 qwerty287 added feature add new functionality backend/kubernetes labels Jan 17, 2026
@xoxys
Member

xoxys commented Jan 17, 2026

Have not checked the implementation yet but wondering why Kubernetes itself needs to be the persistence layer here. Can't the wp server/db be used to keep track of it regardless of the backend?

@hhamalai
Contributor Author

> Have not checked the implementation yet but wondering why Kubernetes itself needs to be the persistence layer here. Can't the wp server/db be used to keep track of it regardless of the backend?

Currently the whole workflow is given to the agent, so the Woodpecker server doesn't know about individual step states within a workflow.

@hhamalai
Contributor Author

> And, just wondering, would something similar be possible with docker too? Because I'd like to avoid splitting features too much between backends. If possible, a feature should be supported on all backends.

It would be great; I haven't checked yet how the Docker agent executes workflows. The fundamental problem with the current Kubernetes agent is that agents run as a StatefulSet, and updating the agent image/configuration restarts all pods in the StatefulSet, which interrupts all running jobs. That makes it next to impossible to find suitable upgrade windows without blocking the agents for extended periods of time (i.e. you can stop agents from taking new jobs, but long builds still have to finish first, so the maintenance break lasts at least as long as the longest-running build before the StatefulSet can be safely restarted without affecting running builds).
