feat: workflow recovery for Kubernetes backend agents #5930

hhamalai · 2026-01-09T15:06:35Z

This PR introduces a workflow recovery mechanism, specifically implemented for the Kubernetes backend. It allows pipelines to resume from their last known state after an agent restart by persisting workflow progress in Kubernetes ConfigMaps.

why

In larger deployments the agents must occasionally be updated or scaled. Because agents keep the workflow execution config only in memory, it is lost on restart and all running CI jobs are interrupted. This is a headache especially for long-running or critical workflows.

how

With the Kubernetes backend, step execution in running pods can continue independently of the agents, as agents are used to manage the workflow volume, secrets, and step pods. This PR introduces a bookkeeping mechanism that records each workflow's progress in ConfigMaps, allowing an agent to identify which steps are pending, running, or completed. If the executing agent is lost, the workflow becomes available again from the server queue and a new agent can continue its execution. This enables the pipeline to resume correctly after an agent restart or failure.
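A minimal sketch of the bookkeeping idea, not the code in this PR: the persisted state is modeled as a plain `map[string]string` (the same shape as a ConfigMap's `data` field), and a recovering agent derives one action per step from it. The names `StepState`, `Action`, and `PlanRecovery` are hypothetical, as is the choice to treat a step with no recorded state as pending.

```go
package main

import "fmt"

// StepState mirrors the three states named in the description above.
// The string values are an assumption made for this sketch.
type StepState string

const (
	StatePending   StepState = "pending"
	StateRunning   StepState = "running"
	StateCompleted StepState = "completed"
)

// Action is what a recovering agent does with a step (hypothetical).
type Action string

const (
	ActionStart    Action = "start"    // step never started: create its pod
	ActionReattach Action = "reattach" // pod may still be running: wait on it, do not recreate it
	ActionSkip     Action = "skip"     // step already finished: never run it again
)

// PlanRecovery maps persisted step states to recovery actions.
// data stands in for the ConfigMap data (step name -> state).
func PlanRecovery(steps []string, data map[string]string) map[string]Action {
	plan := make(map[string]Action, len(steps))
	for _, name := range steps {
		switch StepState(data[name]) {
		case StateCompleted:
			plan[name] = ActionSkip
		case StateRunning:
			plan[name] = ActionReattach
		default: // pending, or no record at all
			plan[name] = ActionStart
		}
	}
	return plan
}

func main() {
	steps := []string{"clone", "build", "test"}
	// State left behind by the lost agent: clone finished, build was in flight.
	data := map[string]string{"clone": "completed", "build": "running"}

	plan := PlanRecovery(steps, data)
	for _, name := range steps {
		fmt.Printf("%s: %s\n", name, plan[name])
	}
}
```

Keeping the decision a pure function of the persisted state means a running or completed step is never mapped to `start`, which is the property that prevents a step from executing twice.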

If all agents are offline while workflows are executing and those workflows are then cancelled, their state ConfigMaps are left behind. To handle this, a cleanup goroutine runs on a single agent chosen by leader election.
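The cleanup pass could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's implementation: `StateRecord`, `OrphanedRecords`, `CleanupOnce`, the `wp-state-` names, and the grace period are all hypothetical, and leadership is passed in as a boolean instead of being obtained through a Kubernetes Lease.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// StateRecord is a hypothetical summary of one workflow-state ConfigMap.
type StateRecord struct {
	Name       string    // ConfigMap name
	WorkflowID string    // workflow the state belongs to
	UpdatedAt  time.Time // last time the state was written
}

// OrphanedRecords returns the names of records whose workflow is no longer
// active and that have not been updated within the grace period.
// The grace period is an assumption added to avoid racing a recovering agent.
func OrphanedRecords(records []StateRecord, active map[string]bool, now time.Time, grace time.Duration) []string {
	var orphans []string
	for _, r := range records {
		if active[r.WorkflowID] {
			continue // workflow still known to the server: keep its state
		}
		if now.Sub(r.UpdatedAt) < grace {
			continue // too recent to be sure it is abandoned
		}
		orphans = append(orphans, r.Name)
	}
	sort.Strings(orphans)
	return orphans
}

// CleanupOnce performs one cleanup pass, but only on the elected leader,
// so that at most one agent deletes leftover state.
func CleanupOnce(isLeader bool, records []StateRecord, active map[string]bool, now time.Time, grace time.Duration) []string {
	if !isLeader {
		return nil
	}
	return OrphanedRecords(records, active, now, grace)
}

// Sample data shared by main and by callers who want to try the sketch.
var sampleNow = time.Date(2026, 1, 9, 12, 0, 0, 0, time.UTC)

const sampleGrace = 10 * time.Minute

func sampleRecords(now time.Time) []StateRecord {
	return []StateRecord{
		{Name: "wp-state-101", WorkflowID: "101", UpdatedAt: now.Add(-2 * time.Hour)},
		{Name: "wp-state-102", WorkflowID: "102", UpdatedAt: now.Add(-2 * time.Hour)},
		{Name: "wp-state-103", WorkflowID: "103", UpdatedAt: now.Add(-1 * time.Minute)},
	}
}

func main() {
	active := map[string]bool{"102": true} // only workflow 102 is still active

	fmt.Println("leader:  ", CleanupOnce(true, sampleRecords(sampleNow), active, sampleNow, sampleGrace))
	fmt.Println("follower:", CleanupOnce(false, sampleRecords(sampleNow), active, sampleNow, sampleGrace))
}
```

In this sample only the leader reports a deletion candidate: workflow 102 is still active and the record for workflow 103 is inside the grace period, so both are kept.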

what else

the Woodpecker Helm chart must be updated to grant extra RBAC permissions (list/get persistentvolumeclaims, manage leases for leader election)
I tried to keep the changes to non-Kubernetes backends minimal and free of side effects when workflow recovery is not used on the Kubernetes backend
recovered workflows might show duplicate logs in the UI (the original agent streams logs until it is deleted, and the new agent taking over the workflow streams the same logs from the beginning).
- under no circumstances should the same step be executed twice.

qwerty287 · 2026-01-17T07:00:07Z

Hi @hhamalai and thanks for the PR. Sorry that we haven't come back to you over the last few days. I'll try to check this PR later with my limited k8s knowledge.

But as this is quite a big feature, somebody else from @woodpecker-ci/maintainers should take a look as well.

And, just wondering, would something similar be possible with docker too? Because I'd like to avoid splitting features too much between backends. If possible, a feature should be supported on all backends.

xoxys · 2026-01-17T07:04:40Z

Have not checked the implementation yet but wondering why Kubernetes itself needs to be the persistent layer here. Can't the wp server/db be used to keep track of it regardless of the backend?

hhamalai · 2026-01-19T10:05:18Z

> Have not checked the implementation yet but wondering why Kubernetes itself needs to be the persistent layer here. Can't the wp server/db be used to keep track of it regardless of the backend?

Currently the whole workflow is handed to the agent, so the Woodpecker server doesn't know the state of individual steps within a workflow.

hhamalai · 2026-01-19T10:13:22Z

> And, just wondering, would something similar be possible with docker too? Because I'd like to avoid splitting features too much between backends. If possible, a feature should be supported on all backends.

It would be great, but I haven't yet checked how the docker agent executes workflows. The fundamental problem with the current Kubernetes agent is that the agents run as a StatefulSet, and updating the agent image or configuration restarts all pods in the StatefulSet, which interrupts all running jobs. This makes it next to impossible to find suitable upgrade windows without blocking the agents for extended periods of time: the agents have to stop taking jobs, but long builds must finish first, so the maintenance break lasts at least as long as the longest build that is still executing before the StatefulSet can be restarted without affecting running builds.
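To put the maintenance-window argument in numbers, here is a small sketch with made-up values. `DrainMinutes` is a hypothetical helper, not part of Woodpecker: once agents stop accepting jobs, the earliest safe restart is bounded below by the longest remaining build.

```go
package main

import "fmt"

// DrainMinutes returns how long agents must stay blocked before a restart
// can no longer interrupt a build, given the remaining runtime in minutes
// of every build that is still executing.
func DrainMinutes(remaining []int) int {
	longest := 0
	for _, r := range remaining {
		if r > longest {
			longest = r
		}
	}
	return longest
}

func main() {
	// Illustrative values only: three builds in flight when the drain starts.
	remaining := []int{5, 180, 40}
	fmt.Printf("agents blocked for at least %d minutes\n", DrainMinutes(remaining))
}
```

With one three-hour build in flight, no new jobs are picked up for three hours, even though the other builds finish within 40 minutes.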

hhamalai added 3 commits January 9, 2026 16:56

feat: introduce workflow recovery framework for Kubernetes backend

f4951fd

Merge branch 'main' into agent_recovery

8a4792a

Merge branch 'main' into agent_recovery

a1e0a45

qwerty287 added the feature (add new functionality) and backend/kubernetes labels Jan 17, 2026
