status-reconciler started retiring the whole world when its configuration became corrupted #540

@petr-muller

Description

A while ago the OpenShift Prow deployment encountered a situation where status-reconciler unexpectedly started retiring numerous existing job results on open pull requests, like:

ci/prow/images — Context retired without replacement.

In OpenShift, components are supplied configuration through a git-sync sidecar, and we observed that sidecar failing to fetch the repository content:

{"file":"main.go","line":927},"msg":"error syncing repo, will retry","error":"Run(git fetch https://github.com/openshift/release.git master --verbose --no-progress --prune --no-auto-gc --depth 1): context deadline exceeded: { stdout: \"\", stderr: \"POST git-upload-pack (317 bytes)\\nPOST git-upload-pack (272 bytes)\" }","failCount":1}
{"logger":"","ts":"2025-10-14 02:46:52.486169","caller":{"file":"main.go","line":1350},"msg":"repo contains lock file","error":null,"path":"/tmp/git-sync/.git/shallow.lock"}
{"logger":"","ts":"2025-10-14 02:46:52.486222","caller":{"file":"main.go","line":1237},"level":0,"msg":"repo directory was empty or failed checks","path":"/tmp/git-sync"}

We also observed status-reconciler failing to load its config from disk:

{"component":"status-reconciler","error":"stat /var/repo/release/ci-operator/jobs: no such file or directory","file":"sigs.k8s.io/prow/pkg/config/agent.go:371","func":"sigs.k8s.io/prow/pkg/config.(*Agent).Start.func1","jobConfig":"/var/repo/release/ci-operator/jobs","level":"error","msg":"Error loading config.","prowConfig":"/etc/config/config.yaml","severity":"error","time":"2025-10-14T10:11:43Z"}

This did not stop it from doing its job: reconciling statuses on in-flight pull requests against the set of jobs configured for the given org/repo/branch. Because it was failing to load the Prow and job config from disk (see above), it apparently saw an empty job config, and so it was reconciling the world to the desired state of "no jobs exist":

{"client":"github","component":"status-reconciler","duration":"12.404288439s","file":"sigs.k8s.io/prow/pkg/github/client.go:806","func":"sigs.k8s.io/prow/pkg/github.(*client).log.func2","level":"debug","msg":"CreateStatus(opendatahub-io, odh-dashboard, 8e06c54e2bd62e79b070f1492271dc87d1503233, {success Context retired without replacement. ci/prow/odh-dashboard-pr-image-mirror}) finished","severity":"debug","time":"2025-10-14T10:11:43Z"}

This is obviously catastrophic: retiring existing contexts overwrote the results of jobs on open PRs with a false passing signal, potentially allowing those PRs to merge. (I am not entirely sure about Tide's behavior when it encounters a mergeable PR with a retired green status that matches a required existing job; Tide's logic of retesting stale results before merge may save the day by forcing the falsely retired jobs to be re-run. I will open a separate issue to check this behavior, and if Tide does not behave this way, we may want to make it so.)

status-reconciler should be fixed to never actuate unless it has a provably good config loaded. If its dynamic config loading is failing (as in the example above), it should cease reconciling and strongly signal, at least through metrics, that it is not actuating right now. We may also consider flipping its liveness probe endpoint to failing, to signal to Kubernetes that the pod is unhealthy and cause it to restart. A sketch of such a guard is below.
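
A minimal Go sketch of that guard, assuming a reconciler loop we control (names are hypothetical, not the actual status-reconciler code): skip actuation unless the config loaded successfully and recently, export a gauge that says whether we are actuating, and fail the liveness endpoint when we are not:

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Gauge exported so alerting can catch prolonged non-actuation.
var actuating = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "status_reconciler_actuating",
	Help: "1 if the reconciler has a provably good config and is actuating, 0 otherwise.",
})

// Assumption: a config older than this is stale; tune to the reload period.
const maxConfigAge = 10 * time.Minute

func healthy(lastGood time.Time) bool {
	return !lastGood.IsZero() && time.Since(lastGood) < maxConfigAge
}

func main() {
	prometheus.MustRegister(actuating)
	http.Handle("/metrics", promhttp.Handler())

	// Liveness probe: report unhealthy when the config is stale, so
	// Kubernetes restarts the pod instead of letting it reconcile
	// against a bad view of the world.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		if !healthy(lastGoodLoad()) {
			http.Error(w, "config load is failing; not actuating", http.StatusInternalServerError)
			return
		}
		w.Write([]byte("ok"))
	})

	go func() {
		for ; ; time.Sleep(time.Minute) {
			if !healthy(lastGoodLoad()) {
				actuating.Set(0)
				continue // never actuate on a missing or stale config
			}
			actuating.Set(1)
			reconcile()
		}
	}()

	http.ListenAndServe(":8081", nil)
}

// Stubs standing in for the real config agent and reconcile loop.
func lastGoodLoad() time.Time { return time.Now() }
func reconcile()              {}
```

Failing the liveness probe is deliberately aggressive: a restart gives the git-sync sidecar a fresh chance to repopulate the volume, while the gauge allows alerting on prolonged non-actuation.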


    Labels

    area/status-reconciler: Issues or PRs related to reconciling status when jobs change
    help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
    kind/bug: Categorizes issue or PR as related to a bug.
