status-reconciler started retiring the whole world when its configuration became corrupted #540

@petr-muller

Description

A while ago the OpenShift Prow deployment encountered a situation where status-reconciler unexpectedly started retiring numerous existing job results on open pull requests, like:

ci/prow/images — Context retired without replacement.

In OpenShift, components are supplied configuration through a git-sync sidecar, and we observed that sidecar failing to fetch the repository content:

{"file":"main.go","line":927},"msg":"error syncing repo, will retry","error":"Run(git fetch https://github.com/openshift/release.git master --verbose --no-progress --prune --no-auto-gc --depth 1): context deadline exceeded: { stdout: \"\", stderr: \"POST git-upload-pack (317 bytes)\\nPOST git-upload-pack (272 bytes)\" }","failCount":1}
{"logger":"","ts":"2025-10-14 02:46:52.486169","caller":{"file":"main.go","line":1350},"msg":"repo contains lock file","error":null,"path":"/tmp/git-sync/.git/shallow.lock"}
{"logger":"","ts":"2025-10-14 02:46:52.486222","caller":{"file":"main.go","line":1237},"level":0,"msg":"repo directory was empty or failed checks","path":"/tmp/git-sync"}

We also observed status-reconciler failing to load its config from disk:

{"component":"status-reconciler","error":"stat /var/repo/release/ci-operator/jobs: no such file or directory","file":"sigs.k8s.io/prow/pkg/config/agent.go:371","func":"sigs.k8s.io/prow/pkg/config.(*Agent).Start.func1","jobConfig":"/var/repo/release/ci-operator/jobs","level":"error","msg":"Error loading config.","prowConfig":"/etc/config/config.yaml","severity":"error","time":"2025-10-14T10:11:43Z"}

This did not stop it from doing its job: reconciling statuses on in-flight pull requests against the set of jobs configured for the given org/repo/branch. Because it was failing to load the Prow and job config from disk (see above), it apparently saw an empty job config, and so it was reconciling the world to the desired state of "no jobs exist":

{"client":"github","component":"status-reconciler","duration":"12.404288439s","file":"sigs.k8s.io/prow/pkg/github/client.go:806","func":"sigs.k8s.io/prow/pkg/github.(*client).log.func2","level":"debug","msg":"CreateStatus(opendatahub-io, odh-dashboard, 8e06c54e2bd62e79b070f1492271dc87d1503233, {success Context retired without replacement. ci/prow/odh-dashboard-pr-image-mirror}) finished","severity":"debug","time":"2025-10-14T10:11:43Z"}

This is obviously catastrophic: retiring existing contexts overwrote the results of jobs on open PRs with a false passing signal, potentially allowing those PRs to merge. (I am not entirely sure about Tide's behavior when it encounters a mergeable PR with a retired green status that matches a required existing job; Tide's logic of retesting stale results before merge may save the day by forcing the falsely retired jobs to be re-run. I will open a separate issue to check this behavior, and if Tide does not behave this way, we may want to make it so.)

status-reconciler should be fixed to never actuate unless it has a provably good config loaded. If its dynamic config loading is failing (as in the example above), it should cease reconciling and strongly signal, at least through metrics, that it is not actuating right now. We may also consider flipping its liveness probe endpoint to failing, to signal to Kubernetes that the pod is unhealthy and cause it to restart. A sketch of such a guard is below.
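
A minimal Go sketch of that guard, assuming a reconciler loop we control (names are hypothetical, not the actual status-reconciler code): skip actuation unless the config loaded successfully and recently, export a gauge that says whether we are actuating, and fail the liveness endpoint when we are not:

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Gauge exported so alerting can catch prolonged non-actuation.
var actuating = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "status_reconciler_actuating",
	Help: "1 if the reconciler has a provably good config and is actuating, 0 otherwise.",
})

// Assumption: a config older than this is stale; tune to the reload period.
const maxConfigAge = 10 * time.Minute

func healthy(lastGood time.Time) bool {
	return !lastGood.IsZero() && time.Since(lastGood) < maxConfigAge
}

func main() {
	prometheus.MustRegister(actuating)
	http.Handle("/metrics", promhttp.Handler())

	// Liveness probe: report unhealthy when the config is stale, so
	// Kubernetes restarts the pod instead of letting it reconcile
	// against a bad view of the world.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		if !healthy(lastGoodLoad()) {
			http.Error(w, "config load is failing; not actuating", http.StatusInternalServerError)
			return
		}
		w.Write([]byte("ok"))
	})

	go func() {
		for ; ; time.Sleep(time.Minute) {
			if !healthy(lastGoodLoad()) {
				actuating.Set(0)
				continue // never actuate on a missing or stale config
			}
			actuating.Set(1)
			reconcile()
		}
	}()

	http.ListenAndServe(":8081", nil)
}

// Stubs standing in for the real config agent and reconcile loop.
func lastGoodLoad() time.Time { return time.Now() }
func reconcile()              {}
```

Failing the liveness probe is deliberately aggressive: a restart gives the git-sync sidecar a fresh chance to repopulate the volume, while the gauge allows alerting on prolonged non-actuation.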


    Labels

    area/status-reconciler: Issues or PRs related to reconciling status when jobs change
    help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
    kind/bug: Categorizes issue or PR as related to a bug.
