Fix: allow requeueing cleanup without removing finalizer and add tests by alex-hunt-materialize · Pull Request #1994 · kube-rs/kube

alex-hunt-materialize · 2026-06-04T12:28:29Z

Allow requeueing cleanup without removing the finalizer.

Motivation

For long-running asynchronous cleanup tasks, the user previously had only two options:

Return an error. This would re-queue immediately with no control over the delay. This increases load on the controller and likely spams error traces.
Block until the cleanup task is complete. This takes a reconciliation slot for a potentially very long time.

There is also a serious foot-gun in the previous code. If a user returns an Ok(Action) from cleanup to re-queue the event, the finalizer gets removed immediately and the cleanup does not get re-invoked. There is nothing to indicate to the user that this is an incorrect use of the library. From the user's perspective, this should just re-queue and not remove the finalizer.

Solution

If the result of the cleanup function is Ok(Action { requeue_after: Some(Duration) }), short circuit before removing the finalizer to just return that action.
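The short circuit described above can be sketched with a small standalone model. This is not the kube-rs source: `Action` here is a simplified stand-in for `kube::runtime::controller::Action`, and `CleanupOutcome` and `after_cleanup` are hypothetical names introduced only for illustration. Only `wants_requeue()` is taken from the diff under review.

```rust
use std::time::Duration;

/// Simplified stand-in for kube's `Action` (assumption: not the real type).
#[derive(Debug, Clone, PartialEq)]
pub struct Action {
    requeue_after: Option<Duration>,
}

impl Action {
    /// Ask to be reconciled again after `after`.
    pub fn requeue(after: Duration) -> Self {
        Action { requeue_after: Some(after) }
    }

    /// Do nothing until the object changes again.
    pub fn await_change() -> Self {
        Action { requeue_after: None }
    }

    /// True when a requeue delay was requested.
    pub fn wants_requeue(&self) -> bool {
        self.requeue_after.is_some()
    }
}

/// Hypothetical summary of what happens after cleanup returned `Ok(action)`.
#[derive(Debug, PartialEq)]
pub enum CleanupOutcome {
    /// Cleanup is still in progress: keep the finalizer and run cleanup again later.
    KeepFinalizer(Action),
    /// Cleanup has finished: the finalizer can be removed.
    RemoveFinalizer(Action),
}

/// Models the proposed rule: a requested requeue short-circuits finalizer removal.
pub fn after_cleanup(action: Action) -> CleanupOutcome {
    if action.wants_requeue() {
        CleanupOutcome::KeepFinalizer(action)
    } else {
        CleanupOutcome::RemoveFinalizer(action)
    }
}
```

Under this model, `after_cleanup(Action::requeue(..))` keeps the finalizer, while `after_cleanup(Action::await_change())` allows it to be removed.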

I also added some unit tests for the finalizer code.

Signed-off-by: Alex Hunt <alex.hunt@materialize.com>

codecov · 2026-06-04T12:44:32Z

Codecov Report

❌ Patch coverage is 98.24561% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.0%. Comparing base (a5b4f3f) to head (5a4e855).

Files with missing lines	Patch %	Lines
kube-runtime/src/finalizer.rs	98.3%	2 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##            main   #1994     +/-   ##
=======================================
+ Coverage   77.1%   78.0%   +1.0%     
=======================================
  Files         89      89             
  Lines       8927    9042    +115     
=======================================
+ Hits        6878    7052    +174     
+ Misses      2049    1990     -59

Files with missing lines	Coverage Δ
kube-runtime/src/controller/mod.rs	`32.2% <100.0%> (+0.8%)`	⬆️
kube-runtime/src/finalizer.rs	`90.1% <98.3%> (+90.1%)`	⬆️

... and 8 files with indirect coverage changes


clux · 2026-06-15T16:54:28Z

Ah, sorry about this. I missed this PR entirely. Going over now.

nightkr · 2026-06-15T17:19:12Z

I'm not really comfortable doing this kind of behaviour change implicitly... IMO it's fairly common practice to ~always return Action::requeue(couple of minutes), and I'm not sure I'd want to rely on nobody ever accidentally ending up doing it for finalizers too.

> This would re-queue immediately with no control over the delay. This increases load on the controller and likely spams error traces.

You could filter out those errors outside of the finalizer(), so that they inhibit finalization but aren't otherwise considered errors. That's admittedly not great for ergonomics, though.

(And either way, you do control that delay via error_policy.)
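A rough sketch of that suggestion, using a standalone model rather than the real kube types: `ReconcileError`, its variants, and `is_real_failure` are hypothetical, and `Action` is a simplified stand-in. In kube itself the `error_policy` callback also receives the object and the controller context, which are omitted here.

```rust
use std::time::Duration;

/// Simplified stand-in for kube's `Action` (assumption: not the real type).
#[derive(Debug, Clone, PartialEq)]
pub struct Action {
    requeue_after: Option<Duration>,
}

impl Action {
    /// Ask to be reconciled again after `after`.
    pub fn requeue(after: Duration) -> Self {
        Action { requeue_after: Some(after) }
    }
}

/// Hypothetical error type for a controller.
#[derive(Debug)]
pub enum ReconcileError {
    /// Cleanup has started but is not finished yet; not a real failure.
    CleanupPending,
    /// The API server could not be reached.
    ApiUnavailable,
}

/// The delay after an error is chosen here, per error kind.
pub fn error_policy(err: &ReconcileError) -> Action {
    match err {
        // Poll slowly while waiting on an external system.
        ReconcileError::CleanupPending => Action::requeue(Duration::from_secs(30)),
        // Retry genuine failures sooner.
        ReconcileError::ApiUnavailable => Action::requeue(Duration::from_secs(5)),
    }
}

/// Filter used before logging, so pending cleanups do not spam error traces.
pub fn is_real_failure(err: &ReconcileError) -> bool {
    !matches!(err, ReconcileError::CleanupPending)
}
```

The trade-off is the one noted above: the "still pending" state travels through the error channel, so the finalizer stays in place, but the caller has to remember to filter it out.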

> Block until the cleanup task is complete. This takes a reconciliation slot for a potentially very long time.

I'd say this feels more like a design problem for the reconciliation slot system in general. Perhaps ideally we'd have a way for a task to downgrade itself, and say "well, I'm still ongoing but this is going to take a while so don't count me for the semaphore anymore".

But then again, I'd also argue that not all finalizers are created equal. We effectively have three kinds of finalizers:

1. Basically instant: we just need to notify some external system.
2. We're waiting for some external notification that it was (hopefully) completed, so we should try again.
3. We have no idea how long it'll take, so we need to poll for completion on some interval.

kube::runtime::finalizer mostly models 1 today, and this PR is largely concerned with 3. Personally, I think 2 is the primary design space worth exploring, and it would largely give us 3 for free. ("Just" wait for tokio::sleep.)

clux

Thanks for this. Tests are really good. This part of the codebase was badly tested and the tower mock stuff is great for this.

This is potentially a breaking change for some, but only if a requeue is returned from the Cleanup branch. That is a small subset of people, and from my search on GitHub for usage patterns, I found that this change will help more people than it harms, because it aligns the behavior with the common expectation that a requeue (if requested) will happen.

I think we can inform people about this in the release.

clux · 2026-06-15T16:57:36Z

+            // A requeue means the cleanup is still in progress (e.g. waiting on a
+            // background process), so we keep the finalizer and let the controller
+            // re-run `Cleanup` later. Only `Action::await_change` signals that cleanup
+            // has finished and the finalizer can be removed.
+            if action.wants_requeue() {
+                return Ok(action);
+            }


The only scary part here is if people are using a blanket Action::requeue(5 minutes) in their reconcilers.

AFAICT this used to be a common pattern for controllers to have some safety against missed events, and if I am reading this correctly, it would mean we never actually remove the finalizer. We would have to communicate this in the release just in case.

I searched around a bit and could only find one controller getting this wrong:

https://github.com/FyraLabs/chisel-operator/blob/4b4c16e14b90019c43f77f59d3c37fb4d1ab4389/src/daemon.rs#L642-L645

What I find instead is more users returning Action::requeue inside the Event::Cleanup stage, which does not prevent the object from being deleted before the requeue starts. See e.g.:

kasmcloud

mayastor

As such, despite the small risk, I now think this is a net harm-reduction change because it allows the majority of these requeue requesters to work better.
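To make the two patterns concrete, here is a minimal standalone model, not code from any of the linked controllers. `Action` and `Event` are simplified stand-ins for the kube types (the real `Event` variants carry the object), and `Teardown`, `reconcile_blanket`, and `reconcile_explicit` are hypothetical names.

```rust
use std::time::Duration;

/// Simplified stand-in for kube's `Action` (assumption: not the real type).
#[derive(Debug, Clone, PartialEq)]
pub struct Action {
    requeue_after: Option<Duration>,
}

impl Action {
    pub fn requeue(after: Duration) -> Self {
        Action { requeue_after: Some(after) }
    }
    pub fn await_change() -> Self {
        Action { requeue_after: None }
    }
    pub fn wants_requeue(&self) -> bool {
        self.requeue_after.is_some()
    }
}

/// Simplified stand-in for the finalizer `Event` (payload omitted).
pub enum Event {
    Apply,
    Cleanup,
}

/// Hypothetical state of an external teardown the controller is waiting on.
pub enum Teardown {
    InProgress,
    Done,
}

/// Blanket pattern: always requeue, whatever the event.
/// With the proposed change, the `Cleanup` branch would keep the finalizer forever.
pub fn reconcile_blanket(_event: &Event) -> Action {
    Action::requeue(Duration::from_secs(300))
}

/// Explicit pattern: requeue only while cleanup is still running,
/// and return `await_change` once it is done so the finalizer can go.
pub fn reconcile_explicit(event: &Event, teardown: &Teardown) -> Action {
    match (event, teardown) {
        // Safety net against missed events during normal reconciliation.
        (Event::Apply, _) => Action::requeue(Duration::from_secs(300)),
        // Still waiting: poll again soon and keep the finalizer.
        (Event::Cleanup, Teardown::InProgress) => Action::requeue(Duration::from_secs(15)),
        // Finished: signal that the finalizer may be removed.
        (Event::Cleanup, Teardown::Done) => Action::await_change(),
    }
}
```

Controllers following the blanket pattern would need to move to something like the explicit one, which is the migration a release note would have to call out.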

clux · 2026-06-15T17:01:09Z

+    // With the finalizer already present and the object not being deleted, the finalizer runs
+    // `Apply` and passes its action straight through, without touching the API. The mock handle
+    // is dropped, so any request would error and fail the test.
+    #[tokio::test]
+    async fn apply_runs_reconcile_and_passes_action_through() {
+        let (mock_service, handle) = mock::pair::<Request<Body>, Response<Body>>();
+        drop(handle); // any API call now errors


These tests are great 👍

holding off

Fix: allow requeueing cleanup without removing finalizer and add tests

4bf9e9e

Signed-off-by: Alex Hunt <alex.hunt@materialize.com>

alex-hunt-materialize marked this pull request as ready for review June 4, 2026 12:51

alex-hunt-materialize mentioned this pull request Jun 4, 2026

Don't remove finalizer on reschedule MaterializeInc/k8s-controller#49

Open

Merge branch 'main' into finalizer_cleanup_actions

5a4e855

clux previously approved these changes Jun 15, 2026

alex-hunt-materialize wants to merge 2 commits into kube-rs:main from alex-hunt-materialize:finalizer_cleanup_actions
