Fix: allow requeueing cleanup without removing finalizer and add tests#1994

Open
alex-hunt-materialize wants to merge 2 commits into
kube-rs:mainfrom
alex-hunt-materialize:finalizer_cleanup_actions
@alex-hunt-materialize

Contributor

Allow requeueing cleanup without removing the finalizer.

Motivation

For long-running asynchronous cleanup tasks, the user previously had only two options:

  1. Return an error. This would re-queue immediately with no control over the delay. This increases load on the controller and likely spams error traces.
  2. Block until the cleanup task is complete. This takes a reconciliation slot for a potentially very long time.

There is also a serious foot-gun in the previous code. If a user returns an Ok(Action) from cleanup to re-queue the event, the finalizer gets removed immediately and the cleanup does not get re-invoked. There is nothing to indicate to the user that this is an incorrect use of the library. From the user's perspective, this should just re-queue and not remove the finalizer.

Solution

If the result of the cleanup function is Ok(Action { requeue_after: Some(Duration) }), short circuit before removing the finalizer to just return that action.
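As an illustration only, the decision described above can be modelled with a few local stand-in types. This is a minimal sketch, not the code from this PR: `Action` here is a hypothetical re-implementation that mirrors the names used in the description and diff (`requeue`, `await_change`, `wants_requeue`), and `Outcome` and `after_cleanup` are invented for the example.

```rust
use std::time::Duration;

/// Local stand-in for kube's `Action` (hypothetical; not the real type).
#[derive(Debug, Clone, PartialEq)]
pub struct Action {
    requeue_after: Option<Duration>,
}

impl Action {
    /// Ask to be reconciled again after `after`.
    pub fn requeue(after: Duration) -> Self {
        Action { requeue_after: Some(after) }
    }

    /// Wait for the next change to the object; no timed requeue.
    pub fn await_change() -> Self {
        Action { requeue_after: None }
    }

    /// True when a timed requeue was requested.
    pub fn wants_requeue(&self) -> bool {
        self.requeue_after.is_some()
    }
}

/// Hypothetical summary of what happens after the user's cleanup returns.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    /// Cleanup is still in progress: keep the finalizer, requeue later.
    KeepFinalizer(Action),
    /// Cleanup is done: the finalizer can be removed.
    RemoveFinalizer(Action),
}

/// Sketch of the proposed rule: errors propagate, a requeue short-circuits
/// before finalizer removal, anything else removes the finalizer.
pub fn after_cleanup<E>(result: Result<Action, E>) -> Result<Outcome, E> {
    let action = result?;
    if action.wants_requeue() {
        return Ok(Outcome::KeepFinalizer(action));
    }
    Ok(Outcome::RemoveFinalizer(action))
}
```

Under this model, a cleanup that returns `Action::requeue(..)` keeps the finalizer in place, while `Action::await_change()` signals completion. Before the change, both cases would have fallen through to finalizer removal.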

I also added some unit tests for the finalizer code.

Signed-off-by: Alex Hunt <alex.hunt@materialize.com>
@codecov

codecov Bot commented Jun 4, 2026


Codecov Report

❌ Patch coverage is 98.24561% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.0%. Comparing base (a5b4f3f) to head (5a4e855).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| kube-runtime/src/finalizer.rs | 98.3% | 2 Missing ⚠️ |

Additional details and impacted files
@@           Coverage Diff           @@
##            main   #1994     +/-   ##
=======================================
+ Coverage   77.1%   78.0%   +1.0%     
=======================================
  Files         89      89             
  Lines       8927    9042    +115     
=======================================
+ Hits        6878    7052    +174     
+ Misses      2049    1990     -59     

| Files with missing lines | Coverage Δ |
|---|---|
| kube-runtime/src/controller/mod.rs | 32.2% <100.0%> (+0.8%) ⬆️ |
| kube-runtime/src/finalizer.rs | 90.1% <98.3%> (+90.1%) ⬆️ |

... and 8 files with indirect coverage changes


@clux

clux commented Jun 15, 2026

Member

Ah, sorry about this. I missed this PR entirely. Going over now.

@nightkr

nightkr commented Jun 15, 2026

Member

I'm not really comfortable doing this kind of behaviour change implicitly... IMO it's fairly common practice to ~always return Action::requeue(couple of minutes), and I'm not sure I'd want to rely on nobody ever accidentally ending up doing it for finalizers too.

> This would re-queue immediately with no control over the delay. This increases load on the controller and likely spams error traces.

You could filter out those errors outside of the finalizer(), so that they inhibit finalization but aren't otherwise considered errors. That's admittedly not great for ergonomics, though.

(And either way, you do control that delay via error_policy.)
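One way to read the suggestion above is sketched here with local stand-in types. Everything in it is an assumption for illustration: `ReconcileError`, its `CleanupPending` variant, and `should_log_as_error` are hypothetical, `Action` is a local stand-in rather than kube's type, and the real `error_policy` callback also receives the object and a context, which this sketch drops.

```rust
use std::time::Duration;

/// Local stand-in for kube's `Action` (hypothetical; not the real type).
#[derive(Debug, Clone, PartialEq)]
pub struct Action {
    requeue_after: Option<Duration>,
}

impl Action {
    /// Ask to be reconciled again after `after`.
    pub fn requeue(after: Duration) -> Self {
        Action { requeue_after: Some(after) }
    }
}

/// Hypothetical controller error type for this sketch.
#[derive(Debug)]
pub enum ReconcileError {
    /// Cleanup has started but is not finished; not a real failure.
    CleanupPending { retry_in: Duration },
    /// Any genuine failure.
    Other(String),
}

/// Simplified error policy: the error itself decides the retry delay.
pub fn error_policy(err: &ReconcileError) -> Action {
    match err {
        // "Still cleaning up" carries its own polling interval.
        ReconcileError::CleanupPending { retry_in } => Action::requeue(*retry_in),
        // Genuine failures retry after a short fixed delay (arbitrary value).
        ReconcileError::Other(_) => Action::requeue(Duration::from_secs(5)),
    }
}

/// Filter applied outside the finalizer helper, so that pending cleanups
/// inhibit finalization without being reported as errors.
pub fn should_log_as_error(err: &ReconcileError) -> bool {
    !matches!(err, ReconcileError::CleanupPending { .. })
}
```

The trade-off mentioned in the comment is visible here: the "not yet done" state travels through the error channel, so every consumer of errors has to know to filter it out.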

> Block until the cleanup task is complete. This takes a reconciliation slot for a potentially very long time.

I'd say this feels more like a design problem for the reconciliation slot system in general. Perhaps ideally we'd have a way for a task to downgrade itself, and say "well, I'm still ongoing but this is going to take a while so don't count me for the semaphore anymore".

But then again, I'd also argue that not all finalizers are created equal. We effectively have three kinds of finalizers:

  1. Basically instant, we just need to notify some external system
  2. We're waiting for some external notification that it was (hopefully) completed, so we should try again
  3. We have no idea how long it'll take, so we need to poll for completion on some interval

kube::runtime::finalizer mostly models 1 today, and this PR is largely concerned with 3. Personally, I think 2 is the primary design space worth exploring, and it would largely give us 3 for free. ("Just" wait for tokio::time::sleep.)

clux previously approved these changes Jun 15, 2026

@clux clux left a comment

Member


Thanks for this. Tests are really good. This part of the codebase was badly tested and the tower mock stuff is great for this.

This is potentially a breaking change for some, but only when a requeue is returned from the Cleanup branch. That is a small subset of users, and from searching GitHub for this pattern, I found that the change will help more people than it harms, because it aligns the behavior with the common expectation that a requested requeue will actually happen.

I think we can inform people about this in the release.

Comment on lines +164 to +170
// A requeue means the cleanup is still in progress (e.g. waiting on a
// background process), so we keep the finalizer and let the controller
// re-run `Cleanup` later. Only `Action::await_change` signals that cleanup
// has finished and the finalizer can be removed.
if action.wants_requeue() {
    return Ok(action);
}

Member


The only scary part here is if people are using a blanket Action::requeue(5 minutes) in their reconcilers.

AFAICT this used to be a common pattern for controllers, as a safety net against missed events, and if I am reading this correctly, it would mean we never actually remove the finalizer. We would have to communicate this in the release just in case.
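The concern can be made concrete with a small model. This is a sketch under stated assumptions, not kube code: `Action` and `Event` are local stand-ins, the two reconcilers and `finalizer_removed_after_cleanup` are hypothetical, and the helper only encodes the rule from the diff snippet above (a requeue from `Cleanup` keeps the finalizer).

```rust
use std::time::Duration;

/// Local stand-in for kube's `Action` (hypothetical; not the real type).
#[derive(Debug, Clone, PartialEq)]
pub struct Action {
    requeue_after: Option<Duration>,
}

impl Action {
    pub fn requeue(after: Duration) -> Self {
        Action { requeue_after: Some(after) }
    }

    pub fn await_change() -> Self {
        Action { requeue_after: None }
    }

    pub fn wants_requeue(&self) -> bool {
        self.requeue_after.is_some()
    }
}

/// Simplified stand-in for the finalizer event; the object payload is omitted.
pub enum Event {
    Apply,
    Cleanup,
}

/// Blanket pattern: every branch requeues after five minutes.
pub fn blanket_reconcile(_event: &Event) -> Action {
    Action::requeue(Duration::from_secs(300))
}

/// Adjusted pattern: keep the periodic requeue for `Apply`, but signal
/// completion from `Cleanup` with `await_change`.
pub fn adjusted_reconcile(event: &Event) -> Action {
    match event {
        Event::Apply => Action::requeue(Duration::from_secs(300)),
        Event::Cleanup => Action::await_change(),
    }
}

/// Under the proposed rule, the finalizer is removed only when the
/// cleanup branch does not ask for a requeue.
pub fn finalizer_removed_after_cleanup(reconcile: fn(&Event) -> Action) -> bool {
    !reconcile(&Event::Cleanup).wants_requeue()
}
```

In this model the blanket reconciler never releases the finalizer, which is the case a release note would need to call out. The adjusted reconciler is unaffected.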

Member


I searched around a bit and could only find one controller getting this wrong.

What I find instead is more users calling Action::requeue inside the Event::Cleanup stage, which does not prevent the object from being deleted before the requeue fires. See e.g.;

As such, despite the small risk, I now think this is a net harm-reduction change, because it lets the majority of these requeue requesters work as they intended.

Comment on lines +376 to +382
// With the finalizer already present and the object not being deleted, the finalizer runs
// `Apply` and passes its action straight through, without touching the API. The mock handle
// is dropped, so any request would error and fail the test.
#[tokio::test]
async fn apply_runs_reconcile_and_passes_action_through() {
    let (mock_service, handle) = mock::pair::<Request<Body>, Response<Body>>();
    drop(handle); // any API call now errors

Member


These tests are great 👍

Comment thread kube-runtime/src/finalizer.rs
@clux clux dismissed their stale review June 15, 2026 17:53

holding off

3 participants