Asynchronous Inadmissible Workload Requeueing #9232

Open

gabesaba wants to merge 3 commits into kubernetes-sigs:main from gabesaba:async_inadmissible

Conversation

@gabesaba (Contributor) commented Feb 13, 2026

What type of PR is this?

/kind bug

What this PR does / why we need it:

We decouple requesting inadmissible workloads to be reprocessed,
from the processing of these requests where we move the workloads.

This has three main advantages:

  1. Allows batching requeues, avoiding the starvation noted here.
  2. Reduces lock contention.
  3. Avoids spamming requeues 20+ times per second.

We accomplish this by defining the inadmissibleWorkloadRequeuer,
which implements the requeueInadmissibleListener interface.

Any requests to requeue inadmissible workloads must go through this interface.
Then, the requeuer will process these requests in batches. The requeuer
deduplicates requests to the same ClusterQueue/Root Cohort, further reducing
duplicate reprocessing.

Which issue(s) this PR fixes:

Fixes #8095

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Scheduling: Fix the bug where inadmissible workloads would be re-queued too frequently at scale.
This resulted in excessive processing, lock contention, and starvation of workloads deeper in the queue.
The fix is to throttle the process with a batch period of 1s per CQ or Cohort.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 13, 2026
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gabesaba

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 13, 2026
@gabesaba (Author):

@sohankunkerkar, can you drive this review please?

netlify bot commented Feb 13, 2026:

Deploy Preview for kubernetes-sigs-kueue canceled.

Latest commit: 50c86d6
Latest deploy log: https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/699329c78099d600087ad65a

@mwielgus (Contributor) left a comment:

How often are the inadmissible workloads requeued after this change?

@sohankunkerkar (Member) left a comment:

Thanks for working on this @gabesaba. The approach of decoupling notification from processing via TypedController[requeueRequest] with AddAfter directly targets the QueueInadmissibleWorkloads churn identified in #8095.
As @mwielgus mentioned, it would really help if you could provide some numbers showing the performance improvement.

Note on testing the batching behavior: the integration tests use a 1ms batch period, which effectively makes them synchronous and doesn't exercise the actual batching/deduplication.

In client-go's delaying queue implementation, deduplication only happens while the item is in the waiting heap. With a 1ms window, items mature to the active queue almost instantly, meaning a stream of events spaced >1ms apart will result in multiple reconciles instead of one. A unit test with a controlled clock, or a longer batch period, is needed to exercise this scenario.

func (r *inadmissibleWorkloadRequeuer) setupWithManager(mgr ctrl.Manager) error {
	return builder.TypedControllerManagedBy[requeueRequest](mgr).
		Named("inadmissible_workload_requeue_controller").
		WatchesRawSource(source.TypedChannel(r.eventCh, &inadmissibleWorkloadRequeuer{})).
Member:

inadmissibleWorkloadRequeuer{} is a zero-value instance and its batchPeriod is 0. The Generic handler calls q.AddAfter(e.Object, r.batchPeriod), and client-go's AddAfter with duration <= 0 falls through to q.Add(item) immediately with no delay.
This means the 1s batch period defined here is never applied. Should this be r instead?

Author:

Good catch, it should definitely be r. I'll add an integration test too, to make sure we have this code under test.

// q.AddAfter will process this so fast that it is not necessary.
// LLM review suggested this to derisk deadlock (during startup?), but I don't
// see this risk.
eventCh: make(chan event.TypedGenericEvent[requeueRequest]),
Member:

I looked at controller-runtime's source.TypedChannel internals: syncLoop starts a goroutine that continuously reads from the user channel and writes to an internal 1024-buffered dst channel. All callers (NotifyRetryInadmissible, notifyRetryInadmissibleWithoutLock, AddOrUpdateCohort) are reconciler goroutines that only run after mgr.Start() has started the source, so the unbuffered channel looks safe with the current TypedChannel startup ordering.

Still, for future DRA/WAS/TAS growth, should we prefer a small bounded buffer here as the default resilience pattern for internal event fan-in? We would still keep the dedup semantics in AddAfter with the keyed requeueRequest.
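The buffered-versus-unbuffered question boils down to whether a notifier can complete a send before any consumer is running. A minimal stdlib sketch (`bufferedFanIn` is invented for illustration; the thread settled on a buffer of 128):

```go
package main

import "fmt"

// bufferedFanIn sends n events into a channel with the given buffer
// size before any receiver exists, then drains them. With buf >= n the
// sends never block; with an unbuffered channel (buf == 0) the first
// send would block forever here, because no receiver has started yet -
// which is the startup-ordering hazard a small bounded buffer removes.
func bufferedFanIn(buf, n int) int {
	events := make(chan string, buf)
	for i := 0; i < n; i++ {
		events <- fmt.Sprintf("requeue-%d", i)
	}
	close(events)
	received := 0
	for range events {
		received++
	}
	return received
}

func main() {
	fmt.Println(bufferedFanIn(128, 5)) // a burst of 5 is absorbed without blocking
}
```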

Author:

Thanks for the analysis, and this sgtm. How does 128 sound to you?

Member:

Yeah, that sounds good!

}

// inadmissibleWorkloadRequeuer is responsible for receiving notifications,
// and requeuering workloads as a result of these notifications.
Member:

s/requeuering/requeuing/g

reportMetrics(m, cqImpl.name)

if queued || addedWorkloads {
if addedWorkloads {
Member:

I assume this is the intentional tradeoff for batching?
This is fine if intentional, but it's worth explicitly documenting as a throughput-over-latency tradeoff.

Author:

Related to #9232 (comment). Broadcast will be a no-op until the workloads are moved. I added a broadcast call in requeueWorkloadsCQ and requeueWorkloadsCohort.

// or otherwise risk encountering an infinite loop if a Cohort
// cycle is introduced.
func requeueWorkloadsCohort(ctx context.Context, m *Manager, cohort *cohort) bool {
// RequeueCohort moves all inadmissibleWorkloads in
Member:

It looks like we can remove this information.

Author:

Updated the comment, combining information from both parts

batchPeriod time.Duration
}

func newInadmissibleWorkloadReconciler(qManager *Manager) *inadmissibleWorkloadRequeuer {
Member:

The constructor says "Reconciler", type says "Requeuer". Pick one. Since the type is inadmissibleWorkloadRequeuer, the constructor should be newInadmissibleWorkloadRequeuer.


// NewManagerForIntegrationTests is a factory for Integration Tests, setting the
// batch period to a much lower value (requeueBatchPeriodIntegrationTests).
func NewManagerForIntegrationTests(client client.Client, checker StatusChecker, options ...Option) *Manager {
Contributor:

Do we really need these functions? I would argue against introducing such functions for testing.

It is a cleaner pattern to parametrize the value by passing the Options. Here you could expose a param like BatchInterval.

For example this is how controller-runtime exposes configuration for manager's RetryPeriod and other options.

Author:

How about adding the options to NewManager, plus a factory that resides in the integration tests and sets them up with certain options (such as a smaller batching period)?

Contributor:

+1, yes generic NewManager in the production code, and NewManagerForIntegrationTests for the test packages sgtm

}

// NewManagerForUnitTestsWithRequeuer creates a new Manager for testing purposes, pre-configured with a testInadmissibleWorkloadRequeuer.
func NewManagerForUnitTestsWithRequeuer(client client.Client, checker StatusChecker, options ...Option) (*Manager, *testInadmissibleWorkloadRequeuer) {
Contributor:

For unit tests it is slightly more acceptable, but also I would argue against it. If we need to have a dedicated parametrization in unit tests we can use Options to the standard NewManager.

@mimowo (Contributor), Feb 16, 2026:

For example, I'm wondering if we could pass clock when constructing the manager (might be optional parameter, real clock by default), and just move the clock by one second in unit tests. This aligns more with the current approach to time-based testing.

Author:

For the new controller to work, it needs to be wired with ctrl.Manager, and these requeues will be processed in a different go-routine. I don't think this is the right choice for unit tests.

Contributor:

For the new controller to work, it needs to be wired with ctrl.Manager, and these requeues will be processed in a different go-routine. I don't think this is the right choice for unit tests.

Hm, let me double-check why you consider this not the right choice? Actually, for unit tests it is great to be able to control the passage of time. Otherwise we risk flakes, which is why we tend to use fake clocks for most of our unit testing.

I remember this is what we do in core k8s when using workqueues: just advance time, and the workqueue mechanics trigger a goroutine that runs the reconcile. So, I think what is needed is to call NewControllerManagedBy and customize the clock.

Let me know if there are reasons why this is not a valid approach. It is not a blocker for me; it just feels more natural, safer (to avoid flakes), and consistent with the rest of the codebase.

Author:

This still requires setting up a controller-runtime ctrl.Manager in the unit tests to call NewControllerManagedBy (or TypedControllerManagedBy). This seems to me deep in the territory of integration tests, especially since existing unit tests don't have a ctrl.Manager. Even if this were not required, we'd be relying on processing in goroutines and would need some synchronization to ensure the tests are not flaky.

I think that this lightweight test object makes more sense for unit tests, and covering the behavior of the async controller should be handled in integration tests (to be expanded as per #9232 (comment))

@mimowo (Contributor), Feb 17, 2026:

I can see this is a matter of taste. Indeed, spawning a goroutine is what we do in unit tests for the k8s controllers, for example here. However, there it is a bit more lightweight, as the basic workqueue is used rather than the entire controller-runtime manager.

I think that this lightweight test object makes more sense for unit tests, and covering the behavior of the async controller should be handled in integration tests (to be expanded as per #9232 (comment))

Yes, but the drawback is leaking test-only functions into prod code (at least in the initial implementation), maybe this is solvable by other means.

In any case, I'm ok with the pattern as is, it is not a blocker, just something I wanted to explore as we go into the territory of using controller-runtime reconcilers for internal code.

I think pausing to consider how to do this well is time well spent. I'm still a bit hesitant: even if you wire this all up now, there will be a learning curve for how to wire things up in the future. I think it might introduce a complexity barrier for new developers in the community. I basically thought that reusing the existing controller-runtime machinery would make the unit tests easier to write, but I'm not stubborn here.

Let me know if you have indeed considered the consequences and think this is the better decision; if so, I'm ok with that.

utiltestingapi.MakeWorkload("a", "moon").Queue("foo").Obj(),
)
manager := NewManager(kClient, nil)
manager := NewManagerForUnitTests(kClient, nil)
Contributor:

I don't like how the diff is bloated by this change. Rather than impacting all tests, can we introduce the new batch period behind a dedicated feature gate, and only add a set of dedicated tests?

Author:

I created a prep PR, to reduce the diff: #9224

In the feature flag case, are you suggesting maintaining both branches - before and after decoupling request requeueing from processing requeueing?

Contributor:

In the feature flag case, are you suggesting maintaining both branches - before and after decoupling request requeueing from processing requeueing?

Yeah, good question. I think the amount of changes looks scary for the PR to be safe without a FG. If we can bring it down to reviewable state by prep PRs, maybe we can drop the FG.

If we go with the FG approach, I imagine it goes to Beta directly (enabled by default), and we only have a small portion of tests exercising the FG disabled. Then we graduate the FG after 2 releases. Basically, we keep FG=false just as a bailout option in case something goes wrong.

continue
}
processedRoots.Insert(rootName)
}
Contributor:

Is it a drive-by cleanup, or something necessary for the PR?

Author:

Drive-by cleanup. I think I can revert this, but later on this simplification makes sense, as we collapse keys in the requeuer and adds are cheap.

Contributor:

Let's decouple this into a separate PR.

Author:

Ack. I will submit as a follow-up PR

for _, clusterQueue := range cohort.ChildCQs() {
queued = queueInadmissibleWorkloads(ctx, clusterQueue, m.client) || queued
if queueInadmissibleWorkloads(ctx, clusterQueue, m.client) {
reportMetrics(m, clusterQueue.name)
Contributor:

It doesn't seem like we called reportMetrics here before, so the change looks unrelated. Was it a bug before, is this some non-obvious drive-by cleanup, or is it really needed now?

Author:

Previously we were calling requeue, and then reporting metrics if anything was moved. Now that we decouple notifying requeue from the actual requeueing, we need to update metrics in the latter step.

Contributor:

sgtm, would you mind introducing reportMetrics in a prep PR?

Author:

Renamed as reportPendingWorkloads in that PR

Contributor:

Can be resolved now.

@mimowo (Contributor) commented Feb 16, 2026:

Let me propose rephrasing with a clear prefix and a clarification of what the batch period is. Feel free to adjust, but I wanted to emphasize the direction in communicating to the end user.
/release-note-edit

Scheduling: Fix the bug where inadmissible workloads would be re-queued too frequently at scale.
This resulted in excessive processing, lock contention, and starvation of workloads deeper in the queue.
The fix is to throttle the process with a batch period of 1s per CQ or Cohort.

@mimowo (Contributor) commented Feb 16, 2026:

LGTM overall, the main comments from me:

  1. It would be better to see if we can eliminate the custom functions for unit and integration tests. I imagine in unit tests we could use a custom clock to trigger the Reconcile.
  2. I'm wondering about a feature gate, but for 1s batching I'm leaning toward thinking it is not needed. For larger delays like 5s or 30s I would find it necessary, but we can consider that in the future.
  3. The PR is already large and we are going to cherry-pick, so offloading the diff into prep PRs, like the drive-by cleanups, is preferred.

@k8s-ci-robot (Contributor):

@gabesaba: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-kueue-test-integration-baseline-main
Commit: 50c86d6
Required: true
Rerun command: /test pull-kueue-test-integration-baseline-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 16, 2026
@k8s-ci-robot (Contributor):

PR needs rebase.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

// we should inform clusterQueue controller to broadcast the event.
if cqNames := r.cache.AddOrUpdateResourceFlavor(log, e.Object.DeepCopy()); len(cqNames) > 0 {
qcache.QueueInadmissibleWorkloads(context.Background(), r.qManager, cqNames)
qcache.NotifyRetryInadmissible(r.qManager, cqNames)
Contributor:

Nice to see these uses of Background context eliminated 👍


Development

Successfully merging this pull request may close these issues.

Fix throughput issues at scale for BestEffortFIFO CQs with CPU and GPU flavors

5 participants