proposal: Multi-Cluster Queue Management by shellfish007 · Pull Request #7485 · karmada-io/karmada

shellfish007 · 2026-05-08T02:08:08Z

Summary

This PR proposes opt-in per-tenant queue sharding for Karmada's existing scheduler queue system, enabling multi-tenant isolation without introducing new heavyweight abstractions.

Karmada's scheduler currently maintains three internal queues (activeQ, backoffQ, unschedulableBindings) as global singletons. Under this proposal, a namespace that creates a TenantQueue gets its own isolated set of queues; namespaces without one continue to share a global default queue, which keeps the change backward compatible.

API

TenantQueue is namespace-scoped and must use the singleton name "queue". A validating webhook rejects objects with any other name.

```yaml
apiVersion: scheduling.karmada.io/v1alpha1
kind: TenantQueue
metadata:
  name: queue
  namespace: team-a
spec:
  queueingStrategy: StrictFIFO  # or BestEffortFIFO (default)
```
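As a rough illustration of the two constraints in this API (the fixed object name and the defaulted strategy), the admission check could look like the Go sketch below. The helper names `validateTenantQueueName` and `effectiveStrategy` are hypothetical and do not come from the PR; only the name `queue` and the two strategy values are taken from the proposal.

```go
package main

import "fmt"

// singletonName is the only TenantQueue name the proposal allows per namespace.
const singletonName = "queue"

// Strategy values named in the proposal.
const (
	bestEffortFIFO = "BestEffortFIFO"
	strictFIFO     = "StrictFIFO"
)

// validateTenantQueueName is a hypothetical helper: it rejects any name
// other than the singleton, as the validating webhook is described as doing.
func validateTenantQueueName(name string) error {
	if name != singletonName {
		return fmt.Errorf("TenantQueue must be named %q, got %q", singletonName, name)
	}
	return nil
}

// effectiveStrategy is a hypothetical helper: an empty field falls back to
// BestEffortFIFO, and unknown values are rejected.
func effectiveStrategy(s string) (string, error) {
	switch s {
	case "":
		return bestEffortFIFO, nil
	case bestEffortFIFO, strictFIFO:
		return s, nil
	default:
		return "", fmt.Errorf("unknown queueingStrategy %q", s)
	}
}
```

Calling `validateTenantQueueName("team-a-queue")` returns an error, while `effectiveStrategy("")` yields `BestEffortFIFO`, matching the default noted in the manifest comment.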

Scheduler Changes

The scheduler maintains a TenantSchedulingQueue that wraps multiple prioritySchedulingQueue instances: one per namespace that has a TenantQueue, plus a shared __default__ queue:

```
TenantSchedulingQueue
  ├── "team-a"    → prioritySchedulingQueue{activeQ, backoffQ, unschedulableBindings} [StrictFIFO]
  ├── "team-b"    → prioritySchedulingQueue{...} [BestEffortFIFO]
  └── __default__ → prioritySchedulingQueue{...}
```
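The routing implied by this layout can be sketched in a few lines of Go. This is an assumption-laden illustration, not the PR's code: `queueKey` and the `tenantQueues` set are hypothetical names, and the set stands in for whatever informer or lister the scheduler would use to know which namespaces have a TenantQueue.

```go
package main

// defaultQueueKey is the shard shared by namespaces without a TenantQueue.
const defaultQueueKey = "__default__"

// queueKey returns the shard a binding is routed to (hypothetical helper).
// namespace is empty for cluster-scoped ClusterResourceBinding objects,
// which the proposal also routes to the default queue.
func queueKey(namespace string, tenantQueues map[string]bool) string {
	if namespace != "" && tenantQueues[namespace] {
		return namespace
	}
	return defaultQueueKey
}
```

With `tenantQueues = {"team-a": true}`, a binding in `team-a` maps to the `team-a` shard, while bindings in any other namespace, or with no namespace at all, map to `__default__`.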

Pop() visits tenant queues in round-robin order so that no single tenant monopolizes scheduling turns. Within a queue, bindings are ordered by priority descending, then by enqueue timestamp ascending.
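The combination of round-robin across tenants and priority ordering within a tenant can be sketched as follows. All type and function names here (`binding`, `tenantQueues`, `push`, `pop`) are hypothetical; the sketch keeps a sorted slice per tenant for brevity and ignores locking, backoff, and the unschedulable set, all of which a real scheduler queue would need.

```go
package main

import "sort"

// binding is a simplified stand-in for a queued ResourceBinding.
type binding struct {
	name     string
	priority int32
	enqueued int64 // enqueue timestamp; smaller means earlier
}

// less orders by priority descending, then enqueue timestamp ascending.
func less(a, b binding) bool {
	if a.priority != b.priority {
		return a.priority > b.priority
	}
	return a.enqueued < b.enqueued
}

// tenantQueues is a much-reduced, hypothetical TenantSchedulingQueue:
// one sorted slice per tenant plus a round-robin cursor.
type tenantQueues struct {
	order  []string             // tenant keys in round-robin order
	queues map[string][]binding // per-tenant active bindings, kept sorted
	next   int                  // index of the tenant that gets the next turn
}

func newTenantQueues() *tenantQueues {
	return &tenantQueues{queues: map[string][]binding{}}
}

// push adds b to the tenant's queue and keeps that queue sorted.
func (t *tenantQueues) push(tenant string, b binding) {
	if _, ok := t.queues[tenant]; !ok {
		t.order = append(t.order, tenant)
	}
	q := append(t.queues[tenant], b)
	sort.SliceStable(q, func(i, j int) bool { return less(q[i], q[j]) })
	t.queues[tenant] = q
}

// pop starts at the cursor, returns the head of the first non-empty
// tenant queue, and advances the cursor past that tenant.
func (t *tenantQueues) pop() (binding, bool) {
	for i := 0; i < len(t.order); i++ {
		idx := (t.next + i) % len(t.order)
		tenant := t.order[idx]
		if q := t.queues[tenant]; len(q) > 0 {
			t.queues[tenant] = q[1:]
			t.next = (idx + 1) % len(t.order)
			return q[0], true
		}
	}
	return binding{}, false
}
```

In this sketch a tenant with many queued bindings still yields after each pop, so a burst in one namespace delays another namespace by at most one turn per cycle.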

Key Points

- Supports BestEffortFIFO (skip an unschedulable head and try the next binding) and StrictFIFO (head-of-line blocking per tenant)
- Backward compatible: guarded by the feature gate TenantQueueManagement (alpha, disabled by default)
- Singleton name "queue" enforced by a validating webhook
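To make the difference between the two strategies concrete, here is a minimal Go sketch of how a tenant's next candidate might be chosen. The function `pickNext` and its `schedulable` callback are hypothetical; only the `QueueingStrategy` type and the two constant names mirror the proposal.

```go
package main

// QueueingStrategy mirrors the type name used in the proposal.
type QueueingStrategy string

const (
	BestEffortFIFO QueueingStrategy = "BestEffortFIFO"
	StrictFIFO     QueueingStrategy = "StrictFIFO"
)

// pickNext returns the index of the binding a tenant would try next,
// or -1 if nothing can be tried right now (hypothetical helper).
// queue is assumed to be already ordered; schedulable reports whether
// a binding currently fits.
func pickNext(queue []string, strategy QueueingStrategy, schedulable func(string) bool) int {
	for i, b := range queue {
		if schedulable(b) {
			return i
		}
		if strategy == StrictFIFO {
			// The head is blocked, so the whole tenant queue waits.
			return -1
		}
		// BestEffortFIFO: skip this binding and try the next one.
	}
	return -1
}
```

Given a queue `["big", "small"]` where only `small` fits, BestEffortFIFO selects index 1 while StrictFIFO returns -1, which is the throughput versus ordering trade-off the proposal describes.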

Non-Goals

- Changes to the backoffQ or unschedulableBindings data structures themselves
- Per-tenant backoff and unschedulable timeout tuning
- Weighted round-robin (planned for a future phase)

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a design proposal for implementing per-tenant queue sharding within the Karmada scheduler. By shifting from global queues to namespace-scoped queues, the system aims to improve multi-tenant isolation, prevent burst monopolization, and allow for configurable ordering strategies like StrictFIFO. The proposal outlines the API design, scheduling logic, and a phased implementation plan to ensure stability and backward compatibility.

Highlights

- Proposal Introduction: Introduced a new design proposal for Multi-Cluster Queue Management in Karmada to enable per-tenant scheduling isolation.
- TenantQueue API: Defined a new namespace-scoped TenantQueue API (scheduling.karmada.io/v1alpha1) to allow per-namespace queue configuration.
- Queue Sharding: Proposed refactoring the scheduler to support per-tenant queue sharding, moving away from global singleton queues.
- Scheduling Strategies: Introduced BestEffortFIFO and StrictFIFO ordering modes to provide flexibility for different workload requirements.

karmada-bot · 2026-05-08T02:08:17Z

Welcome @shellfish007! It looks like this is your first PR to karmada-io/karmada 🎉

gemini-code-assist

Code Review

This pull request introduces per-tenant queue sharding to Karmada's scheduler through a new namespace-scoped TenantQueue API, supporting both BestEffortFIFO and StrictFIFO strategies to improve multi-tenant isolation. Feedback includes requests to document exported types and constants per the repository style guide. Additionally, suggestions were made to improve isolation for default tenants through automatic sharding and to simplify resource management by using a singleton name for TenantQueue objects instead of relying on validation webhooks.

gemini-code-assist · 2026-05-08T02:09:21Z

+    QueueingStrategy QueueingStrategy `json:"queueingStrategy,omitempty"`
+}
+
+type QueueingStrategy string


According to the repository style guide (line 7) and standard Go best practices, all exported types should be documented. Please add a concise comment describing the purpose of QueueingStrategy.

Suggested change

-type QueueingStrategy string
+// QueueingStrategy defines the strategy for ordering and blocking bindings in the active queue.
+type QueueingStrategy string

References

All exported functions, methods, structs, and interfaces must be documented with clear and concise comments describing their purpose and behavior.

gemini-code-assist · 2026-05-08T02:09:21Z

+    BestEffortFIFO QueueingStrategy = "BestEffortFIFO"
+    StrictFIFO     QueueingStrategy = "StrictFIFO"
+)


Exported constants should be documented to adhere to the repository style guide (line 7) and standard Go best practices.

Suggested change

-    BestEffortFIFO QueueingStrategy = "BestEffortFIFO"
-    StrictFIFO     QueueingStrategy = "StrictFIFO"
-)
+    // BestEffortFIFO indicates that if the head binding fails, the next one is tried.
+    BestEffortFIFO QueueingStrategy = "BestEffortFIFO"
+    // StrictFIFO indicates that if the head binding fails, the entire queue is blocked.
+    StrictFIFO QueueingStrategy = "StrictFIFO"
+)

References

All exported functions, methods, structs, and interfaces must be documented with clear and concise comments describing their purpose and behavior.

gemini-code-assist · 2026-05-08T02:09:21Z

+# Another namespace uses the default (BestEffortFIFO), no TenantQueue needed
+```
+
+Namespaces without a `TenantQueue` — as well as all `ClusterResourceBinding` objects (which have no namespace) — are routed to a built-in `__default__` queue that always uses `BestEffortFIFO`. The default queue participates in the same round-robin as named tenant queues, getting one scheduling turn per cycle.


The proposal states that all namespaces without a TenantQueue are routed to a single __default__ queue. This design does not provide isolation between these 'default' tenants, which may conflict with the goal of 'per-tenant isolation'.

Consider automatically sharding by namespace name by default for all ResourceBinding objects. The TenantQueue resource would then serve as an optional configuration for these per-namespace shards, rather than a prerequisite for isolation.

gemini-code-assist · 2026-05-08T02:09:21Z

+### Phase 3: Stabilization (Beta)
+
+1. Promote `TenantQueue` API to `v1beta1`.
+2. Add validation webhooks (reject multiple `TenantQueue` objects per namespace).


Instead of implementing a validation webhook to restrict the number of TenantQueue objects per namespace, it is more idiomatic in Kubernetes to enforce a singleton name (such as default) for the resource. This simplifies discovery and avoids the need for cross-object validation logic.

Copilot

Pull request overview

Adds a new scheduling proposal describing “Multi-Cluster Queue Management” via per-namespace (tenant) sharded scheduler queues, aiming to improve multi-tenant isolation and fairness without introducing heavier queue abstractions.

Changes:

- Introduces a new proposal document for per-tenant queue sharding in the scheduler (active/backoff/unschedulable queues).
- Specifies a new namespaced TenantQueue API concept with BestEffortFIFO and StrictFIFO modes.
- Describes a Kueue-inspired “heads” collection pattern for cross-tenant fairness.

+- **`unschedulableBindings`** — bindings that could not be scheduled and are awaiting a cluster state change
+
+Today these three queues are global singletons. This proposal makes them **per-tenant**, and introduces a namespace-scoped `TenantQueue` API object that configures queue settings for a namespace. Since tenant = namespace = `FederatedResourceQuota` scope, no separate namespace selector is needed — one `TenantQueue` per namespace governs the queue behavior for all `ResourceBinding` objects in that namespace.
+
+---
+
+## Motivation
+


+```go
+// +genclient
+// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object
+// +kubebuilder:resource:path=tenantqueues,scope=Namespaced,shortName=tq,categories={karmada-io}


+
+HOL blocking is tracked via a `blocked bool` flag on the tenant entry. The flag is cleared by an `onActiveQPush` callback on the inner queue, which fires whenever a binding is moved back to `activeQ` (backoff expiry, unschedulable flush, cluster state change).
+
+---
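The blocking flag and its reset callback described in the hunk above can be illustrated with a small Go sketch. This is a sketch under stated assumptions: `innerQueue`, `tenantEntry`, `pushActive`, and `markHeadUnschedulable` are hypothetical names, and only the `blocked` flag and the `onActiveQPush` callback come from the proposal text. Synchronization is omitted, although a real scheduler queue would need it.

```go
package main

// innerQueue is a hypothetical stand-in for the per-tenant
// prioritySchedulingQueue; only activeQ is modeled.
type innerQueue struct {
	activeQ       []string
	onActiveQPush func() // invoked whenever a binding enters activeQ
}

// pushActive models a binding moving back to activeQ (for example after
// backoff expiry or a cluster state change) and fires the callback.
func (q *innerQueue) pushActive(name string) {
	q.activeQ = append(q.activeQ, name)
	if q.onActiveQPush != nil {
		q.onActiveQPush()
	}
}

// tenantEntry carries the head-of-line blocking flag for one tenant.
type tenantEntry struct {
	blocked bool
	queue   *innerQueue
}

// newTenantEntry wires the callback so that any push to activeQ
// clears the blocked flag.
func newTenantEntry() *tenantEntry {
	e := &tenantEntry{queue: &innerQueue{}}
	e.queue.onActiveQPush = func() { e.blocked = false }
	return e
}

// markHeadUnschedulable sets the flag when a StrictFIFO head fails to schedule.
func (e *tenantEntry) markHeadUnschedulable() { e.blocked = true }
```

A round-robin pop would skip entries whose `blocked` flag is set, and the tenant rejoins the rotation as soon as `pushActive` fires the callback.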


+| Throughput | Higher | Lower (head-of-line blocking) |
+| Ordering guarantee | Best effort | Deterministic within tenant |
+| Typical use case | Interactive / heterogeneous batch | Sequential pipelines, strict ordering |
+
+---
+
+## Design Notes
+


- Add doc comments to QueueingStrategy type and constants
- Clarify queue isolation is opt-in (namespaces without TenantQueue share default)
- Enforce singleton name 'queue' instead of validation webhook
- Fix "creation timestamp" to "enqueue timestamp" for ordering semantics

codecov-commenter · 2026-05-08T02:58:07Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 41.93%. Comparing base (774db9b) to head (a87ce01).
⚠️ Report is 28 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #7485      +/-   ##
==========================================
- Coverage   42.16%   41.93%   -0.24%     
==========================================
  Files         876      879       +3     
  Lines       64968    54328   -10640     
==========================================
- Hits        27395    22780    -4615     
+ Misses      35874    29826    -6048     
- Partials     1699     1722      +23

Flag	Coverage Δ
unittests	`41.93% <ø> (-0.24%)`	⬇️

seanlaii · 2026-05-12T13:50:18Z

/assign

mszacillo · 2026-05-12T14:00:51Z

/assign

Copilot AI review requested due to automatic review settings May 8, 2026 02:08

karmada-bot requested review from Tingtal and zhzhuang-zju May 8, 2026 02:08

karmada-bot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) May 8, 2026

Copilot started reviewing on behalf of shellfish007 May 8, 2026 02:08

gemini-code-assist Bot reviewed May 8, 2026

proposal: multi-cluster queue management

ca1ecca

shellfish007 force-pushed the multi-cluster-queue-management branch from c3677e5 to ca1ecca Compare May 8, 2026 02:10

Copilot AI reviewed May 8, 2026

shellfish007 added 2 commits May 7, 2026 22:27

proposal: add validation webhook to Phase 1, enforce singleton name

a87ce01

karmada-bot assigned seanlaii May 12, 2026

karmada-bot assigned mszacillo May 12, 2026

shellfish007 wants to merge 3 commits into karmada-io:master from shellfish007:multi-cluster-queue-management
