
Conversation

@pkazmierczak pkazmierczak commented Oct 15, 2025

Canaries for system jobs are placed on a tg.update.canary percentage of eligible nodes. Some of those nodes may not be feasible, and until now we removed infeasible nodes during placement computation. However, if the first eligible node picked for a canary placement happened to be infeasible, the scheduler would halt the deployment.

The solution presented here simplifies canary deployments. Initially, system jobs that use canary updates get allocations placed on all eligible nodes, but before we compute actual placements, a method called evictUnneededCanaries is invoked (much like evictAndPlace is for honoring MaxParallel) to remove the canary placements that are not needed. We also change the behavior of computePlacements so that it no longer performs node feasibility checks, since these are now performed earlier for every allocation and node. This way we get accurate counts of all feasible nodes, which lets us correctly set the deployment state fields.
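As a rough illustration of the "place everywhere, then trim" approach, the sketch below computes how many canaries a tg.update.canary percentage implies and drops the excess placements. The types and helper names here (placement, requiredCanaries, evictUnneededCanaries) are simplified stand-ins, not Nomad's actual scheduler API.

```go
package main

import (
	"fmt"
	"math"
)

// placement is a simplified stand-in for a computed allocation placement.
type placement struct {
	nodeID string
	canary bool
}

// requiredCanaries returns how many canary placements a system job needs:
// a percentage of the feasible nodes, rounded up so that a non-zero
// percentage always yields at least one canary.
func requiredCanaries(feasibleNodes, canaryPercent int) int {
	if canaryPercent <= 0 {
		return 0
	}
	return int(math.Ceil(float64(feasibleNodes) * float64(canaryPercent) / 100))
}

// evictUnneededCanaries trims the initial "canary on every feasible node"
// placements down to the required count. Illustrative only; the real
// method also interacts with deployment state.
func evictUnneededCanaries(placements []placement, canaryPercent int) []placement {
	keep := requiredCanaries(len(placements), canaryPercent)
	if keep >= len(placements) {
		return placements
	}
	return placements[:keep]
}

func main() {
	all := []placement{
		{nodeID: "n1", canary: true},
		{nodeID: "n2", canary: true},
		{nodeID: "n3", canary: true},
		{nodeID: "n4", canary: true},
	}
	kept := evictUnneededCanaries(all, 30)
	fmt.Println(len(kept)) // 30% of 4 feasible nodes, rounded up: 2
}
```

Because the trimming happens before placement computation, the count is taken over nodes already known to be feasible, which is what makes the deployment-state counts accurate.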

Fixes: #26885
Fixes: #26886

@pkazmierczak pkazmierczak force-pushed the f-system-deployments-canaries-evict-refactor branch from 838bcd8 to e1234c1 Compare October 28, 2025 08:30
@pkazmierczak pkazmierczak force-pushed the f-system-deployments-canaries-evict-refactor branch from e1234c1 to a6cd581 Compare October 28, 2025 17:33
tgross added a commit that referenced this pull request Oct 28, 2025
Two groups on the same job cannot both have a static port assignment, but this
ends up getting configured in the update block test for system deployments. This
test setup bug has complicated landing the fix in #26953.
@pkazmierczak pkazmierczak force-pushed the f-system-deployments-canaries-evict-refactor branch from 697d0ff to 0c70ac8 Compare October 28, 2025 18:29
@pkazmierczak pkazmierczak force-pushed the f-system-deployments-canaries-evict-refactor branch from 54b7110 to e51cce3 Compare October 31, 2025 08:08
@pkazmierczak pkazmierczak marked this pull request as ready for review October 31, 2025 08:10
@pkazmierczak pkazmierczak requested review from a team as code owners October 31, 2025 08:10
… state

Previously we copied the behavior of the generic scheduler, relying on
reconciler results to decide whether enough placements were made. In the
system scheduler we always know exactly how many placements there should
be, based on the DesiredTotal field of the deployment state, so a simpler
and more reliable completeness check is to base it on dstate alone.
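A minimal sketch of a dstate-only completeness check, assuming a per-task-group state with DesiredTotal and HealthyAllocs counters (field names follow Nomad's deployment state, but the type and function here are illustrative):

```go
package main

import "fmt"

// dstate is a simplified per-task-group deployment state.
type dstate struct {
	DesiredTotal  int
	PlacedAllocs  int
	HealthyAllocs int
}

// isComplete decides deployment completeness from deployment state alone:
// for system jobs DesiredTotal is the exact number of required placements,
// so no counting over reconciler results is needed.
func isComplete(states map[string]*dstate) bool {
	for _, ds := range states {
		if ds.HealthyAllocs < ds.DesiredTotal {
			return false
		}
	}
	return true
}

func main() {
	states := map[string]*dstate{
		"web": {DesiredTotal: 3, PlacedAllocs: 3, HealthyAllocs: 3},
		"db":  {DesiredTotal: 2, PlacedAllocs: 2, HealthyAllocs: 1},
	}
	fmt.Println(isComplete(states)) // false: "db" still has an unhealthy alloc
}
```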
…iler

In contrast to the cluster reconciler, the node reconciler goes
node-by-node rather than alloc-by-alloc, so the reconciler's state has to
be managed differently. If we overrode old deployments on every run of
`cancelUnneededDeployments`, we would end up creating unnecessary
deployments for job versions that already have them.
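The reuse rule above can be sketched as: only create a deployment for a job version when none exists for it yet. The deployment type and reuseOrCreate helper below are hypothetical simplifications of what the node reconciler tracks, not Nomad's actual code.

```go
package main

import "fmt"

// deployment is a minimal stand-in for a tracked deployment.
type deployment struct {
	ID         string
	JobVersion uint64
}

// reuseOrCreate keeps an existing deployment for the current job version
// instead of replacing it on every reconciler pass. Because the node
// reconciler runs once per node, the same job version is seen many times,
// and a naive "always create" would produce duplicate deployments.
func reuseOrCreate(existing *deployment, jobVersion uint64, newID string) *deployment {
	if existing != nil && existing.JobVersion == jobVersion {
		return existing
	}
	return &deployment{ID: newID, JobVersion: jobVersion}
}

func main() {
	first := reuseOrCreate(nil, 7, "dep-1")
	second := reuseOrCreate(first, 7, "dep-2") // same job version: reuse dep-1
	fmt.Println(second.ID == first.ID)         // true
}
```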
