[iris] Reject jobs with unsatisfiable routing constraints at submit time#4681
Open
claude[bot] wants to merge 1 commit into main
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a9109a1933
9f9f294 to e77ae10
…ne override

Adds an eager feasibility gate at LaunchJob so jobs whose hard routing constraints can never be satisfied fail fast instead of sitting in the pending queue. Coscheduled jobs also verify replicas are an exact multiple of some matching group's num_vms.

Consolidates the predicate: one job_feasibility() + one _diagnose() in routing.py, exposed on Autoscaler as a single job_feasibility(constraints, replicas=None). service.launch_job calls it once instead of chaining two checks.

Removes the worker.attributes[REGION/ZONE] override path in ScalingGroup.region/zone, which was 100% redundant with the slice_template.gcp.zone / coreweave.region fallback. The two production writers (_expand_tpu_pools and _expand_multi_zone_groups) set values identical to what the fallback derives. _validate_worker_settings now explicitly rejects REGION/ZONE in worker.attributes, so stale configs fail loudly.

Also fixes a latent bug the feasibility gate exposed: the fake GCP provider was synthesizing region=local on workers while groups derived region from slice_template (europe-west4), causing child jobs from zephyr coordinators to be rejected with "no groups in region local". Removed the synthetic region; local workers now report no region attribute, consistent with production configs that don't set one.

Net -206 LOC.
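As a rough illustration of the gate described in this commit message, here is a minimal sketch of a combined feasibility check. It is not the code from routing.py: the `Group` dataclass, its field names, the dict-shaped `constraints`, and the returned reason strings are all assumptions made for the example. Only the name `job_feasibility`, the `replicas=None` parameter, and the `num_vms` multiple rule come from the message above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Group:
    # Hypothetical stand-in for a scaling group; field names are assumptions.
    name: str
    device: str
    region: str
    num_vms: int


def job_feasibility(groups, constraints, replicas=None):
    """Return None if some group could host the job, else a reason string."""
    # A group matches when every hard constraint equals the group's attribute.
    matching = [
        g for g in groups
        if all(getattr(g, key) == value for key, value in constraints.items())
    ]
    if not matching:
        return f"no scaling group satisfies {constraints}"
    # Coscheduled jobs pass replicas; it must divide evenly into some group size.
    if replicas is not None:
        sizes = sorted({g.num_vms for g in matching})
        if not any(replicas % size == 0 for size in sizes):
            return f"replicas={replicas} is not a multiple of any matching num_vms {sizes}"
    return None


GROUPS = [
    Group("tpu-eu", device="tpu", region="europe-west4", num_vms=4),
    Group("tpu-us", device="tpu", region="us-central1", num_vms=8),
]
```

With these example groups, `job_feasibility(GROUPS, {"device": "tpu"}, replicas=8)` returns `None`, while `replicas=6` yields a rejection reason because 6 is a multiple of neither 4 nor 8. Returning a reason string rather than raising keeps the predicate usable for both submit-time rejection and diagnostics; whether the real implementation does this is not stated in the commit message.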
e77ae10 to d13d528
Adds check_routing_feasibility() to the Autoscaler, called from launch_job for
all jobs (not just coscheduled). When no scaling group can satisfy the job's
hard routing constraints (device type/variant, region, zone, preemptible), the
submission is rejected with an actionable diagnostic: fuzzy-matched zone/region
suggestions and zone-vs-region confusion detection. Soft constraints are
excluded so preferences don't block submission. Also sets region=local on the
local cluster's scaling group to match the region the fake provider advertises
on workers, which fixes inherited-constraint failures for child jobs in local mode.
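The diagnostic described above (fuzzy-matched suggestions plus zone-vs-region confusion detection) could look roughly like the sketch below. The helper name `diagnose_location`, its parameters, and the message wording are hypothetical and not taken from this PR; the fuzzy matching uses the standard library's `difflib.get_close_matches`, which may differ from what the PR actually uses.

```python
import difflib


def diagnose_location(key, value, known_regions, known_zones):
    """Build a hint for an unmatched region/zone constraint (hypothetical helper)."""
    # Zone-vs-region confusion: the value exists, but under the other key.
    if key == "region" and value in known_zones:
        return f"'{value}' is a zone, not a region; use zone={value}"
    if key == "zone" and value in known_regions:
        return f"'{value}' is a region, not a zone; use region={value}"
    # Otherwise suggest the closest known names of the requested kind.
    candidates = known_regions if key == "region" else known_zones
    close = difflib.get_close_matches(value, sorted(candidates), n=3, cutoff=0.6)
    if close:
        return f"no groups in {key} '{value}'; did you mean {', '.join(close)}?"
    return f"no groups in {key} '{value}'"


REGIONS = {"europe-west4", "us-central1"}
ZONES = {"europe-west4-a", "us-central1-b"}
```

For example, `diagnose_location("region", "europe-west4-a", REGIONS, ZONES)` reports that the value is a zone rather than a region, and a typo such as `"europe-west4-b"` for a zone produces a suggestion of the nearest known zone.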
Fixes #4679