[iris] Reject jobs with unsatisfiable routing constraints at submit time #4681

Open
claude[bot] wants to merge 1 commit into main from agent/20260412-fix-4679

claude bot commented Apr 12, 2026

Adds check_routing_feasibility() to the Autoscaler, called from launch_job for
all jobs (not just coscheduled). When no scaling group can satisfy the job's
hard routing constraints (device type/variant, region, zone, preemptible), the
submission is rejected with an actionable diagnostic: fuzzy-matched zone/region
suggestions and zone-vs-region confusion detection. Soft constraints are
excluded so preferences don't block submission. Also sets region=local on the
local cluster scale group to match what the fake provider advertises on workers,
fixing inherited-constraint failures for child jobs in local mode.

Fixes #4679

claude bot added the agent-generated label (Created by automation/agent) Apr 12, 2026

chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a9109a1933

@rjpower force-pushed the agent/20260412-fix-4679 branch from 9f9f294 to e77ae10 on April 13, 2026 01:03
…ne override

Adds an eager feasibility gate at LaunchJob so jobs whose hard routing
constraints can never be satisfied fail fast instead of sitting in the
pending queue. Coscheduled jobs also verify replicas are an exact
multiple of some matching group's num_vms.

Consolidates the predicate: one job_feasibility() + one _diagnose() in
routing.py, exposed on Autoscaler as a single
job_feasibility(constraints, replicas=None). service.launch_job calls it
once instead of chaining two checks.
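
As a rough illustration of the consolidated predicate, the sketch below filters groups by hard constraints only and then, for coscheduled jobs, checks the replica count against `num_vms`. Everything here is assumed for illustration: the `Group` and `Constraint` types, the attribute-dictionary matching, and the free-function form taking `groups` as an argument (the commit describes a method on Autoscaler with the signature `job_feasibility(constraints, replicas=None)`).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Constraint:
    # Hypothetical shape: a key/value requirement, hard unless marked soft.
    key: str
    value: str
    hard: bool = True


@dataclass(frozen=True)
class Group:
    # Hypothetical stand-in for a scaling group; num_vms is assumed >= 1.
    name: str
    attributes: dict = field(default_factory=dict)
    num_vms: int = 1


def job_feasibility(groups, constraints, replicas=None) -> Optional[str]:
    """Return None if some group can host the job, else a reason string."""
    # Soft constraints are preferences, so they never block submission.
    hard = [c for c in constraints if c.hard]
    matching = [
        g for g in groups
        if all(g.attributes.get(c.key) == c.value for c in hard)
    ]
    if not matching:
        wanted = ", ".join(f"{c.key}={c.value}" for c in hard)
        return f"no scaling group satisfies hard constraints: {wanted}"

    # Coscheduled jobs: replicas must be an exact multiple of some
    # matching group's num_vms.
    if replicas is not None and not any(replicas % g.num_vms == 0 for g in matching):
        sizes = sorted({g.num_vms for g in matching})
        return f"replicas={replicas} is not a multiple of num_vms in any matching group {sizes}"
    return None


groups = [Group("tpu-pool", {"region": "europe-west4"}, num_vms=4)]
print(job_feasibility(groups, [Constraint("region", "europe-west4")], replicas=6))
```

Returning a reason string rather than a boolean lets the caller reject the submission with the diagnostic in one step, which matches the single-call shape the commit message describes.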

Removes the worker.attributes[REGION/ZONE] override path in
ScalingGroup.region/zone, which was 100% redundant with the
slice_template.gcp.zone / coreweave.region fallback. The two production
writers (_expand_tpu_pools and _expand_multi_zone_groups) set values
identical to what the fallback derives. _validate_worker_settings now
explicitly rejects REGION/ZONE in worker.attributes — stale configs
fail loudly.
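
A validation of this kind might be sketched as follows. The function name, the reserved key strings, and the error text are assumptions made for illustration; the actual `_validate_worker_settings` and the REGION/ZONE constants are not shown in this PR page.

```python
# Assumed key names; the real REGION/ZONE constants may differ.
RESERVED_ATTRIBUTE_KEYS = frozenset({"region", "zone"})


def validate_worker_attributes(group_name, attributes):
    """Raise ValueError if a group's worker attributes set region or zone.

    Hypothetical helper: region/zone are expected to be derived from the
    slice template, so an explicit override is treated as a stale config.
    """
    clashes = sorted(RESERVED_ATTRIBUTE_KEYS & set(attributes))
    if clashes:
        raise ValueError(
            f"scale group '{group_name}': worker.attributes must not set "
            f"{', '.join(clashes)}; these are derived from the slice template"
        )


validate_worker_attributes("tpu-pool", {"pool": "training"})  # passes silently
```

Raising at config-load time, rather than ignoring the key, is what makes a stale config surface immediately instead of silently routing on a value that is no longer consulted.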

Also fixes a latent bug the feasibility gate exposed: the fake GCP
provider was synthesizing region=local on workers while groups derived
region from slice_template (europe-west4), causing child jobs from
zephyr coordinators to be rejected with "no groups in region local".
Removed the synthetic region; local workers now report no region
attribute, consistent with production configs that don't set one.

Net -206 LOC.

@rjpower force-pushed the agent/20260412-fix-4679 branch from e77ae10 to d13d528 on April 13, 2026 01:06

Development

Successfully merging this pull request may close these issues.

iris: reject unsatisfiable constraints
