Commit b1da2b3

[iris] Propagate derived region/zone to worker configs (#4720)
🤖 _PR description written by Claude Code._

## Summary

Finishes the region/zone refactor started in #4681. That PR collapsed region/zone to a single source of truth (`slice_template.gcp.zone`) on the control side, but left `Autoscaler._per_group_worker_config` reading the now-empty `group.config.worker.attributes` dict. New workers stopped publishing a `region` attribute at registration, which silently locked every job with a hard `region` constraint out of fresh capacity.

## Background

#4679 reported jobs sitting forever in PENDING when no scaling group matched their routing constraints. #4681 fixed that at submit time (`check_routing_feasibility` / `job_feasibility`) and, as a secondary cleanup, removed the explicit `worker.attributes.region`/`zone` entries from `_expand_tpu_pools` and `_expand_multi_zone_groups`, making `ScalingGroup.region`/`.zone` derive purely from `slice_template.gcp.zone`. The validator in `config.py` now even rejects YAML that sets these attributes explicitly.

That refactor was correct on the control side: submit-time routing, feasibility checks, and autoscaler demand planning all derive from the slice template. The part that got missed is the *data* side: `runtime.py:_per_group_worker_config` still builds each new worker's `WorkerConfig.worker_attributes` by copying from `group.config.worker.attributes`, which is now empty for region/zone. The worker boots, `build_worker_metadata` merges its (empty) `extra_attributes`, and `register_or_refresh_worker` writes zero `region` rows into the `worker_attributes` table.
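The single-source-of-truth derivation the refactor relies on can be sketched as follows. This is a minimal standalone sketch, not the iris implementation: the helper name `region_from_zone` and the exact parsing rule (drop the trailing single-letter zone suffix) are assumptions based on GCP's zone naming convention.

```python
def region_from_zone(zone: str) -> str:
    """Derive a GCP region from a zone name by dropping the trailing
    single-letter suffix, e.g. "us-east5-a" -> "us-east5".

    Hypothetical helper; in iris the derived values are exposed as
    ScalingGroup.region / ScalingGroup.zone from slice_template.gcp.zone.
    """
    region, _, suffix = zone.rpartition("-")
    if not region or len(suffix) != 1:
        raise ValueError(f"not a GCP zone name: {zone!r}")
    return region
```

Under this rule a group configured with `zones=["us-east5-a"]` yields region `us-east5` and zone `us-east5-a` with no explicit attributes in the YAML.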
## Symptom on marin cluster

Eczech's `bolinas_scaling_sweep` child `train_lm` tasks sat pending with:

```
Scheduler: Insufficient TPUs (need 4, available 0) - 2 worker(s) (with constraints=['device-type', 'device-variant', 'region', 'reservation-job'])
Autoscaler: tier_blocked: 1 matching group(s) blocked by quota-pool tier monotonicity
```

There were 74 BATCH v5p-8 tasks (michaelryan's extracts) on us-east5-a workers sitting at lower priority, all marked `preemptible_by INTERACTIVE`, but preemption never fired. `_run_preemption_pass` only considers victim tasks whose worker passes the preemptor's constraint filter (`controller.py:546`), and none of the BATCH workers carried a `region` attribute, so they never entered the candidate set.

Cluster-wide join of `workers.md_git_hash` × `worker_attributes` at the time:

| md_git_hash | total | with `region` |
|---|---:|---:|
| `91aade6de` (pre-#4681 build) | 375 | 375 |
| `b2ece3448` (post-#4681 build) | 381 | 0 |
| `4657a5aa2` | 40 | 0 |
| `d042e4872` | 9 | 0 |

Last worker with `region=us-east5`: `marin-tpu-v5p-preemptible-8-us-east5-a-20260413-2127-0fa6c413-worker-0` (~21:27 UTC 2026-04-13). First without: `…20260414-0032-9ff96f3d-worker-0`. That gap lines up with the controller redeploy a few hours after #4681 merged.

~38 pending tasks across 7 users (eczech, larry, calvinxu, moojink, konwoo, ahmed, rohith) all carried a hard `region` constraint and were affected. Jobs using soft region constraints (`mode=CONSTRAINT_MODE_PREFERRED`, e.g. michaelryan's extract launchers) were unaffected: soft constraints only influence ranking, not the candidate filter.
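The hard-vs-soft distinction above can be sketched in plain Python. This is an illustrative model, not the iris scheduler: the `Constraint` dataclass, the mode strings, and `candidate_workers` are hypothetical stand-ins; workers are modeled as attribute dicts. The point it demonstrates is that a REQUIRED constraint excludes a worker missing the attribute entirely, while a PREFERRED one only reorders the eligible set.

```python
from dataclasses import dataclass

REQUIRED = "CONSTRAINT_MODE_REQUIRED"
PREFERRED = "CONSTRAINT_MODE_PREFERRED"

@dataclass
class Constraint:
    key: str
    value: str
    mode: str = REQUIRED

def candidate_workers(workers: list[dict], constraints: list[Constraint]) -> list[dict]:
    """Hard constraints gate candidacy; soft constraints only rank.

    A worker with no `region` attribute at all fails every hard
    region constraint, which is exactly how the missing-attribute
    bug locked jobs out of fresh capacity.
    """
    hard = [c for c in constraints if c.mode == REQUIRED]
    soft = [c for c in constraints if c.mode == PREFERRED]
    eligible = [w for w in workers if all(w.get(c.key) == c.value for c in hard)]
    # Soft constraints reorder eligible workers but never exclude any.
    return sorted(eligible, key=lambda w: -sum(w.get(c.key) == c.value for c in soft))
```

With `workers = [{"region": "us-east5"}, {}]`, a hard `region=us-east5` constraint leaves one candidate, while the same constraint in PREFERRED mode leaves both (the matching worker ranked first), matching the observed behavior of the two job populations.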
1 parent 3f80b5f commit b1da2b3

File tree

2 files changed (+28, -2 lines)


lib/iris/src/iris/cluster/controller/autoscaler/runtime.py

Lines changed: 9 additions & 1 deletion
```diff
@@ -23,7 +23,7 @@
 from collections import deque
 from collections.abc import Sequence

-from iris.cluster.constraints import Constraint
+from iris.cluster.constraints import Constraint, WellKnownAttribute
 from iris.cluster.providers.protocols import WorkerInfraProvider
 from iris.cluster.providers.types import (
     CloudSliceState,
@@ -379,6 +379,14 @@ def _per_group_worker_config(self, group: ScalingGroup) -> config_pb2.WorkerConf
         for k, v in group.config.worker.attributes.items():
             wc.worker_attributes[k] = v

+        region = group.region
+        if region and not wc.worker_attributes.get(WellKnownAttribute.REGION):
+            wc.worker_attributes[WellKnownAttribute.REGION] = region
+
+        zone = group.zone
+        if zone and not wc.worker_attributes.get(WellKnownAttribute.ZONE):
+            wc.worker_attributes[WellKnownAttribute.ZONE] = zone
+
         if group.config.name:
             wc.worker_attributes["scale-group"] = group.config.name
```

lib/iris/tests/cluster/controller/test_autoscaler.py

Lines changed: 19 additions & 1 deletion
```diff
@@ -29,7 +29,7 @@
     make_mock_slice_handle,
     make_mock_worker_handle,
 )
-from iris.cluster.constraints import DeviceType
+from iris.cluster.constraints import DeviceType, WellKnownAttribute
 from iris.cluster.types import WorkerStatus
 from iris.rpc import config_pb2, vm_pb2
 from iris.time_proto import duration_to_proto
@@ -1187,6 +1187,24 @@ def test_worker_attributes_injected(self):
         assert wc is not None
         assert wc.worker_attributes["team"] == "euw4"

+    def test_derives_region_and_zone_from_scale_group_when_missing(self):
+        """Derived region and zone are injected when worker attrs omit them."""
+        base_wc = config_pb2.WorkerConfig(
+            docker_image="ghcr.io/marin-community/iris-worker:latest",
+            port=10001,
+            controller_address="controller:10000",
+        )
+        sg_config = make_scale_group_config(name="east-group", max_slices=5, zones=["us-east5-a"])
+
+        group = ScalingGroup(sg_config, make_mock_platform())
+        autoscaler = make_autoscaler({"east-group": group}, base_worker_config=base_wc)
+
+        wc = autoscaler._per_group_worker_config(group)
+
+        assert wc is not None
+        assert wc.worker_attributes[WellKnownAttribute.REGION] == "us-east5"
+        assert wc.worker_attributes[WellKnownAttribute.ZONE] == "us-east5-a"
+

 class TestGpuScaleGroupBugs:
     """Reproduction tests for GPU scale group bugs observed on CoreWeave."""
```
