[iris] Add tpu_pools config sugar and allocation tier blocking #4392
Conversation
Add tpu_pools YAML key that expands into per-(size, zone) scale groups with topology-derived fields and allocation tier metadata. When the autoscaler sees a quota-exceeded or backoff state at tier N in a quota_pool, it blocks all tiers >= N in that pool (per-zone), preventing wasteful crawl-up through larger TPU sizes. Migrate marin.yaml, marin-dev.yaml, and smoke-gcp.yaml to the new format. Update the autoscaler dashboard to group scale groups by quota pool with collapsible tier chain visualization.
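As a hedged illustration of the expansion described above (the helper name `expand_tpu_pool` and the exact dict fields are assumptions; the real expansion also derives topology fields), a minimal sketch of how one `tpu_pools` entry could become per-(size, zone) scale groups:

```python
def expand_tpu_pool(pool_name, sizes, zones):
    # Hypothetical sketch: each (size, zone) pair becomes one scale group,
    # with allocation_tier set to the index of the size in ascending order,
    # so larger TPU slices sit at higher tiers within the same quota pool.
    ordered = sorted(sizes, key=lambda s: int(s.rsplit("-", 1)[1]))
    groups = []
    for zone in zones:
        for tier, size in enumerate(ordered):
            groups.append({
                "name": f"{pool_name}-{size}-{zone}",
                "quota_pool": pool_name,
                "zone": zone,
                "accelerator": size,
                "allocation_tier": tier,
            })
    return groups
```

This keeps the config to one entry per pool instead of one entry per (size, zone) combination.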
🤖 Specification (>500 LOC)

**Problem:** The autoscaler crawls up the priority waterfall, trying larger TPU sizes when smaller ones fail. GCP TPU capacity is monotonic per zone: if v5p-8 is unavailable, v5p-16, v5p-32, etc. will also be unavailable. This wastes API rate-limit tokens, adds latency, and risks accidental over-allocation. The production config also had ~35 nearly identical scale group entries differing only in size-derived fields.

**Approach:**
**Key code:** the tier-blocking filter in `route_demand()`.

**Tests:**
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 40a6d07ab2
yonromai left a comment
The overall direction looks good: the config sugar is much easier to maintain, the targeted config/routing test suites passed locally, and the dashboard changes line up with the new metadata. I left one follow-up about preserving an accurate unmet-demand reason when allocation-tier filtering removes all otherwise-matching groups.
Generated with Codex.
```python
matching_groups = [g for g in matching_groups if not _is_tier_blocked(g, pool_blocked)]

if not matching_groups:
    unmet.append(UnmetDemand(entry=entry, reason=_diagnose_no_matching_group(entry, sorted_groups)))
```
P2 When allocation-tier filtering removes every hard-matching group, this falls into `_diagnose_no_matching_group(...)`, which reports `no_matching_group` even though matching groups do exist and were only skipped because the pool is tier-blocked. I reproduced this with a tier-1 `quota_exceeded` group plus a tier-2 match: routing correctly refuses the tier-2 group, but the unmet reason becomes `no groups match device=tpu:v5p-16`, which is misleading for operators and the dashboard. Recommended fix: preserve the pre-filter match set and emit a tier-blocked / no-capacity reason when filtering is what emptied the candidate list.
Generated with Codex.
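One way the suggested fix could be structured (a sketch; `no_matching_group` and `tier_blocked` follow the review thread, everything else is an assumption): capture the candidate list before tier filtering and report `tier_blocked` only when filtering is what emptied it.

```python
def classify_unmet_reason(pre_filter_groups, post_filter_groups):
    # Distinguish "nothing matched at all" from "matches existed but the
    # quota pool is tier-blocked", so operators see the real cause.
    if post_filter_groups:
        return None  # demand can still be routed
    if pre_filter_groups:
        return "tier_blocked"  # filtering emptied a non-empty match set
    return "no_matching_group"
```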
Resolve merge conflicts from upstream preemptible→capacity_type rename. Update all tpu_pools YAML configs and test fixtures to use the new capacity_type field. Regenerate protobuf files.
… list When allocation-tier monotonicity removes all hard-matching groups, emit a tier_blocked reason instead of falling through to no_matching_group. Operators and the dashboard now see the actual cause rather than a misleading "no groups match" message.
Add tpu_pools YAML key that expands into per-(size, zone) scale groups with topology-derived fields and allocation tier metadata. When the autoscaler sees quota-exceeded or backoff at tier N in a quota_pool, it blocks tiers >= N in that pool (per-zone), preventing wasteful crawl-up through larger TPU sizes that GCP guarantees are also unavailable. Migrates all production configs to the new format and updates the autoscaler dashboard with collapsible pool grouping and tier chain visualization. Design doc: lib/iris/docs/tpu-pool-expansion.md