
[iris] Add tpu_pools config sugar and allocation tier blocking#4392

Merged
rjpower merged 3 commits into main from claude/laughing-lalande
Apr 3, 2026
Conversation

Collaborator

@rjpower rjpower commented Apr 3, 2026

Add tpu_pools YAML key that expands into per-(size, zone) scale groups with
topology-derived fields and allocation tier metadata. When the autoscaler sees
quota-exceeded or backoff at tier N in a quota_pool, it blocks tiers >= N in
that pool (per-zone), preventing wasteful crawl-up through larger TPU sizes
that GCP guarantees are also unavailable. Migrates the production configs
(marin.yaml, marin-dev.yaml, smoke-gcp.yaml) to the new format and updates
the autoscaler dashboard with collapsible pool grouping and tier chain
visualization.

Design doc: lib/iris/docs/tpu-pool-expansion.md

@rjpower rjpower added the agent-generated (Created by automation/agent) label Apr 3, 2026
Collaborator Author

rjpower commented Apr 3, 2026

🤖 Specification (>500 LOC)

Problem: The autoscaler crawls up the priority waterfall trying larger TPU sizes when smaller ones fail. GCP TPU capacity is monotonic per-zone: if v5p-8 is unavailable, v5p-16/32/etc will also be unavailable. This wastes API rate limit tokens, adds latency, and risks accidental over-allocation. The production config also had ~35 nearly-identical scale group entries differing only in size-derived fields.

Approach:

  1. Config expansion (config.py): New _expand_tpu_pools() function runs before _expand_multi_zone_groups(). For each pool x size x zone, emits a fully-specified scale group with topology-derived fields (device_variant, num_vms, device_count from TpuTopologyInfo) and allocation metadata (quota_pool = pool_name/zone, allocation_tier = 1-based index in sorted sizes).

  2. Proto (config.proto): Added quota_pool (string) and allocation_tier (int32) fields to ScaleGroupConfig.

  3. Types (types.py): Added TPU_FAMILY_VARIANT_PREFIX dict mapping family names to variant prefixes (v5e -> v5litepod, v6e -> v6e, etc) and tpu_variant_name() helper.

  4. Autoscaler (autoscaler.py): _pool_blocked_tiers() computes the minimum blocked tier per quota_pool by scanning groups in QUOTA_EXCEEDED or BACKOFF state. _is_tier_blocked() filters matching groups in route_demand() before budget assignment.

  5. Dashboard (AutoscalerTab.vue): Groups scale groups by quota_pool with collapsible sections showing tier chain visualization ([T1 ok] -> [T2 blocked] -> [T3 blocked]). Blocked tiers render at reduced opacity.
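To make the expansion in step 1 concrete, here is a minimal, hypothetical Python sketch. The topology table, field names, and group-naming scheme below are illustrative stand-ins for the real TpuTopologyInfo lookup and ScaleGroupConfig proto, not the actual iris implementation:

```python
from dataclasses import dataclass

# Illustrative stand-in for the TpuTopologyInfo lookup in types.py;
# the real values come from the TPU topology tables.
TPU_TOPOLOGY = {
    ("v5p", 8): {"device_variant": "v5p-8", "num_vms": 1, "device_count": 4},
    ("v5p", 16): {"device_variant": "v5p-16", "num_vms": 2, "device_count": 8},
    ("v5p", 32): {"device_variant": "v5p-32", "num_vms": 4, "device_count": 16},
}

@dataclass
class ScaleGroup:
    name: str
    zone: str
    device_variant: str
    num_vms: int
    device_count: int
    quota_pool: str       # "<pool_name>/<zone>"
    allocation_tier: int  # 1-based index in sorted sizes

def expand_tpu_pools(pool_name, family, sizes, zones):
    """Emit one fully-specified scale group per (size, zone) pair."""
    groups = []
    for zone in zones:
        for tier, size in enumerate(sorted(sizes), start=1):
            topo = TPU_TOPOLOGY[(family, size)]
            groups.append(ScaleGroup(
                name=f"{pool_name}-{family}-{size}-{zone}",
                zone=zone,
                quota_pool=f"{pool_name}/{zone}",
                allocation_tier=tier,
                **topo,
            ))
    return groups

groups = expand_tpu_pools("marin", "v5p", [8, 16, 32],
                          ["us-east5-a", "us-east5-b"])
```

One pool with three sizes and two zones thus expands into six fully-specified scale groups, which is the deduplication the PR description refers to.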

Key code: The tier blocking filter in route_demand():

pool_blocked = _pool_blocked_tiers(sorted_groups, ts)
# ... inside entry loop, after matching_groups computed:
if pool_blocked:
    matching_groups = [g for g in matching_groups if not _is_tier_blocked(g, pool_blocked)]
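A rough sketch of what the two helpers called above might look like. This is hypothetical: the real autoscaler.py operates on its own group/state objects, while the dict-based groups here are purely illustrative:

```python
QUOTA_EXCEEDED = "quota_exceeded"
BACKOFF = "backoff"

def pool_blocked_tiers(groups):
    """Lowest failing tier per quota_pool; tiers >= it are blocked."""
    blocked = {}
    for g in groups:
        if g.get("quota_pool") and g.get("state") in (QUOTA_EXCEEDED, BACKOFF):
            pool, tier = g["quota_pool"], g["allocation_tier"]
            # Keep the minimum tier seen in a failing state per pool.
            blocked[pool] = min(blocked.get(pool, tier), tier)
    return blocked

def is_tier_blocked(group, blocked):
    """A group is blocked if its pool has a failing tier at or below it."""
    pool = group.get("quota_pool")
    return bool(pool) and pool in blocked and group["allocation_tier"] >= blocked[pool]

groups = [
    {"quota_pool": "marin/us-east5-a", "allocation_tier": 2, "state": QUOTA_EXCEEDED},
    {"quota_pool": "marin/us-east5-a", "allocation_tier": 3, "state": "ok"},
    {"quota_pool": "marin/us-east5-b", "allocation_tier": 1, "state": "ok"},
]
blocked = pool_blocked_tiers(groups)
```

Note that the tier-3 group is blocked even though it reports a healthy state, while the same tier chain in the other zone's pool is untouched; that is the per-zone monotonicity assumption at work.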

Tests:

  • TestTpuPoolExpansion (10 tests): expansion correctness, topology derivation, priority override, zone handling, validation errors (unknown family/size, empty sizes, duplicate zones, name collision), coexistence with manual groups, multiple pools same family.
  • TestAllocationTierBlocking (5 tests): tier blocked when lower tier quota-exceeded, higher tiers blocked but lower unaffected, independent pools not affected, groups without pool unaffected, same-tier groups independent.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 40a6d07ab2


Comment thread lib/iris/src/iris/cluster/controller/autoscaler.py
Comment thread lib/iris/dashboard/src/components/controller/AutoscalerTab.vue
@rjpower rjpower requested a review from yonromai April 3, 2026 16:33
Contributor

@yonromai yonromai left a comment


The overall direction looks good: the config sugar is much easier to maintain, the targeted config/routing test suites passed locally, and the dashboard changes line up with the new metadata. I left one follow-up about preserving an accurate unmet-demand reason when allocation-tier filtering removes all otherwise-matching groups.

Generated with Codex.

matching_groups = [g for g in matching_groups if not _is_tier_blocked(g, pool_blocked)]

if not matching_groups:
    unmet.append(UnmetDemand(entry=entry, reason=_diagnose_no_matching_group(entry, sorted_groups)))
Contributor


P2 When allocation-tier filtering removes every hard-matching group, this falls into _diagnose_no_matching_group(...), which reports no_matching_group even though matching groups do exist and were only skipped because the pool is tier-blocked. I reproduced this with a tier-1 quota_exceeded group plus a tier-2 match: routing correctly refuses the tier-2 group, but the unmet reason becomes no groups match device=tpu:v5p-16, which is misleading for operators and the dashboard. Recommended fix: preserve the pre-filter match set and emit a tier-blocked / no-capacity reason when filtering is what emptied the candidate list.

Generated with Codex.

rjpower added 2 commits April 3, 2026 12:29
Resolve merge conflicts from upstream preemptible→capacity_type rename.
Update all tpu_pools YAML configs and test fixtures to use the new
capacity_type field. Regenerate protobuf files.
… list

When allocation-tier monotonicity removes all hard-matching groups, emit
a tier_blocked reason instead of falling through to no_matching_group.
Operators and the dashboard now see the actual cause rather than a
misleading "no groups match" message.
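The tier_blocked fallback this commit describes could be sketched as follows. The function and reason names are hypothetical, mirroring the review comment rather than the actual route_demand() code:

```python
def route_entry(entry, matching_groups, pool_blocked, is_tier_blocked):
    """Return (groups, None) on success or (None, reason) when unmet."""
    # Preserve the pre-filter match set so we can tell the two cases apart.
    pre_filter = list(matching_groups)
    if pool_blocked:
        matching_groups = [
            g for g in matching_groups if not is_tier_blocked(g, pool_blocked)
        ]
    if not matching_groups:
        # If tier filtering is what emptied the list, report tier_blocked
        # so operators see the real cause instead of "no groups match".
        reason = "tier_blocked" if pre_filter else "no_matching_group"
        return None, reason
    return matching_groups, None
```

The key design point is that the diagnosis depends on whether candidates existed before filtering, which is exactly the information the original code path discarded.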
@rjpower rjpower merged commit 6ffe272 into main Apr 3, 2026
36 of 37 checks passed
@rjpower rjpower deleted the claude/laughing-lalande branch April 3, 2026 19:43
Helw150 pushed a commit that referenced this pull request Apr 8, 2026