
[iris] Add tpu_pools config sugar and allocation tier blocking#4392

Merged
rjpower merged 3 commits into main from claude/laughing-lalande
Apr 3, 2026
Conversation

Collaborator

@rjpower rjpower commented Apr 3, 2026

Add tpu_pools YAML key that expands into per-(size, zone) scale groups with
topology-derived fields and allocation tier metadata. When the autoscaler sees
quota-exceeded or backoff at tier N in a quota_pool, it blocks tiers >= N in
that pool (per-zone), preventing wasteful crawl-up through larger TPU sizes
that GCP guarantees are also unavailable. Migrates the production configs
(marin.yaml, marin-dev.yaml, smoke-gcp.yaml) to the new format and updates
the autoscaler dashboard with collapsible pool grouping and tier chain
visualization.

Design doc: lib/iris/docs/tpu-pool-expansion.md

@rjpower rjpower added the agent-generated (Created by automation/agent) label Apr 3, 2026
Collaborator Author

rjpower commented Apr 3, 2026

🤖 Specification (>500 LOC)

Problem: The autoscaler crawls up the priority waterfall trying larger TPU sizes when smaller ones fail. GCP TPU capacity is monotonic per-zone: if v5p-8 is unavailable, v5p-16/32/etc will also be unavailable. This wastes API rate limit tokens, adds latency, and risks accidental over-allocation. The production config also had ~35 nearly-identical scale group entries differing only in size-derived fields.

Approach:

  1. Config expansion (config.py): New _expand_tpu_pools() function runs before _expand_multi_zone_groups(). For each pool x size x zone, emits a fully-specified scale group with topology-derived fields (device_variant, num_vms, device_count from TpuTopologyInfo) and allocation metadata (quota_pool = pool_name/zone, allocation_tier = 1-based index in sorted sizes).

  2. Proto (config.proto): Added quota_pool (string) and allocation_tier (int32) fields to ScaleGroupConfig.

  3. Types (types.py): Added TPU_FAMILY_VARIANT_PREFIX dict mapping family names to variant prefixes (v5e -> v5litepod, v6e -> v6e, etc) and tpu_variant_name() helper.

  4. Autoscaler (autoscaler.py): _pool_blocked_tiers() computes the minimum blocked tier per quota_pool by scanning groups in QUOTA_EXCEEDED or BACKOFF state. _is_tier_blocked() filters matching groups in route_demand() before budget assignment.

  5. Dashboard (AutoscalerTab.vue): Groups scale groups by quota_pool with collapsible sections showing tier chain visualization ([T1 ok] -> [T2 blocked] -> [T3 blocked]). Blocked tiers render at reduced opacity.
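To make the expansion in step 1 concrete, here is a minimal, hypothetical Python sketch. The topology table, field names, and group-naming scheme below are illustrative stand-ins for the real TpuTopologyInfo lookup and ScaleGroupConfig proto, not the actual iris implementation:

```python
from dataclasses import dataclass

# Illustrative stand-in for the TpuTopologyInfo lookup in types.py;
# the real values come from the TPU topology tables.
TPU_TOPOLOGY = {
    ("v5p", 8): {"device_variant": "v5p-8", "num_vms": 1, "device_count": 4},
    ("v5p", 16): {"device_variant": "v5p-16", "num_vms": 2, "device_count": 8},
    ("v5p", 32): {"device_variant": "v5p-32", "num_vms": 4, "device_count": 16},
}

@dataclass
class ScaleGroup:
    name: str
    zone: str
    device_variant: str
    num_vms: int
    device_count: int
    quota_pool: str       # "<pool_name>/<zone>"
    allocation_tier: int  # 1-based index in sorted sizes

def expand_tpu_pools(pool_name, family, sizes, zones):
    """Emit one fully-specified scale group per (size, zone) pair."""
    groups = []
    for zone in zones:
        for tier, size in enumerate(sorted(sizes), start=1):
            topo = TPU_TOPOLOGY[(family, size)]
            groups.append(ScaleGroup(
                name=f"{pool_name}-{family}-{size}-{zone}",
                zone=zone,
                quota_pool=f"{pool_name}/{zone}",
                allocation_tier=tier,
                **topo,
            ))
    return groups

groups = expand_tpu_pools("marin", "v5p", [8, 16, 32],
                          ["us-east5-a", "us-east5-b"])
```

One pool with three sizes and two zones thus expands into six fully-specified scale groups, which is the deduplication the PR description refers to.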

Key code: The tier blocking filter in route_demand():

pool_blocked = _pool_blocked_tiers(sorted_groups, ts)
# ... inside entry loop, after matching_groups computed:
if pool_blocked:
    matching_groups = [g for g in matching_groups if not _is_tier_blocked(g, pool_blocked)]
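A rough sketch of what the two helpers called above might look like. This is hypothetical: the real autoscaler.py operates on its own group/state objects, while the dict-based groups here are purely illustrative:

```python
QUOTA_EXCEEDED = "quota_exceeded"
BACKOFF = "backoff"

def pool_blocked_tiers(groups):
    """Lowest failing tier per quota_pool; tiers >= it are blocked."""
    blocked = {}
    for g in groups:
        if g.get("quota_pool") and g.get("state") in (QUOTA_EXCEEDED, BACKOFF):
            pool, tier = g["quota_pool"], g["allocation_tier"]
            # Keep the minimum tier seen in a failing state per pool.
            blocked[pool] = min(blocked.get(pool, tier), tier)
    return blocked

def is_tier_blocked(group, blocked):
    """A group is blocked if its pool has a failing tier at or below it."""
    pool = group.get("quota_pool")
    return bool(pool) and pool in blocked and group["allocation_tier"] >= blocked[pool]

groups = [
    {"quota_pool": "marin/us-east5-a", "allocation_tier": 2, "state": QUOTA_EXCEEDED},
    {"quota_pool": "marin/us-east5-a", "allocation_tier": 3, "state": "ok"},
    {"quota_pool": "marin/us-east5-b", "allocation_tier": 1, "state": "ok"},
]
blocked = pool_blocked_tiers(groups)
```

Note that the tier-3 group is blocked even though it reports a healthy state, while the same tier chain in the other zone's pool is untouched; that is the per-zone monotonicity assumption at work.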

Tests:

  • TestTpuPoolExpansion (10 tests): expansion correctness, topology derivation, priority override, zone handling, validation errors (unknown family/size, empty sizes, duplicate zones, name collision), coexistence with manual groups, multiple pools same family.
  • TestAllocationTierBlocking (5 tests): tier blocked when lower tier quota-exceeded, higher tiers blocked but lower unaffected, independent pools not affected, groups without pool unaffected, same-tier groups independent.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 40a6d07ab2


Comment thread lib/iris/src/iris/cluster/controller/autoscaler.py
Comment thread lib/iris/dashboard/src/components/controller/AutoscalerTab.vue
@rjpower rjpower requested a review from yonromai April 3, 2026 16:33
Contributor

@yonromai yonromai left a comment


The overall direction looks good: the config sugar is much easier to maintain, the targeted config/routing test suites passed locally, and the dashboard changes line up with the new metadata. I left one follow-up about preserving an accurate unmet-demand reason when allocation-tier filtering removes all otherwise-matching groups.

Generated with Codex.

matching_groups = [g for g in matching_groups if not _is_tier_blocked(g, pool_blocked)]

if not matching_groups:
    unmet.append(UnmetDemand(entry=entry, reason=_diagnose_no_matching_group(entry, sorted_groups)))
Contributor


P2 When allocation-tier filtering removes every hard-matching group, this falls into _diagnose_no_matching_group(...), which reports no_matching_group even though matching groups do exist and were only skipped because the pool is tier-blocked. I reproduced this with a tier-1 quota_exceeded group plus a tier-2 match: routing correctly refuses the tier-2 group, but the unmet reason becomes no groups match device=tpu:v5p-16, which is misleading for operators and the dashboard. Recommended fix: preserve the pre-filter match set and emit a tier-blocked / no-capacity reason when filtering is what emptied the candidate list.

Generated with Codex.

rjpower added 2 commits April 3, 2026 12:29
Resolve merge conflicts from upstream preemptible→capacity_type rename.
Update all tpu_pools YAML configs and test fixtures to use the new
capacity_type field. Regenerate protobuf files.
… list

When allocation-tier monotonicity removes all hard-matching groups, emit
a tier_blocked reason instead of falling through to no_matching_group.
Operators and the dashboard now see the actual cause rather than a
misleading "no groups match" message.
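The tier_blocked fallback this commit describes could be sketched as follows. The function and reason names are hypothetical, mirroring the review comment rather than the actual route_demand() code:

```python
def route_entry(entry, matching_groups, pool_blocked, is_tier_blocked):
    """Return (groups, None) on success or (None, reason) when unmet."""
    # Preserve the pre-filter match set so we can tell the two cases apart.
    pre_filter = list(matching_groups)
    if pool_blocked:
        matching_groups = [
            g for g in matching_groups if not is_tier_blocked(g, pool_blocked)
        ]
    if not matching_groups:
        # If tier filtering is what emptied the list, report tier_blocked
        # so operators see the real cause instead of "no groups match".
        reason = "tier_blocked" if pre_filter else "no_matching_group"
        return None, reason
    return matching_groups, None
```

The key design point is that the diagnosis depends on whether candidates existed before filtering, which is exactly the information the original code path discarded.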
@rjpower rjpower merged commit 6ffe272 into main Apr 3, 2026
36 of 37 checks passed
@rjpower rjpower deleted the claude/laughing-lalande branch April 3, 2026 19:43
Helw150 pushed a commit that referenced this pull request Apr 8, 2026