
Fix band percentage math in fleet overview#4688

Open
Helw150 wants to merge 1 commit into main from will/band-computation-fix

Conversation


@Helw150 Helw150 commented Apr 13, 2026

Bands were counted per-slice as 'any VM in this band', so slices running multiple bands were double-counted and percentages could sum past 100% (e.g. 100% batch + 38% interactive). Count VMs in their dominant band and divide by total VMs instead, so band shares partition capacity.


claude bot commented Apr 13, 2026

Claude finished @Helw150's task in 2m 1s


Review of PR #4688: Fix band percentage math in fleet overview

  • Read the diff and changed file
  • Analyze the band percentage logic
  • Post review feedback

Verdict: Looks good

The bug diagnosis is correct and the fix is clean. Here's my analysis:

The bug

The old code counted bands per-slice: for each ready slice, it collected unique bands across all VMs in that slice, then incremented each band's counter by 1. A slice running tasks in both batch and interactive bands was counted once for each, so percentages (computed as bandCount / totalSlices * 100) could exceed 100%.
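The double-counting is easy to reproduce. Here is a minimal sketch of the old per-slice counting; the names (`Slice`, `Vm`, `oldBandPercentages`) are illustrative, not the actual code in the changed file:

```typescript
// Hypothetical reconstruction of the buggy per-slice band counting.
interface Vm { bands: string[] }
interface Slice { vms: Vm[] }

function oldBandPercentages(slices: Slice[]): Map<string, number> {
  const bandCounts = new Map<string, number>();
  for (const slice of slices) {
    // Collect the unique bands across all VMs in this slice...
    const bands = new Set<string>();
    for (const vm of slice.vms) for (const b of vm.bands) bands.add(b);
    // ...and give each band a full +1 for the slice. A slice running
    // two bands is therefore counted twice.
    for (const b of bands) bandCounts.set(b, (bandCounts.get(b) ?? 0) + 1);
  }
  const pct = new Map<string, number>();
  for (const [b, n] of bandCounts) pct.set(b, (n / slices.length) * 100);
  return pct;
}

// Two slices; the first runs both bands, so shares sum to 150%.
const slices: Slice[] = [
  { vms: [{ bands: ["batch", "interactive"] }] },
  { vms: [{ bands: ["batch"] }] },
];
const pct = oldBandPercentages(slices);
// pct: batch 100%, interactive 50%
```

With the mixed slice counted once per band, the percentages no longer partition anything, which is exactly the 100% + 38% symptom from the PR description.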

The fix

The new code counts at VM granularity instead — each VM is assigned to its dominant band (the band with the most running tasks on that worker). Since each VM is counted at most once, band shares now partition VM capacity cleanly.

Key changes:

  • New vmTotal field — total VMs across ready slices, used as the denominator for band percentages instead of total (which counts slices).
  • Dominant-band selection — iterates workerBands for each VM and picks the band with the highest task count. VMs without a workerId or without any bands contribute to vmTotal but not to any band, so the "unallocated" capacity shows implicitly as the remainder below 100%.
  • Template fix (AutoscalerTab.vue:853) — divides by c.vmTotal instead of c.total, matching the new VM-level numerator.
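The dominant-band selection described above can be sketched as follows; this is a hedged reconstruction, assuming `workerBands` maps `workerId` to a per-band running-task count, and all names are illustrative:

```typescript
// Pick the band with the most running tasks for one VM.
// Returns undefined when the VM has no workerId or no bands, so the
// VM contributes to vmTotal but to no band (the implicit remainder).
function dominantBand(
  workerId: string | undefined,
  workerBands: Map<string, Map<string, number>>,
): string | undefined {
  if (!workerId) return undefined;
  const bands = workerBands.get(workerId);
  if (!bands || bands.size === 0) return undefined;
  let best: string | undefined;
  let bestCount = -1;
  for (const [band, count] of bands) {
    // Strict > means ties resolve to the first band in Map
    // iteration order, matching the edge-case table below.
    if (count > bestCount) {
      best = band;
      bestCount = count;
    }
  }
  return best;
}

// Example: a worker running 3 batch tasks and 1 interactive task.
const wb = new Map<string, Map<string, number>>([
  ["w1", new Map([["batch", 3], ["interactive", 1]])],
]);
// dominantBand("w1", wb) picks "batch"
```

Because each VM maps to at most one band, summing band counts and dividing by `vmTotal` can never exceed 100%.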

Edge cases I checked

| Case | Behavior | OK? |
| --- | --- | --- |
| VM with no `workerId` | Counted in `vmTotal`, not in any band | Yes; correctly shows idle capacity |
| VM with tasks but `bands.size === 0` | Skipped via `continue` | Yes |
| Tie between bands (e.g., 3 batch + 3 interactive) | First band in Map iteration order wins | Fine for dashboard display |
| Single-VM slices | `vmTotal === total`, same as before | Yes |
| Multi-VM slices (TPU pods) | `vmTotal > total`, percentages reflect per-VM reality | Yes; correct semantics for band assignment |

Minor nit (non-blocking)

The inline type on line 586 is getting long:

```ts
new Map<string, { total: number; inUse: number; vmTotal: number; uptimes: number[]; regions: Map<string, number>; bands: Map<string, number>; capacityByRegion: Map<string, { statuses: string[]; failures: number }> }>()
```

Consider extracting to a named interface (like FleetChipAccumulator) if this grows further, but it's fine as-is for now.
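A possible extraction along the lines of that suggestion, using the reviewer's proposed `FleetChipAccumulator` name (the nested type's name, `RegionCapacity`, is my own invention):

```typescript
// Named interfaces replacing the long inline Map value type.
interface RegionCapacity {
  statuses: string[];
  failures: number;
}

interface FleetChipAccumulator {
  total: number;      // slices
  inUse: number;
  vmTotal: number;    // VMs, the new band-percentage denominator
  uptimes: number[];
  regions: Map<string, number>;
  bands: Map<string, number>;
  capacityByRegion: Map<string, RegionCapacity>;
}

// The accumulator map then reads as a one-liner; "v5e-16" is an
// illustrative chip-type key, not from the actual code.
const acc = new Map<string, FleetChipAccumulator>();
acc.set("v5e-16", {
  total: 0,
  inUse: 0,
  vmTotal: 0,
  uptimes: [],
  regions: new Map(),
  bands: new Map(),
  capacityByRegion: new Map(),
});
```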


@Helw150 Helw150 force-pushed the will/band-computation-fix branch from 39e0a66 to 9b494c7 on April 13, 2026 03:27
Bands were counted per-slice as 'any VM in this band', so slices running
multiple bands were double-counted and percentages could sum past 100%
(e.g. 100% batch + 38% interactive). Assign each slice to a single
dominant band based on task counts across its VMs, matching the
slice-level 'in use' percentage so shares partition in-use slices.
@Helw150 Helw150 force-pushed the will/band-computation-fix branch from 9b494c7 to dc8240a on April 13, 2026 03:28