
[design] hpa overshoot demonstration#1019

Draft
lionelvillard wants to merge 1 commit into llm-d:main from lionelvillard:hpa-overshoot

Conversation

@lionelvillard
Collaborator

No description provided.

Signed-off-by: Lionel Villard <villard@us.ibm.com>
Copilot AI review requested due to automatic review settings April 16, 2026 13:55
Contributor

Copilot AI left a comment


Pull request overview

Adds a design note demonstrating how Kubernetes HPA scaling on external `Value` metrics can overshoot massively when pod startup time is long and pending pods are counted in `currentReplicas`.

Changes:

  • Introduces a worked example showing multiplicative (factorial) scale-up feedback with external Value metrics.
  • Adds a formal recurrence / growth analysis and an FAQ discussing common mitigations (stabilization windows, rate limits, target tuning).
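The feedback described above follows from the standard HPA v2 replica computation, `desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue)`. A minimal sketch (with hypothetical numbers) contrasting external `Value` and `AverageValue` targets — for `Value` the raw metric is compared against the target, so the result scales with `currentReplicas`; for `AverageValue` the metric is first divided by `currentReplicas`, which cancels that multiplicative feedback:

```python
import math

def desired_replicas_value(current, metric_value, target_value):
    """External metric, target type Value: ratio taken against the raw
    metric, so desired scales multiplicatively with currentReplicas."""
    return math.ceil(current * metric_value / target_value)

def desired_replicas_average_value(metric_value, target_average):
    """External metric, target type AverageValue: the metric is divided by
    currentReplicas first, cancelling the multiplicative feedback."""
    return math.ceil(metric_value / target_average)

# Hypothetical numbers: queue depth 500, target 100, 5 current replicas.
print(desired_replicas_value(current=5, metric_value=500, target_value=100))  # 25
print(desired_replicas_average_value(metric_value=500, target_average=100))   # 5
```

With identical inputs, the `Value` form asks for five times as many replicas simply because five replicas already exist — the seed of the overshoot.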

Comment on lines +75 to +85
The `Value` formula at HPA cycle `n` (time `t = nP`) is:

```
desired(n) = ⌈ desired(n−1) · g·nP/T ⌉
```

Each cycle multiplies the previous desired count by a growing factor `g·nP/T`. Unrolling the recurrence:

```
desired(n) ≈ ∏_{k=1}^{n} (g·kP/T) = (gP/T)^n · n!
```

Copilot AI Apr 16, 2026


The recurrence in the formal analysis uses desired(n−1) as the next cycle’s currentReplicas, which assumes the scale target is fully applied and reflected in scale.status.replicas by the next HPA sync (and that maxReplicas / behavior.scaleUp rate limits / cluster quota aren’t capping growth). Consider stating these assumptions explicitly (or framing the math as an upper-bound / worst-case) so readers don’t interpret the factorial growth as unconditional in all real clusters.
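Under exactly those assumptions (the scale target is fully applied by each sync; no `maxReplicas` or `behavior.scaleUp` cap), the recurrence can be checked numerically. A sketch with hypothetical values for `g`, `P`, and `T` — the `max_replicas` parameter shows how a cap turns the worst case into a bounded one:

```python
import math

def unroll(n, g, P, T, d0=1, max_replicas=None):
    """Iterate desired(k) = ceil(desired(k-1) * g*k*P / T) for k = 1..n,
    optionally clamping to maxReplicas as a real HPA would."""
    d = d0
    for k in range(1, n + 1):
        d = math.ceil(d * g * k * P / T)
        if max_replicas is not None:
            d = min(d, max_replicas)
    return d

# Hypothetical: load growth factor g=2, sync period P=15s, startup time T=60s.
for n in range(1, 8):
    print(n, unroll(n, g=2, P=15, T=60))  # 1, 1, 2, 4, 10, 30, 105
print(unroll(7, g=2, P=15, T=60, max_replicas=50))  # capped at 50
```

The uncapped sequence accelerates super-exponentially once `g·kP/T` exceeds 1, matching the `(gP/T)^n · n!` unrolling; any cap simply freezes it at the bound.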

| Time | Event | Running pods | Capacity | Queue |
|---|---|---|---|---|
| t=105 | Batch 3 ready (988 pods from t=45) | 1 053 | 10 530 req/s | queue empty, 1 043 pods idle |
| t=120 | Batch 4 ready (21 692 pods from t=60) | 22 745 | 227 450 req/s | queue empty, 22 735 pods idle |

Unlike the `AverageValue` case, the first batch (5 pods) is too small to match demand — capacity at t=75 (60 req/s) is still below the incoming rate (100 req/s), so the queue continues to grow. Only at t=90, when batch 2 arrives, does capacity finally exceed demand and the queue begins to drain. Meanwhile, 22 680 more pods are still spinning up with nothing to do. The system ends with **22 745 running pods serving 100 req/s** — 22 735 of them idle — until HPA's cooldown window allows a scale-down, which introduces yet another delay.

Copilot AI Apr 16, 2026


“cooldown window” isn’t an HPA term and could be read as a separate mechanism. Consider referencing the concrete knobs that delay scale-down (e.g., behavior.scaleDown.stabilizationWindowSeconds and/or scale-down policies) so the operational implication is unambiguous.

Suggested change
Unlike the `AverageValue` case, the first batch (5 pods) is too small to match demand — capacity at t=75 (60 req/s) is still below the incoming rate (100 req/s), so the queue continues to grow. Only at t=90, when batch 2 arrives, does capacity finally exceed demand and the queue begins to drain. Meanwhile, 22 680 more pods are still spinning up with nothing to do. The system ends with **22 745 running pods serving 100 req/s** — 22 735 of them idle — until HPA's cooldown window allows a scale-down, which introduces yet another delay.
Unlike the `AverageValue` case, the first batch (5 pods) is too small to match demand — capacity at t=75 (60 req/s) is still below the incoming rate (100 req/s), so the queue continues to grow. Only at t=90, when batch 2 arrives, does capacity finally exceed demand and the queue begins to drain. Meanwhile, 22 680 more pods are still spinning up with nothing to do. The system ends with **22 745 running pods serving 100 req/s** — 22 735 of them idle — until HPA scale-down behavior permits a reduction in replicas (for example, after `behavior.scaleDown.stabilizationWindowSeconds` and any scale-down policies), which introduces yet another delay.
