
v20250317-134413

Latest
github-actions released this 17 Mar 13:46 · 28 commits to main since this release
9f66a9e
Adds scaleUpHealing chron (#6412)

# TLDR 

This change introduces a new lambda, `${var.environment}-scale-up-chron`, along with all of the TypeScript code and required Terraform changes.

# What is changing?

This PR introduces the TypeScript code for the new lambda and the related
Terraform changes to run it every 30 minutes. The lambda times out after
15 minutes. Its permissions and access are the same as those of scaleUp.

It queries HUD at a URL specified in the user configuration
`retry_scale_up_chron_hud_query_url` and gets back a list of instance types
and the number of jobs enqueued for each. It then synchronously tries to
deploy those runners.
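
For illustration, here is a minimal sketch of that flow, not the actual lambda code: it assumes the HUD endpoint returns a JSON array of rows with a runner label and a queued-job count, and the field names `machine_type` and `count` as well as the helper `createRunners` are assumptions rather than the real schema or API.

```typescript
// Hypothetical sketch of the scale-up-chron query-and-provision loop.
// Assumes Node 18+ (global fetch) and an assumed HUD response schema.

interface QueuedJobsRow {
  machine_type: string; // autoscaled instance type / runner label (assumed field name)
  count: number; // number of jobs currently enqueued for that type (assumed field name)
}

async function getQueuedJobs(hudQueryUrl: string): Promise<QueuedJobsRow[]> {
  const response = await fetch(hudQueryUrl);
  if (!response.ok) {
    throw new Error(`HUD query failed: ${response.status} ${response.statusText}`);
  }
  return (await response.json()) as QueuedJobsRow[];
}

// Synchronously (one instance type at a time) attempt to provision the missing runners.
async function scaleUpChron(hudQueryUrl: string): Promise<void> {
  const rows = await getQueuedJobs(hudQueryUrl);
  for (const row of rows) {
    await createRunners(row.machine_type, row.count);
  }
}

// Stand-in for whatever provisioning path scaleUp already uses.
declare function createRunners(instanceType: string, count: number): Promise<void>;
```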

It introduces 2 new parameters in the main module:
* `retry_scale_up_chron_hud_query_url`, which for now should point to
https://hud.pytorch.org/api/clickhouse/queued_jobs_aggregate?parameters=%5B%5D
only in the installations that will benefit from it (both the Meta and Linux
Foundation PROD clusters, NOT canary). When this variable is set to the empty
string (the default), this cron is not installed.
* `scale_config_org`, which should point to the org where the scale-config
files are defined. In our case it is `pytorch`.
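
As a rough sketch of how the lambda side could consume these parameters (the environment variable names below are assumptions made for illustration; the actual wiring lives in the Terraform module):

```typescript
// Hypothetical config plumbing for the new lambda; variable names are assumed.

interface ScaleUpChronConfig {
  retryScaleUpChronHudQueryUrl: string; // empty string disables the recovery cron
  scaleConfigOrg: string; // org that hosts the scale-config files, e.g. "pytorch"
}

function loadConfig(): ScaleUpChronConfig {
  return {
    retryScaleUpChronHudQueryUrl: process.env.RETRY_SCALE_UP_CHRON_HUD_QUERY_URL ?? '',
    scaleConfigOrg: process.env.SCALE_CONFIG_ORG ?? '',
  };
}
```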

[example of the
change](https://github.com/pytorch-labs/pytorch-gha-infra/pull/622/files)

# Why are we changing this?

We're introducing this change to provide a way to recover lost infra-scaling
requests. It has been clear for a while that during GitHub API outages we fail
to receive new job webhooks or fail to provision new runners. Most of the time
our retry mechanism can deal with the situation, but in cases where we are not
receiving webhooks at all, or hit other more esoteric problems, there is no way
to recover.

With this change, every 30 minutes, jobs that have been enqueued for longer
than 30 minutes on one of the autoscaled instance types will trigger the
creation of those instances.
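
A minimal sketch of that threshold check, assuming each HUD row carries the queue time of its longest-waiting job in seconds (`queue_s` is an assumed field name):

```typescript
// Hypothetical filter for rows whose jobs have waited past the 30-minute threshold.
const MIN_QUEUE_TIME_SECONDS = 30 * 60;

interface QueuedJobsRow {
  machine_type: string;
  count: number;
  queue_s: number; // seconds the longest-waiting job has been enqueued (assumed field name)
}

function rowsNeedingRecovery(rows: QueuedJobsRow[]): QueuedJobsRow[] {
  return rows.filter((row) => row.queue_s >= MIN_QUEUE_TIME_SECONDS);
}
```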

A few design decisions:
1 - Why rely on HUD?
HUD already has this information, so it should be simple to just get it from
there.

2 - Why not send a scale message and let scaleUp handle it?
We want isolation, so that we can easily circuit-break the creation of
enqueued instances. This isolation also guarantees that if the scaler is
failing to deploy a given instance type, this mechanism won't risk flooding or
overflowing the main scaler, which has to deal with all the other instance
types.

3 - Why randomize the instance creation order? (see the sketch after this list)
So that if some instance type is problematic, we don't completely prevent the
recovery of other instance types (we just interfere with it). We also gain
some time between creations of instances of the same type, allowing for
smoother operation.

4 - Why a new lambda?
See number 2.
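
As a concrete illustration of decision 3, a minimal Fisher-Yates shuffle sketch (not the actual implementation) for randomizing the creation order:

```typescript
// In-place Fisher-Yates shuffle so no single problematic instance type always goes first.
function shuffleInPlace<T>(items: T[]): T[] {
  for (let i = items.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [items[i], items[j]] = [items[j], items[i]];
  }
  return items;
}

// Usage: shuffle the queued instance types before creating runners one by one, e.g.
// const order = shuffleInPlace(rowsNeedingRecovery(rows));
```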

# What if something goes wrong?
Given that we have done as much as possible to ensure maximal isolation
between the regular scaler and the cron recovery scaler we're introducing, we
are not foreseeing any gaps that could break the main scaler and, as a
consequence, cause system breakages.

Having said that, if you need to revert those changes from production,
just follow the steps:
https://docs.google.com/document/d/1nq3dx-_8wasii1koCkXJDSo3uz_0Ee8DzIS2-j2TOpA/edit?tab=t.0#heading=h.jwflgevrww4j

---------

Co-authored-by: Zain Rizvi <[email protected]>
Co-authored-by: Camyll Harajli <[email protected]>