
Releases: pytorch/test-infra

v20250317-134413

17 Mar 13:46
9f66a9e
Adds scaleUpHealing chron (#6412)

# TLDR 

This change introduces a new lambda, `${var.environment}-scale-up-chron`,
along with all the TypeScript code and required Terraform changes.

# What is changing?

This PR introduces the TypeScript code for the new lambda and the related
Terraform changes to run it every 30 minutes. The lambda should time out
after 15 minutes. Its permissions and access should be the same as those of
scaleUp.

It queries HUD at a URL specified by the user configuration
`retry_scale_up_chron_hud_query_url`, retrieves a list of instance types
and the number of jobs enqueued for each, and then synchronously tries to
deploy those runners.
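
As a rough illustration of that flow, the sketch below fetches the HUD
endpoint and provisions runners for each stale queue entry. The response
field names, the 30-minute filter, and the `provisionRunners` helper are
assumptions for illustration, not the actual lambda code.

```typescript
// Illustrative sketch only: field names and the provisioning helper are
// hypothetical, not the real HUD schema or scaleUp internals.
interface QueuedJobCount {
  runnerType: string;          // hypothetical field name
  queuedJobs: number;          // hypothetical field name
  minQueueTimeMinutes: number; // hypothetical field name
}

// Stand-in for the existing runner provisioning path used by scaleUp.
async function provisionRunners(runnerType: string, count: number): Promise<void> {
  console.log(`would provision ${count} runner(s) of type ${runnerType}`);
}

export async function scaleUpChron(hudQueryUrl: string): Promise<void> {
  const response = await fetch(hudQueryUrl);
  const queued = (await response.json()) as QueuedJobCount[];

  // Only recover jobs that have waited longer than the 30-minute threshold.
  const stale = queued.filter((q) => q.minQueueTimeMinutes >= 30);

  // Deploy synchronously, one instance type at a time.
  for (const entry of stale) {
    await provisionRunners(entry.runnerType, entry.queuedJobs);
  }
}
```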

It introduces 2 new parameters in the main module:
* `retry_scale_up_chron_hud_query_url`, which for now should point to
https://hud.pytorch.org/api/clickhouse/queued_jobs_aggregate?parameters=%5B%5D
only in the installations that will benefit from it (both the Meta and Linux
Foundation PROD clusters, NOT canary); when this variable is set to the
empty string (the default), this cron is not installed.
* `scale_config_org`, which should point to the org where the scale-config
files are defined. In our case it is `pytorch`.

[example of the
change](https://github.com/pytorch-labs/pytorch-gha-infra/pull/622/files)
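
For a sense of how the lambda might consume these two parameters, here is a
minimal sketch assuming the Terraform module passes them through as
environment variables; the variable names below are hypothetical.

```typescript
// Hypothetical environment variable names; the real wiring is defined in the
// Terraform module, not here.
interface ChronConfig {
  hudQueryUrl: string;    // from retry_scale_up_chron_hud_query_url
  scaleConfigOrg: string; // from scale_config_org
}

export function loadChronConfig(): ChronConfig | undefined {
  const hudQueryUrl = process.env.RETRY_SCALE_UP_CHRON_HUD_QUERY_URL ?? '';
  const scaleConfigOrg = process.env.SCALE_CONFIG_ORG ?? '';

  // An empty URL (the default) means the cron is not installed and should no-op.
  if (hudQueryUrl === '') {
    return undefined;
  }
  return { hudQueryUrl, scaleConfigOrg };
}
```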

# Why are we changing this?

We're introducing this change to provide a way to recover lost
infra-scaling requests. It has been clear for a while that during GitHub
API outages we fail to receive new-job webhooks or fail to provision new
runners. Most of the time our retry mechanism is capable of dealing with
the situation, but when we are not receiving webhooks, or when other more
esoteric problems occur, there is no way to recover.

With this change, every 30 minutes, jobs that have been enqueued for longer
than 30 minutes for one of the autoscaled instance types trigger the
creation of those instances.

A few design decisions:
1 - Why rely on HUD?
HUD already has this information, so it is simple to just get it from
there;

2 - Why not send a scale message and let scaleUp handle it?
We want isolation, so that we can easily circuit-break the creation of
enqueued instances. This isolation also guarantees that if the scaler is
failing to deploy a given instance type, this mechanism won't risk flooding
or overflowing the main scaler, which has to deal with all the other types.

3 - Why randomize the instance creation order?
So that if some instance type is problematic, we are not completely
preventing the recovery of other instance types (just interfering with it).
We also gain some time between creations of instances of the same type,
allowing for smoother operation (see the sketch after this list).

4 - Why a new lambda?
See number 2.
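
Regarding decision 3, the randomization can be as simple as shuffling the
list of queue entries before provisioning; the sketch below is illustrative,
not the lambda's exact code.

```typescript
// Minimal Fisher-Yates shuffle used to randomize the processing order,
// so one failing instance type does not always block the others.
function shuffleInPlace<T>(items: T[]): T[] {
  for (let i = items.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [items[i], items[j]] = [items[j], items[i]];
  }
  return items;
}

// e.g. shuffleInPlace(stale) before the provisioning loop shown earlier.
```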

# What if something goes wrong?
Given the work we put into maximizing the isolation between the regular
scaler and the cron recovery scaler introduced here, we're not foreseeing
any gaps that could break the main scaler and, as a consequence, cause
system breakage.

Having said that, if you need to revert these changes from production,
just follow the steps in:
https://docs.google.com/document/d/1nq3dx-_8wasii1koCkXJDSo3uz_0Ee8DzIS2-j2TOpA/edit?tab=t.0#heading=h.jwflgevrww4j

---------

Co-authored-by: Zain Rizvi <[email protected]>
Co-authored-by: Camyll Harajli <[email protected]>

v20250313-185750

13 Mar 19:00
b3adb27
Adds additional tests to getRunnerTypes, simplifies code a bit, adds …

v20250310-124810

10 Mar 12:50
fc07220
Reuse Ephemeral runners (#6315)

# About

With the goal of eventually moving all instances to being ephemeral, we
need to fix the major limitation we have with ephemeral instances:
stockouts.

This is a problem because we currently release the instances as soon as
they finish a job.

The goal is to reuse instances before returning them to AWS by:

* Tagging ephemeral instances that finished a job with
`EphemeralRunnerFinished=finish_timestamp`, hinting to scaleUp that they
can be reused;
* scaleUp finds instances that have the `EphemeralRunnerFinished` tag and
tries to use them to run a new job;
* scaleUp acquires a lock on the instance name to avoid concurrent reuse;
* scaleUp marks re-deployed instances with an
`EBSVolumeReplacementRequestTm` tag recording when the instance was marked
for reuse;
* scaleUp removes `EphemeralRunnerFinished` so others won't find the same
instance for reuse;
* scaleUp creates the necessary SSM parameters and returns the instance to
a fresh state by restoring its EBS volume (a sketch of this tag handshake
follows the ScaleDown note below).

ScaleDown then:
* Avoids removing ephemeral instances before `minRunningTime`, using either
the creation time, `EphemeralRunnerFinished`, or
`EBSVolumeReplacementRequestTm`, depending on the instance status.
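
The sketch below shows one way the tag handshake above could look with the
AWS SDK v3 EC2 client. The locking, SSM parameter creation, and EBS volume
restore steps are omitted, and the code is an illustrative assumption
rather than the actual scaleUp implementation.

```typescript
import {
  EC2Client,
  DescribeInstancesCommand,
  CreateTagsCommand,
  DeleteTagsCommand,
} from '@aws-sdk/client-ec2';

const ec2 = new EC2Client({});

export async function findAndMarkReusableInstance(): Promise<string | undefined> {
  // Find ephemeral instances that finished a job and were tagged as reusable.
  const described = await ec2.send(
    new DescribeInstancesCommand({
      Filters: [
        { Name: 'tag-key', Values: ['EphemeralRunnerFinished'] },
        { Name: 'instance-state-name', Values: ['running'] },
      ],
    }),
  );
  const instanceId = described.Reservations?.[0]?.Instances?.[0]?.InstanceId;
  if (!instanceId) {
    return undefined;
  }

  // Record when the instance was marked for reuse...
  await ec2.send(
    new CreateTagsCommand({
      Resources: [instanceId],
      Tags: [{ Key: 'EBSVolumeReplacementRequestTm', Value: `${Date.now()}` }],
    }),
  );

  // ...and drop the reuse hint so no other caller picks the same instance.
  await ec2.send(
    new DeleteTagsCommand({
      Resources: [instanceId],
      Tags: [{ Key: 'EphemeralRunnerFinished' }],
    }),
  );

  return instanceId;
}
```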

# Disaster recovery plan:

If this PR introduces breakages, they will most certainly be related to the
capacity to deploy new instances/runners rather than to any different
behaviour in the runner itself.

So, after reverting this change, it is important to make sure the runner
queue is under control. This can be accomplished by checking the queue size
on [hud metrics](https://hud.pytorch.org/metrics) and running the
[send_scale_message.py](https://github.com/pytorch-labs/pytorch-gha-infra/blob/main/scale_tools/send_scale_message.py)
script to make sure those instances are properly deployed by the stable
version of the scaler.

## Step by step to revert this change from **META**

1 - Identify whether this PR is causing the observed problem: [look at the
queue size](https://hud.pytorch.org/metrics) and check whether it is
related to the impacted runners (the ephemeral ones); it can also help to
investigate the [metrics on
unidash](https://www.internalfb.com/intern/unidash/dashboard/aws_infra_monitoring_for_github_actions/lambda_scaleup)
and the
[logs](https://us-east-1.console.aws.amazon.com/lambda/home?region=us-east-1#/functions/gh-ci-scale-up?tab=monitoring)
related to the scaleUp lambda;

2 - If you confirm that this PR is the source of the problem, revert it
from main to make sure it cannot cause impact again if someone working on
other changes accidentally releases a version of test-infra that contains
it.

3 - To restore the infrastructure to its state before this change:

A) Find the commit (or, unlikely, more than one) on pytorch-gha-infra that
points to a release version of test-infra containing this change (it will
most likely be the latest). It will be a change updating the Terrafile to
point to a newer version of test-infra
([example](https://github.com/pytorch-labs/pytorch-gha-infra/commit/c4e888f58441b18a0fd6e19a1b935667750c6ba2)).
By convention such commits are named `Release vDATE-TIME`, for example
`Release v20250204-163312`.

B) Revert that commit from
https://github.com/pytorch-labs/pytorch-gha-infra

C) Follow [the
steps](https://docs.google.com/document/d/1nq3dx-_8wasii1koCkXJDSo3uz_0Ee8DzIS2-j2TOpA/edit?tab=t.0#heading=h.vj4fvy46wzwk)
outlined in the PyTorch GHA Infra runbook;

D) That document contains pointers for monitoring and verifying recovery in
the metrics / queue / logs you identified, and for confirming that the
system has recovered;

4 - Restore the user experience:

A) If you have access, follow the [instructions on how to recover ephemeral
queued
jobs](https://docs.google.com/document/d/1nq3dx-_8wasii1koCkXJDSo3uz_0Ee8DzIS2-j2TOpA/edit?tab=t.0#heading=h.ba0nyrda8jch)
in the above-mentioned document;

B) Another option is to cancel the queued jobs and trigger them again.

v20250306-173054

06 Mar 17:33
52e4e56
Adding tooling and documentation for locally run tflint (#6370)

Created a Makefile in `./terraform-aws-github-runner` to perform tflint
actions, and replaced the tflint calls in CI (`tflint.yml`) with this
Makefile.

This makes it much easier to test locally and get green signals on CI,
reducing the loop time for fixing small syntax bugs.

v20250305-171119

05 Mar 17:13
ed8eab9
[Bugfix] wait for ssm parameter to be created (#6359)

Sometimes the SSM parameter is not properly created. After investigation I
identified that the promise was not being properly awaited, which could
cause some operations to be canceled.
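
As a minimal sketch of the kind of fix described (assuming the AWS SDK v3
SSM client; the parameter name and value are illustrative), the call that
creates the parameter must be awaited so the lambda cannot return before
the parameter exists:

```typescript
import { SSMClient, PutParameterCommand } from '@aws-sdk/client-ssm';

const ssm = new SSMClient({});

export async function createRunnerParameter(name: string, value: string): Promise<void> {
  // Before the fix, a call like this could be issued without `await`,
  // letting the lambda be suspended or finish before the request completed.
  await ssm.send(
    new PutParameterCommand({
      Name: name,
      Value: value,
      Type: 'SecureString',
      Overwrite: true,
    }),
  );
}
```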

v20250205-165758

05 Feb 17:00
20250205175711

v20250205-163646

05 Feb 16:37
20250205173601

v20250205-163308

05 Feb 16:34
20250205173224

v20250205-161117

05 Feb 16:13
Adds ci-queue-pct lambda code to aws/lambdas and include it to the re…

v20250204-171328

04 Feb 17:15
88e4f1e
Adding documentation to help develop ALI lambdas and some useful scri…