Releases: pytorch/test-infra

v20250418-183612

18 Apr 18:38
239b94f
Fix env variable bug (#6539)

v20250416-182256

16 Apr 18:25
501c023
Update NVIDIA driver to 570.133.07 when launching the runner (#6532)

This is the second part of
https://github.com/pytorch/test-infra/pull/6530

v20250416-162409

16 Apr 16:27
676e79b
Add experiment/stress test code (#6520)

As part of the infra stress test, we want to introduce the capability of
running code experiments leveraging Redis.

This also introduces the checks we are planning to perform for the infra
stress test.

---------

Co-authored-by: Zain Rizvi <[email protected]>

v20250409-224100

09 Apr 22:42
bb7f74f
[type] Disable the local test without verify (#6519)

This should be commented out since it is only used for some testing.

v20250408-180855

08 Apr 18:10
972fc89
[tritonbench] Add triton compile time dashboard (#6513)

Located at `/tritonbench/compile_time`.

Data source: Tritonbench nightly run:
https://github.com/pytorch-labs/tritonbench/actions/workflows/compile-time.yaml

Test plan:

<img width="1039" alt="image"
src="https://github.com/user-attachments/assets/be47f065-cd6e-4e73-8803-e21fe0e898b3"
/>



cc @Jokeren

---------

Co-authored-by: Huy Do <[email protected]>

v20250404-204324

04 Apr 20:45
4cedc0d
[Queue Time Histogram] Add Histogram Generator (#6504)

Adds the histogram generator and the logic to write to the db table.

Removes the S3 boto code since we upload to the db table directly. It
currently writes to the fortesting db for testing. Once that is
complete, it will change to the misc one.

v20250327-194722

27 Mar 19:49
93ac4ec
Fix scaleUpChron environment variables (#6479)

The `scaleUpChron` lambda requires the same environment variables as the
`scaleUp` lambda, but by mistake that wasn't the case.

This PR simply makes sure that all relevant environment variables are
correctly set.

v20250327-191353

27 Mar 19:16
38d53d7
[ALI] Fix concurrency issues for tryReuse on scaleUp (#6477)

As noted in [this](https://github.com/pytorch/test-infra/issues/6473)
issue, there is a concurrency problem between tryReuse on scaleUp and
scaleDown.

This PR addresses this by making sure `tryReuse` will not use 'stale'
runners (older than a certain age), while scaleDown will only remove
the ones older than a certain time (that logic was already implemented).

Note that for this PR to work properly, the TF variable
`minimum_running_time_in_minutes` is expected to be increased, ideally
to 45 minutes or more, I believe.
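
The two age checks described above can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: `RunnerInfo`, `isReusable`, `isRemovable` and `maxReuseAgeMinutes` are hypothetical names, and the idea that the reuse cutoff sits below `minimum_running_time_in_minutes` is an assumption about how the two windows are meant to stay apart.

```typescript
// Hypothetical runner record; the real type in test-infra may differ.
interface RunnerInfo {
  instanceId: string;
  launchTimeMs: number; // launch time as epoch milliseconds
}

function ageInMinutes(runner: RunnerInfo, nowMs: number): number {
  return (nowMs - runner.launchTimeMs) / 60000;
}

// tryReuse side: skip 'stale' runners, i.e. those at or past the reuse cutoff.
function isReusable(
  runner: RunnerInfo,
  nowMs: number,
  maxReuseAgeMinutes: number, // hypothetical parameter
): boolean {
  const age = ageInMinutes(runner, nowMs);
  return age >= 0 && age < maxReuseAgeMinutes;
}

// scaleDown side: only remove runners that have been up long enough.
function isRemovable(
  runner: RunnerInfo,
  nowMs: number,
  minimumRunningTimeMinutes: number, // mirrors minimum_running_time_in_minutes
): boolean {
  return ageInMinutes(runner, nowMs) >= minimumRunningTimeMinutes;
}
```

Under that assumption (reuse cutoff no larger than the minimum running time), no runner can be a reuse candidate and a removal candidate at the same time, which is the race the linked issue describes.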

v20250317-134413

17 Mar 13:46
9f66a9e
Adds scaleUpHealing chron (#6412)

# TLDR 

This change introduces a new lambda, `${var.environment}-scale-up-chron`,
along with all the typescript code and required terraform changes.

# What is changing?

This PR introduces the typescript code for the new lambda, and the
related terraform changes to run the lambda every 30 minutes. The lambda
should time out after 15 minutes. Its permissions and access should be
the same as those of scaleUp.

It queries hud at the URL specified in the user configuration
`retry_scale_up_chron_hud_query_url` and gets a list of instance types
and the number of jobs enqueued. It then synchronously tries to deploy
those runners.
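
As a rough sketch of the first half of that flow, the snippet below turns a hud response body into a flat list of runner creations to attempt one by one. The JSON field names `runner_label` and `num_queued_jobs`, and the helpers `parseQueuedJobs` and `buildCreationPlan`, are assumptions made for illustration; the actual response shape and lambda code may differ.

```typescript
// Assumed row shape after parsing; field names are illustrative only.
interface QueuedJobsEntry {
  runnerLabel: string;
  numQueuedJobs: number;
}

// Parse the response body, keeping only well-formed rows with work to do.
function parseQueuedJobs(body: string): QueuedJobsEntry[] {
  const raw: unknown = JSON.parse(body);
  if (!Array.isArray(raw)) {
    throw new Error("expected a JSON array of queued-job rows");
  }
  const entries: QueuedJobsEntry[] = [];
  for (const row of raw) {
    if (typeof row !== "object" || row === null) continue;
    const record = row as Record<string, unknown>;
    const label = record["runner_label"]; // assumed field name
    const count = record["num_queued_jobs"]; // assumed field name
    if (typeof label === "string" && typeof count === "number" && count > 0) {
      entries.push({ runnerLabel: label, numQueuedJobs: count });
    }
  }
  return entries;
}

// Expand the aggregate into one entry per runner to create.
function buildCreationPlan(entries: QueuedJobsEntry[]): string[] {
  const plan: string[] = [];
  for (const entry of entries) {
    for (let i = 0; i < entry.numQueuedJobs; i++) {
      plan.push(entry.runnerLabel);
    }
  }
  return plan;
}
```

The lambda would then walk such a plan and attempt each creation in turn. That step is left out here because it depends on the scaler's internal provisioning code.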

It introduces 2 new parameters in the main module:
* `retry_scale_up_chron_hud_query_url`, which for now should point to
https://hud.pytorch.org/api/clickhouse/queued_jobs_aggregate?parameters=%5B%5D
only in the installations that will benefit from it (both the Meta and
Linux Foundation PROD clusters, NOT canary). When this variable is set
to an empty string (the default), this cron is not installed.
* `scale_config_org`, which should point to the org where the
scale-config files are defined. In our case it is `pytorch`.

[example of the
change](https://github.com/pytorch-labs/pytorch-gha-infra/pull/622/files)

# Why are we changing this?

We're introducing this change to help recover lost requests for infra
scaling. It has been evident for a while that when there are GitHub API
outages, we fail to receive webhooks for new jobs or fail to provision
new runners. Most of the time our retry mechanism is capable of dealing
with the situation. But in cases where we are not receiving webhooks,
or with other more esoteric problems, there is no way to recover.

With this change, every 30 minutes, jobs that have been enqueued for
longer than 30 minutes for one of the autoscaled instance types will
trigger the creation of those instances.
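
The selection rule in the paragraph above can be illustrated with a small sketch. It assumes a per-job view with a queue timestamp, which is not necessarily how hud reports the data; `QueuedJob`, `countStuckJobs` and `autoscaledLabels` are hypothetical names.

```typescript
// Hypothetical per-job record, used only for this illustration.
interface QueuedJob {
  runnerLabel: string;
  queuedAtMs: number; // time the job entered the queue, epoch milliseconds
}

// Count, per autoscaled runner label, the jobs that have waited longer
// than the threshold (30 minutes by default, as described above).
function countStuckJobs(
  jobs: QueuedJob[],
  autoscaledLabels: Set<string>,
  nowMs: number,
  thresholdMinutes: number = 30,
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const job of jobs) {
    if (!autoscaledLabels.has(job.runnerLabel)) continue; // not autoscaled
    const waitedMinutes = (nowMs - job.queuedAtMs) / 60000;
    if (waitedMinutes <= thresholdMinutes) continue; // not stuck yet
    counts.set(job.runnerLabel, (counts.get(job.runnerLabel) ?? 0) + 1);
  }
  return counts;
}
```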

A few design decisions:
1 - Why rely on hud?
Hud already has this information, so it should be simple to
just get it from there.

2 - Why not send a scale message and allow scaleUp to handle it?
We want isolation, so that we can easily circuit-break the creation of
enqueued instances. This isolation also guarantees that if the scaler is
failing to deploy a given instance type, this mechanism won't risk
flooding the main scaler, which has to deal with all the other ones.

3 - Why randomise the instance creation order?
So that if some instance type is problematic, we are not completely
preventing the recovery of other instance types (just interfering with
it). We also gain some time between instance creations of the same type,
allowing for a smoother operation.

4 - Why a new lambda?
See number 2.
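
To illustrate design decision 3, a randomised creation order can be produced with a Fisher-Yates shuffle over the list of pending runner creations. This is only a sketch of the general technique; the PR may randomise differently, and `shuffled` is a hypothetical helper name.

```typescript
// Return a shuffled copy of the input without mutating it.
// `random` is injectable so the behaviour can be made deterministic in tests.
function shuffled<T>(
  items: readonly T[],
  random: () => number = Math.random,
): T[] {
  const result = items.slice();
  for (let i = result.length - 1; i > 0; i--) {
    // Pick an index in [0, i] and swap it into position i.
    const j = Math.floor(random() * (i + 1));
    const tmp = result[i];
    result[i] = result[j];
    result[j] = tmp;
  }
  return result;
}
```

Shuffling the flat list of creations, rather than the list of instance types, also spreads out creations of the same type, which matches the "smoother operation" point above.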

# If something goes wrong?
Given that we did as much work as possible to ensure maximal isolation
between the regular scaler and the cron recovery scaler that we're
introducing, we don't foresee any gaps that could break the main scaler
and, as a consequence, cause system breakages.

Having said that, if you need to revert those changes from production,
just follow the steps:
https://docs.google.com/document/d/1nq3dx-_8wasii1koCkXJDSo3uz_0Ee8DzIS2-j2TOpA/edit?tab=t.0#heading=h.jwflgevrww4j

---------

Co-authored-by: Zain Rizvi <[email protected]>
Co-authored-by: Camyll Harajli <[email protected]>

v20250313-185750

13 Mar 19:00
b3adb27
Adds additional tests to getRunnerTypes, simplifies code a bit, adds …