Releases · pytorch/test-infra
v20250418-183612
Fix env variable bug (#6539)
v20250416-182256
Update NVIDIA driver to 570.133.07 when launching the runner (#6532) This is the second part of https://github.com/pytorch/test-infra/pull/6530
v20250416-162409
Add experiment/stress test code (#6520) As part of the infra stress test, we want to introduce the capability of running code experiments leveraging Redis. This also introduces the checks we plan to perform for the infra stress test. --------- Co-authored-by: Zain Rizvi <[email protected]>
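A minimal sketch of what a Redis-backed experiment gate and stress-test check could look like, assuming the `ioredis` client; the key names and helper functions are illustrative and not taken from the PR.

```typescript
// Sketch only: key names and helpers are assumptions, not the PR's actual code.
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

// Returns true when the named experiment has been switched on in Redis.
export async function isExperimentEnabled(name: string): Promise<boolean> {
  const value = await redis.get(`experiments:${name}`);
  return value === 'enabled';
}

// Example of a check recorded during a stress-test run.
export async function recordStressTestCheck(runId: string, passed: boolean): Promise<void> {
  // Keep results for one run in a single hash so they can be inspected together.
  await redis.hset(`stress-test:${runId}`, 'passed', passed ? '1' : '0');
  await redis.expire(`stress-test:${runId}`, 7 * 24 * 3600); // keep for a week
}
```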
v20250409-224100
[type] Disable the local test without verify (#6519) This should be commented out, since it is only used for testing.
v20250408-180855
[tritonbench] Add triton compile time dashboard (#6513) Located at `/tritonbench/compile_time`. Data source: Tritonbench nightly run: https://github.com/pytorch-labs/tritonbench/actions/workflows/compile-time.yaml Test plan: dashboard screenshot attached to the PR. cc @Jokeren --------- Co-authored-by: Huy Do <[email protected]>
v20250404-204324
[Queue Time Histogram] Add Histogram Generator (#6504) Adds the histogram generator and the logic to write to the DB table, and removes the S3 boto upload since we now write to the DB table directly. Currently writes to the `fortesting` DB for testing; once complete, this will switch to the `misc` one.
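A minimal sketch of a queue-time histogram being computed and written straight to the database, assuming a ClickHouse backend and the official `@clickhouse/client` package; the bucket boundaries, table name, and column layout are illustrative, not the schema used by this PR.

```typescript
// Sketch only: table name, columns, and buckets are assumptions.
import { createClient } from '@clickhouse/client';

const client = createClient({ url: process.env.CLICKHOUSE_URL });

// Bucket upper bounds in seconds (1m, 5m, 15m, 30m, 1h), plus an overflow bucket.
const BUCKETS = [60, 300, 900, 1800, 3600];

export function buildHistogram(queueTimesSec: number[]): number[] {
  const counts = new Array(BUCKETS.length + 1).fill(0);
  for (const t of queueTimesSec) {
    const idx = BUCKETS.findIndex((b) => t < b);
    counts[idx === -1 ? BUCKETS.length : idx] += 1;
  }
  return counts;
}

export async function writeHistogram(runnerType: string, counts: number[]): Promise<void> {
  // ClickHouse-friendly 'YYYY-MM-DD HH:MM:SS' timestamp.
  const createdAt = new Date().toISOString().replace('T', ' ').slice(0, 19);
  await client.insert({
    table: 'fortesting.queue_time_histogram', // to be switched to the misc DB once validated
    values: [{ runner_type: runnerType, bucket_counts: counts, created_at: createdAt }],
    format: 'JSONEachRow',
  });
}
```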
v20250327-194722
Fix scaleUpChron environment variables (#6479) The `scaleUpChron` lambda requires the same environment variables as the `scaleUp` lambda, but by mistake that wasn't the case. This PR simply makes sure that all relevant environment variables are correctly set.
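A minimal sketch of the kind of startup guard that catches this class of bug, failing fast when an expected environment variable is missing; the variable names below are illustrative, not the exact set shared by the two lambdas.

```typescript
// Sketch only: the required variable names are assumptions.
const REQUIRED_ENV = ['AWS_REGION', 'GITHUB_APP_ID', 'RUNNER_SUBNET_IDS'];

export function assertRequiredEnv(): void {
  const missing = REQUIRED_ENV.filter((name) => !process.env[name]);
  if (missing.length > 0) {
    throw new Error(`scaleUpChron is missing environment variables: ${missing.join(', ')}`);
  }
}
```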
v20250327-191353
[ALI] Fix concurrency issues for tryReuse on scaleUp (#6477) As noted in [this](https://github.com/pytorch/test-infra/issues/6473) issue, there is a concurrency problem between tryReuse on scaleUp and scaleDown. This PR addresses it by making sure `tryReuse` does not pick up stale runners (older than a certain age), while scaleDown only removes runners older than a certain time (that logic was already implemented). Note that for this PR to work properly, the TF variable `minimum_running_time_in_minutes` is expected to be increased, ideally to 45 minutes or more.
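A minimal sketch of the age-based guard described above: `tryReuse` only considers runners younger than the cutoff, while scaleDown only removes runners older than it. The `RunnerInfo` shape and helper names are illustrative, not the actual types in the scaleUp/scaleDown lambdas.

```typescript
// Sketch only: types and function names are assumptions.
interface RunnerInfo {
  instanceId: string;
  launchTime: Date;
}

const MINIMUM_RUNNING_TIME_MINUTES = 45; // expected value of the TF variable

function ageInMinutes(runner: RunnerInfo, now: Date = new Date()): number {
  return (now.getTime() - runner.launchTime.getTime()) / 60_000;
}

// Candidates that tryReuse is still allowed to pick up.
export function reusableRunners(runners: RunnerInfo[]): RunnerInfo[] {
  return runners.filter((r) => ageInMinutes(r) < MINIMUM_RUNNING_TIME_MINUTES);
}

// Candidates that scaleDown is allowed to terminate.
export function removableRunners(runners: RunnerInfo[]): RunnerInfo[] {
  return runners.filter((r) => ageInMinutes(r) >= MINIMUM_RUNNING_TIME_MINUTES);
}
```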
v20250317-134413
Adds scaleUpHealing chron (#6412)

# TLDR
This change introduces a new lambda, `${var.environment}-scale-up-chron`, with all the TypeScript code and required Terraform changes.

# What is changing?
This PR introduces the TypeScript code for the new lambda and the related Terraform changes to run it every 30 minutes, with a 15-minute timeout. Its permissions and access are the same as those of scaleUp. It queries HUD at a URL specified in the user configuration `retry_scale_up_chron_hud_query_url`, gets a list of instance types with the number of jobs enqueued for each, and then synchronously tries to deploy those runners.

It introduces 2 new parameters in the main module:
* `retry_scale_up_chron_hud_query_url`, which for now should point to https://hud.pytorch.org/api/clickhouse/queued_jobs_aggregate?parameters=%5B%5D only in the installations that will benefit from it (both the Meta and Linux Foundation PROD clusters, NOT canary). When this variable is set to an empty string (the default), this cron is not installed.
* `scale_config_org`, which should point to the org where scale-config files are defined. In our case it is `pytorch`. [Example of the change](https://github.com/pytorch-labs/pytorch-gha-infra/pull/622/files)

# Why are we changing this?
We're introducing this change to help recover lost requests for infra scaling. It has been proven for a while that when there are GitHub API outages we fail to receive new job webhooks or fail to provision new runners. Most of the time our retry mechanism can handle the situation, but when we are not receiving webhooks at all, or hit other more esoteric problems, there is no way to recover. With this change, every 30 minutes, jobs enqueued for longer than 30 minutes for one of the autoscaled instance types will trigger the creation of those instances. A sketch of the flow follows this entry.

A few design decisions:
1. Why rely on HUD? HUD already has this information, so it should be simple to just get it from there.
2. Why not send a scale message and let scaleUp handle it? We want isolation, so that we can easily circuit-break the creation of enqueued instances. This also guarantees that if the scaler is failing to deploy a given instance type, this mechanism won't risk flooding or overflowing the main scaler, which has to deal with all the other ones.
3. Why randomise the instance creation order? So that if some instance type is problematic, we don't completely prevent the recovery of the other instance types (we only interfere with it). We also gain some time between creations of instances of the same type, allowing for smoother operation.
4. Why a new lambda? See number 2.

# If something goes wrong?
Given that we introduced as much isolation as possible between the regular scaler and this cron recovery scaler, we don't foresee any gaps that could break the main scaler and, as a consequence, cause system breakage. Having said that, if you need to revert these changes from production, just follow the steps: https://docs.google.com/document/d/1nq3dx-_8wasii1koCkXJDSo3uz_0Ee8DzIS2-j2TOpA/edit?tab=t.0#heading=h.jwflgevrww4j

---------
Co-authored-by: Zain Rizvi <[email protected]>
Co-authored-by: Camyll Harajli <[email protected]>
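A minimal sketch of the scale-up-chron flow described above: fetch the queued-job aggregate from HUD, shuffle the runner types, and provision them synchronously, skipping everything when the HUD URL is unset. The environment variable name, response shape, and `createRunnerOfType` helper are illustrative stand-ins for the real scaleUp code.

```typescript
// Sketch only: env var name, HUD response fields, and helpers are assumptions.
interface QueuedJobCount {
  runner_type: string;
  num_queued_jobs: number;
}

const HUD_URL = process.env.RETRY_SCALE_UP_CHRON_HUD_QUERY_URL ?? '';

async function fetchQueuedJobs(): Promise<QueuedJobCount[]> {
  const resp = await fetch(HUD_URL);
  if (!resp.ok) throw new Error(`HUD query failed: ${resp.status}`);
  return (await resp.json()) as QueuedJobCount[];
}

function shuffle<T>(items: T[]): T[] {
  // Fisher-Yates shuffle so no single runner type monopolizes the retry window.
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

export async function scaleUpChron(createRunnerOfType: (t: string) => Promise<void>): Promise<void> {
  if (HUD_URL === '') return; // empty string disables the cron entirely
  const queued = await fetchQueuedJobs();
  for (const entry of shuffle(queued)) {
    for (let i = 0; i < entry.num_queued_jobs; i++) {
      await createRunnerOfType(entry.runner_type); // synchronous, isolated from the main scaler
    }
  }
}
```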
v20250313-185750
Adds additional tests to getRunnerTypes, simplifies code a bit, adds …