
Adds scaleUpHealing cron #6390

Closed
wants to merge 28 commits into from

Conversation

jeanschmidt
Contributor

@jeanschmidt jeanschmidt commented Mar 11, 2025

Addresses issue pytorch/pytorch#143041

Adds a scaleUpChron job that checks the queue for jobs that have been queued for long periods of time, and directly calls scale up for them.

Cherry-picked from Zain's original PR #6018.
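In rough terms, the healing flow described above can be sketched as follows (a minimal illustration, not the actual lambda code; all names here are assumptions):

```typescript
// Hypothetical sketch of the scaleUpChron idea: periodically inspect the
// queue and re-trigger scale-up for jobs waiting longer than a threshold.
interface QueuedJob {
  runnerLabel: string;      // runner type the job is waiting for
  queueTimeMinutes: number; // how long the job has been queued
}

// Select the jobs that exceeded the minimum queue time and therefore
// need a direct scale-up call ("healing").
function jobsNeedingHealing(
  jobs: QueuedJob[],
  minQueueTimeMinutes: number
): QueuedJob[] {
  return jobs.filter((j) => j.queueTimeMinutes >= minQueueTimeMinutes);
}
```

The cron would then call the existing scale-up path once per returned job.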


vercel bot commented Mar 11, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment: torchci (Ignored), updated Mar 13, 2025 10:03pm UTC

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Mar 11, 2025
@jeanschmidt jeanschmidt marked this pull request as draft March 11, 2025 18:31
@Camyll Camyll marked this pull request as ready for review March 11, 2025 23:34
@jeanschmidt
Contributor Author

make plan generated from: https://github.com/pytorch-labs/pytorch-gha-infra/pull/622

P1754600843

@@ -40,7 +40,7 @@ export class Config {
readonly scaleConfigRepo: string;
readonly scaleConfigRepoPath: string;
readonly scaleUpMinQueueTimeMinutes: number;
-  readonly scaleUpRecordQueueUrl: string | undefined;
+  readonly scaleUpCronRecordQueueUrl: string | undefined;
Contributor

@jeanschmidt do you think we should be consistent in naming with scaleUpChron vs scaleUpCron? If I were trying to understand this code I'd be confused why these are different

Contributor Author

Good point, I think we do.

@@ -65,7 +65,7 @@ resource "aws_lambda_function" "scale_up" {
REDIS_LOGIN = var.redis_login
RETRY_SCALE_UP_RECORD_DELAY_S = "60"
RETRY_SCALE_UP_RECORD_JITTER_PCT = "0.5"
-      RETRY_SCALE_UP_RECORD_QUEUE_URL = var.sqs_build_queue_retry.url
+      RETRY_SCALE_UP_CRON_RECORD_QUEUE_URL = var.sqs_build_queue_retry.url
Contributor

@jeanschmidt, @ZainRizvi suggested we just add the actual URL here instead of using a variable, since it won't change. If we leave it as a var, how do we set it?
The URL will be: https://hud.pytorch.org/api/clickhouse/queued_jobs_aggregate?parameters=%5B%5D
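For reference, the `parameters` query string in that URL appears to be a URL-encoded JSON array; a client could build it like this (a sketch inferred from the example URL above, not from the HUD source):

```typescript
// Build the queued-jobs-aggregate URL. The parameter encoding (a JSON array
// in the `parameters` query string) is an assumption based on the example URL.
function queuedJobsAggregateUrl(base: string, parameters: unknown[] = []): string {
  const encoded = encodeURIComponent(JSON.stringify(parameters));
  return `${base}/api/clickhouse/queued_jobs_aggregate?parameters=${encoded}`;
}
```

With an empty parameter list this reproduces the `parameters=%5B%5D` suffix shown above.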

jeanschmidt added a commit that referenced this pull request Mar 13, 2025
…le_config.py to enforce it (#6402)

# TLDR

Adds `ephemeral` variant to all runner types, and updates
validate_scale_config.py to enforce it

# What?

Enable experiment with ephemeral runners for all workflows. The status
of the experiment can be found in the [test-infra
issue](#5132).

# Why?

These runners are ephemeral; eliminating nonephemeral runners is a follow-up to the recent security incident. Refreshable infrastructure has been something we've been trying to accomplish for a while, without success. The major blocker we face is related to stockouts and unreliability on the GH side, largely because nonephemeral runners can run other jobs and continue clearing the queue in case of a problem. This is not possible for ephemeral runners.

# How?

To remediate stockouts, the [reuse/refresh of ephemeral
instances](#6315) has been
introduced.

In order to remediate GH side issues, a [queue healing
mechanism](#6390) is being
implemented.

Also, in order to guarantee the stability of this behaviour, [additional
unit tests have been included with other
changes](#6403).

# Next steps

After merging these changes, we intend to put a small percentage of jobs
on ephemeral runners, so we can evaluate the impact on queue times,
gather statistics on reuse, and tune noflap cluster behaviour. Once we
feel comfortable, the experiment will be shifted to 100% and we'll
migrate all workflows to fully ephemeral instances. Eventually, all
runners will be ephemeral and the experiment and runner variants will be
removed, as we update other workflows like slow, trunk and nightlies.

# What if the ephemeral runners are not working properly?

Then go to the experiment linked in the `What?` section of this document and
set the experiment named `ephemeral` to 0.
yangw-dev and others added 10 commits March 13, 2025 14:35
…ct (#6371)

#6294

Details:
- print `os` in the test spec printout section
- store os, job_arn (mobile job) and job_conclusion in each artifact's
metadata. Note this is not the GitHub job conclusion; it is the mobile job
conclusion.
- wrap post-test logic into ReportProcessor:
    - pass the AWS client as a parameter for test-driven purposes
    - add a unit test for ReportProcessor
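The dependency-injection idea behind ReportProcessor can be sketched like this (interface and field names are assumptions, not the actual implementation):

```typescript
// Sketch of the ReportProcessor pattern: the storage client is injected as a
// constructor parameter so unit tests can substitute a fake for the real AWS client.
interface StorageClient {
  put(key: string, body: string): Promise<void>;
}

interface JobReport {
  os: string;
  jobArn: string;
  // Note: the mobile-job conclusion, not the GitHub job conclusion.
  jobConclusion: string;
}

class ReportProcessor {
  constructor(private readonly client: StorageClient) {}

  // Serialize the report and hand it to whatever client was injected.
  async process(key: string, report: JobReport): Promise<void> {
    await this.client.put(key, JSON.stringify(report));
  }
}
```

A unit test can then pass a fake `StorageClient` that records calls in memory.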
Show monsterized failures in the grouped view in HUD, for example:

![image](https://github.com/user-attachments/assets/87576ca8-6ad1-4005-8f2d-8879b8c274c7)

This makes it easy to see if a failure is repeating and when it started,
without having to uncollapse or deal with shifting shard columns.

- tooltip shows how many tests per failure type
- nothing changed in the non-grouped view
Some build artifacts weren't showing up under the job in the show-artifacts
button because the types for some things were incorrect: the key was actually
a number and not a string, making the later read of the map incorrect.

Old:
Missing build artifact
<img width="391" alt="image"
src="https://github.com/user-attachments/assets/a85e7420-9b0f-4efd-83e1-83d1a66a3656"
/>

New:
<img width="393" alt="image"
src="https://github.com/user-attachments/assets/4edadfd5-769c-4988-be6c-a69ae7c69db1"
/>
# Description
Issue: #6294
Prepare mobile_job yml to generate benchmark record when job fails.

## Background
When a git benchmark job fails (or some of the mobile jobs fail), we
need to generate a benchmark record to indicate that the model has failures.

For instance, take a benchmark job named `benchmark-on-device (ic3,
coreml_fp16, apple_iphone_15,
arn:aws:devicefarm:us-west-2:308535385114... / mobile-job (ios)`:
- when the whole job fails, we want to indicate that model ic3 with
backend coreml_fp16 on iOS failed for all metrics;
- when one of the devices in the job fails (iPhone 15 with OS 17.1), we
want to indicate that model ic3 with backend coreml_fp16 failed on iPhone
15 with OS 17.1, while the others succeeded.

Key point: always generate the artifact JSON with the git job name.

## Change Details
- [yaml] add logic to generate artifact.json if any previous step fails
and there is no expected artifact.json; this makes sure we always have
the artifact JSON with the git job name
- [script] add a flag `--new-json-output-format` to toggle the mobile
job to generate artifact.json in the new format.
- see example of new json result ([s3
link](https://gha-artifacts.s3.us-east-1.amazonaws.com/device_farm/13821036006/1/artifacts/ios-artifacts-38666170088.json?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=ASIAUPVRELQNEU5O2WYP%2F20250312%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250312T212644Z&X-Amz-Expires=300&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEH4aCXVzLWVhc3QtMSJHMEUCIQC7%2BkVAOsGTimttLszL6u3N4HeFdSzwmPzlOYQBh%2BU%2BzwIgNjk%2FM73TZ9YfN6W92yjuRBUevYQ1BWWf0M7rmky4IT0q0AMIx%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARAFGgwzMDg1MzUzODUxMTQiDCWs46GorlC4PkgCmCqkA7TQ41pTu7Pw2vUyPArSC95%2FUUHvRy5DCUEGOUwKmscwv%2B0D9jRdGfQ05E4dtVKliXhNnBRu2oH2u9WIPGKgR3fFjrVRvy2bzQhMYVjAqfUnG%2BhVO2hOKC6U33bMMNJ4SziagDSsAwHBRXl2YLsd9x4ToLubWcHFd4RtE5ZTFQFBHoB05KmzRJ5O00P6m%2BmzBvNh0T%2F2nj2l5c66VmBOe5xeyqEEHXsw3jD98NGrff7nQrONMDpRLjS74Hz%2Fz%2BGJL9RNwNQ2yJYSUdmkrTk4wi7ToNGrzpJm4Lh7wOprHQVwqpVnYaZjw7bJrTk4of4%2FE0%2FBsI1L3GqCxCt6kig02JKYBOy2nFNeRMR09xCSVQCvZE39zKZxrbilH%2FwBzHCS8KvqP14hhGbo%2F%2F08DWVBTZIgrQii0lNaPkB6c%2F0%2BCghTCQv1hUqhIY3avR3TquZzdZNeavNVU6is%2ByJtFpVZzCCH1AzeCRMcnJAlHdGyv9guD5q5wMpRICAihdmFnFy1LQZNAjSisMr0Z4zFfRKJzGdKSpdyL9D5O063WU0VVtmfI0U4fzCz38e%2BBjqEApAZr2cVZ87wIvVZOhcPBDmz%2F9mBgH5LSIK0bfkuZz6vhkUpJbmHbID6YjraMitF1ht1%2FgQtCQkHaejdA9y99K0KEwcT5JVEFaiJNhm5o7KvZJ1jlDqNAklD8brH63PQ705eszJeILnBAmKdOxTrqb83EEmg5Z2eSIjf7Cl04Si21S%2FZomsjHG1zlcHT4jZ9%2FzXPHNHFVmuMwqOVSTzMXx2BKHrOrtwW%2BbpQ8x8rOC5E9P85c86MSDefTk%2BC9Hoee16B45ywR%2BbH7I9fK%2FZ27v%2BCE0gHQglXCHTFVSp7mk18KQw67BJqq5nJDAQ%2BtEdezGj2O5iiG2Amto3XgUbeSRvTi7iF&X-Amz-Signature=49b1065e9246c807c434b8fd2dc510c014fb12a3ceb2605034da70ee2a64ca68&X-Amz-SignedHeaders=host&response-content-disposition=inline))
- [script] add git_job_name, run_report and job_reports to
artifacts.json
- git_job_name: used to build a benchmark record if a git job failed [a
trick to grab model info]
- job_reports & run_report: we currently don't have extra info about
mobile job conclusions; this can be used to upload to time_series or a
notification system for failure details.



## PRs that simulate failure cases for the generation logic
Mimic a step failing before the benchmark test (no JSON
generated): #6397
Mimic the benchmark test step failing but with an artifact:
#6398
ExecuTorch Sync Test: pytorch/executorch#9204


## Details
When the flag is on, artifact.json is converted from
```
[ 
   ....
]
```
to

```
{
    "git_job_name": str,
    "artifacts": [...],
    "run_report": {...},
    "job_reports": [...]
}
```
This flag is temporary, used while the logic is brought into sync between the repos.
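The conversion above amounts to wrapping the old artifact list in a new top-level object; a minimal sketch (field names from the description above, everything else assumed):

```typescript
// Wrap the old artifact list (a bare JSON array) into the new output format.
interface NewArtifactOutput {
  git_job_name: string;
  artifacts: unknown[];
  run_report: Record<string, unknown>;
  job_reports: unknown[];
}

function wrapArtifacts(gitJobName: string, oldArtifacts: unknown[]): NewArtifactOutput {
  return {
    git_job_name: gitJobName, // always present, even when the job failed
    artifacts: oldArtifacts,
    run_report: {},
    job_reports: [],
  };
}
```

Because `git_job_name` is always set, a failed job still yields enough information to build its benchmark record.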
…the support for `c.` runners (#6403)

## Support for `c.` runners

We don't really support our canary environment well with variants (not
sure why, but this condition was not present), so variants got the
wrong naming when running in Meta's canary environment:
`variant.c.runner.name` instead of `c.variant.runner.name`. This fix is
very low impact and should not change anything in production.

## Simplifies the code

There is a useless `if` that I noticed when reading `getRunnerTypes`. I
am removing that conditional.

## Additional tests for getRunnerTypes

As we're adding empty variants as part of the ephemeral migration, I
want to make sure that we have green tests and we're covering this
situation in our unit tests.

The new situations to cover are:
1) Empty variants
2) meta's canary environment naming convention `c.`
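The naming rule being fixed can be sketched as follows (function name and signature are assumptions; only the `c.`-before-variant ordering comes from the description above):

```typescript
// Compose a runner type name. In Meta's canary environment the `c.` prefix
// must come first, producing `c.variant.runner`, not `variant.c.runner`.
function runnerTypeName(runner: string, variant: string, isCanary: boolean): string {
  // Empty variants (added for the ephemeral migration) contribute no segment.
  const base = variant ? `${variant}.${runner}` : runner;
  return isCanary ? `c.${base}` : base;
}
```

This also covers the two new test situations: an empty variant and the canary `c.` prefix.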
@jeanschmidt
Contributor Author

This PR got rebased again.

I'll close it, do a cherry-pick, and pick up from there.

jeanschmidt added a commit to pytorch/ci-infra that referenced this pull request Mar 17, 2025
# Releases test-infra to version v20250317-134413

It mostly includes the new lambda `scaleUpChron`:
pytorch/test-infra#6390