Adds scaleUpHealing cron #6390
Conversation
`make plan` generated from: https://github.com/pytorch-labs/pytorch-gha-infra/pull/622 P1754600843
```diff
@@ -40,7 +40,7 @@ export class Config {
   readonly scaleConfigRepo: string;
   readonly scaleConfigRepoPath: string;
   readonly scaleUpMinQueueTimeMinutes: number;
-  readonly scaleUpRecordQueueUrl: string | undefined;
+  readonly scaleUpCronRecordQueueUrl: string | undefined;
```
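To make the rename concrete, here is a hypothetical sketch of how `Config` might expose the renamed property from the diff above. The env-var name and class shape are assumptions (the real class in `config.ts` reads many more fields), and the environment is passed in explicitly to keep the sketch self-contained.

```typescript
// Sketch only: property name follows the diff above; everything else is assumed.
class Config {
  readonly scaleUpCronRecordQueueUrl: string | undefined;

  constructor(env: Record<string, string | undefined>) {
    // Stays undefined when the variable is absent, so callers can
    // feature-gate the cron-record path on its presence.
    this.scaleUpCronRecordQueueUrl = env["RETRY_SCALE_UP_CRON_RECORD_QUEUE_URL"];
  }
}
```

Keeping the type `string | undefined` (rather than defaulting to an empty string) lets the lambda detect an unset queue URL and skip recording instead of sending to a bogus queue.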
@jeanschmidt do you think we should be consistent in naming with scaleUpChron vs scaleUpCron? If I were trying to understand this code I'd be confused why these are different
Good point, I think we do.
```diff
@@ -65,7 +65,7 @@ resource "aws_lambda_function" "scale_up" {
   REDIS_LOGIN                      = var.redis_login
   RETRY_SCALE_UP_RECORD_DELAY_S    = "60"
   RETRY_SCALE_UP_RECORD_JITTER_PCT = "0.5"
-  RETRY_SCALE_UP_RECORD_QUEUE_URL  = var.sqs_build_queue_retry.url
+  RETRY_SCALE_UP_CRON_RECORD_QUEUE_URL = var.sqs_build_queue_retry.url
```
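The two retry settings above (`RETRY_SCALE_UP_RECORD_DELAY_S = "60"`, `RETRY_SCALE_UP_RECORD_JITTER_PCT = "0.5"`) suggest a jittered retry delay. A minimal sketch of how such a delay could be computed; the function name is hypothetical and not from this repo:

```typescript
// Base delay plus a random jitter of up to jitterPct of the base, so
// retried records spread out instead of arriving in lockstep.
// With the values above this yields a delay in [60s, 90s].
function retryDelaySeconds(baseDelayS: number, jitterPct: number): number {
  return baseDelayS * (1 + Math.random() * jitterPct);
}
```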
@jeanschmidt, @ZainRizvi suggested we just add the actual URL here instead of using a variable, since it won't change. If we leave it as a var, how do we set it?
The URL will be: https://hud.pytorch.org/api/clickhouse/queued_jobs_aggregate?parameters=%5B%5D
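The endpoint above returns an aggregate of queued jobs, which is what the healing cron inspects before calling scale-up. A hedged sketch of that core check; the `QueuedJob` field names are assumptions, not the real API contract:

```typescript
// Hypothetical shape of one row from the queued_jobs_aggregate endpoint.
interface QueuedJob {
  runnerLabel: string;
  queueTimeMinutes: number;
  count: number;
}

// Keep only jobs queued longer than the configured minimum
// (cf. scaleUpMinQueueTimeMinutes in the Config diff above),
// i.e. the jobs the cron should call scale-up for.
function jobsNeedingScaleUp(jobs: QueuedJob[], minQueueTimeMinutes: number): QueuedJob[] {
  return jobs.filter((j) => j.queueTimeMinutes >= minQueueTimeMinutes);
}
```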
…le_config.py to enforce it (#6402)

# TLDR
Adds an `ephemeral` variant to all runner types, and updates validate_scale_config.py to enforce it.

# What?
Enables experimenting with ephemeral runners for all workflows. The status of the experiment can be found in the [test-infra issue](#5132).

# Why?
Eliminating nonephemeral runners is a follow-up to the recent security incident. Refreshable infrastructure is something we've been trying to accomplish for a while without success. The major blocker is stockouts and unreliability on the GH side; nonephemeral runners can run other jobs and continue clearing the queue when a problem occurs, which is not possible for ephemeral runners.

# How?
To remediate stockouts, [reuse/refresh of ephemeral instances](#6315) has been introduced. To remediate GH-side issues, a [queue healing mechanism](#6390) is being implemented. Also, to guarantee the stability of this behaviour, [additional unit tests have been included with other changes](#6403).

# Next steps
After merging these changes, we intend to move a small percentage of jobs onto ephemeral runners, so we can evaluate the impact on queue times, gather statistics on reuse, and tune noflap cluster behaviour. Once we feel comfortable, the experiment will be shifted to 100% and we'll migrate all workflows to fully ephemeral instances. Eventually, all runners will be ephemeral and the experiment and runner variants will be removed, as we update other workflows like slow, trunk and nightlies.

# The ephemeral runners are not working properly?
Go to the experiment linked in the `What?` section and set the experiment named `ephemeral` to 0.
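The enforcement described above (every runner type must declare an `ephemeral` variant) can be sketched as a validation pass. The real check lives in validate_scale_config.py; this is an illustrative TypeScript transcription with assumed data shapes:

```typescript
// Assumed shape: a scale config maps runner-type names to definitions
// that may carry a variants map.
interface RunnerType {
  variants?: Record<string, object>;
}

// Return the names of runner types that fail the rule, so the validator
// can report them all at once instead of stopping at the first failure.
function missingEphemeralVariant(runnerTypes: Record<string, RunnerType>): string[] {
  return Object.entries(runnerTypes)
    .filter(([, rt]) => !(rt.variants && "ephemeral" in rt.variants))
    .map(([name]) => name);
}
```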
…ct (#6371) #6294

Details:
- print `os` in the test spec printout section
- store os, job_arn (mobile job) and job_conclusion in each artifact's metadata. Note this is not the GitHub job conclusion; it is the mobile job conclusion.
- wrap post-test logic into ReportProcessor
- pass the aws client as a parameter for test-driven purposes
- add unit tests for ReportProcessor
Show monsterized failures in the grouped view in HUD; example:  This makes it easy to see whether a failure is repeating and when it started, without having to uncollapse rows or deal with shifting shard columns.
- the tooltip shows how many tests hit each failure type
- nothing changed in the non-grouped view
Some build artifacts weren't showing up under the job in the show-artifacts button because the types of some values were incorrect: `key` was actually a number, not a string, making the later read of the map miss. Old (missing build artifact): <img width="391" alt="image" src="https://github.com/user-attachments/assets/a85e7420-9b0f-4efd-83e1-83d1a66a3656" /> New: <img width="393" alt="image" src="https://github.com/user-attachments/assets/4edadfd5-769c-4988-be6c-a69ae7c69db1" />
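The bug above is a classic string-vs-number key mismatch. A minimal, self-contained illustration (the map name and values are hypothetical, not the HUD code):

```typescript
// A map keyed by job id *as a string*.
const artifactsByJob = new Map<string, string[]>();

// The id arrived as a number because the declared type was wrong.
const rawKey: unknown = 12345;

// The fix: normalize to a string before using it as the map key,
// so later string-keyed lookups find the entry.
const key = String(rawKey);
artifactsByJob.set(key, ["build.zip"]);
```

Without the `String(...)` normalization, `artifactsByJob.set(12345, ...)` and `artifactsByJob.get("12345")` would touch different entries, which is exactly the "missing artifact" symptom.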
# Description
Issue: #6294
Prepare mobile_job yml to generate a benchmark record when the job fails.

## Background
When a git benchmark job fails (or part of the mobile job fails), we need to generate a benchmark record to indicate that the model has failures. For instance, take a benchmark job named `benchmark-on-device (ic3, coreml_fp16, apple_iphone_15, arn:aws:devicefarm:us-west-2:308535385114... / mobile-job (ios)`:
- when the whole job fails, we want to indicate that model ic3 with backend coreml_fp16 on iOS failed for all metrics
- when one of the devices in the job fails (iPhone 15 with os 17.1), we want to indicate that model ic3 with backend coreml_fp16 failed for iPhone 15 with os 17.1, while the others succeeded

Key point: always generate the artifact json with the git job name.

## Change Details
- [yaml] add logic to generate artifact.json if any previous step fails and there is no expected artifact.json; this makes sure we always have the artifact json with the git job name
- [script] add a flag `--new-json-output-format` to toggle the mobile job to generate artifact.json in the new format
- see an example of the new json result ([s3 link](https://gha-artifacts.s3.us-east-1.amazonaws.com/device_farm/13821036006/1/artifacts/ios-artifacts-38666170088.json?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=ASIAUPVRELQNEU5O2WYP%2F20250312%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250312T212644Z&X-Amz-Expires=300&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEH4aCXVzLWVhc3QtMSJHMEUCIQC7%2BkVAOsGTimttLszL6u3N4HeFdSzwmPzlOYQBh%2BU%2BzwIgNjk%2FM73TZ9YfN6W92yjuRBUevYQ1BWWf0M7rmky4IT0q0AMIx%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARAFGgwzMDg1MzUzODUxMTQiDCWs46GorlC4PkgCmCqkA7TQ41pTu7Pw2vUyPArSC95%2FUUHvRy5DCUEGOUwKmscwv%2B0D9jRdGfQ05E4dtVKliXhNnBRu2oH2u9WIPGKgR3fFjrVRvy2bzQhMYVjAqfUnG%2BhVO2hOKC6U33bMMNJ4SziagDSsAwHBRXl2YLsd9x4ToLubWcHFd4RtE5ZTFQFBHoB05KmzRJ5O00P6m%2BmzBvNh0T%2F2nj2l5c66VmBOe5xeyqEEHXsw3jD98NGrff7nQrONMDpRLjS74Hz%2Fz%2BGJL9RNwNQ2yJYSUdmkrTk4wi7ToNGrzpJm4Lh7wOprHQVwqpVnYaZjw7bJrTk4of4%2FE0%2FBsI1L3GqCxCt6kig02JKYBOy2nFNeRMR09xCSVQCvZE39zKZxrbilH%2FwBzHCS8KvqP14hhGbo%2F%2F08DWVBTZIgrQii0lNaPkB6c%2F0%2BCghTCQv1hUqhIY3avR3TquZzdZNeavNVU6is%2ByJtFpVZzCCH1AzeCRMcnJAlHdGyv9guD5q5wMpRICAihdmFnFy1LQZNAjSisMr0Z4zFfRKJzGdKSpdyL9D5O063WU0VVtmfI0U4fzCz38e%2BBjqEApAZr2cVZ87wIvVZOhcPBDmz%2F9mBgH5LSIK0bfkuZz6vhkUpJbmHbID6YjraMitF1ht1%2FgQtCQkHaejdA9y99K0KEwcT5JVEFaiJNhm5o7KvZJ1jlDqNAklD8brH63PQ705eszJeILnBAmKdOxTrqb83EEmg5Z2eSIjf7Cl04Si21S%2FZomsjHG1zlcHT4jZ9%2FzXPHNHFVmuMwqOVSTzMXx2BKHrOrtwW%2BbpQ8x8rOC5E9P85c86MSDefTk%2BC9Hoee16B45ywR%2BbH7I9fK%2FZ27v%2BCE0gHQglXCHTFVSp7mk18KQw67BJqq5nJDAQ%2BtEdezGj2O5iiG2Amto3XgUbeSRvTi7iF&X-Amz-Signature=49b1065e9246c807c434b8fd2dc510c014fb12a3ceb2605034da70ee2a64ca68&X-Amz-SignedHeaders=host&response-content-disposition=inline))
- [script] add git_job_name, run_report and job_reports to artifacts.json
  - git_job_name: used to build the benchmark record if a git job failed [a tricky way to grab the model info]
  - job_reports & run_report: we currently don't have extra info about mobile job conclusions; this can be used to upload to time_series or a notification system for failure details.

## PRs that simulate failure cases for the generating logic
- Mimic a step failing before the benchmark test (no json generated): #6397
- Mimic the benchmark test failing but with an artifact: #6398
- ExecuTorch sync test: pytorch/executorch#9204

## Details
When the flag is on, artifact.json is converted from
```
[ .... ]
```
to
```
{
  "git_job_name": str,
  "artifacts": [...],
  "run_report": {},
  "job_reports": [....]
}
```
This flag is temporary, to cover the period in which the logic may not yet be in sync between the repos.
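The format change above (bare array to wrapped object) can be sketched as a small conversion helper. Field types beyond what the PR text shows are assumptions, and the function name is hypothetical:

```typescript
// New artifact.json shape, per the Details section above.
interface NewArtifactJson {
  git_job_name: string;
  artifacts: unknown[];
  run_report: Record<string, unknown>;
  job_reports: unknown[];
}

// Wrap the old bare-array format, attaching the git job name so a
// benchmark record can still be built when the job itself failed.
function toNewFormat(oldArtifacts: unknown[], gitJobName: string): NewArtifactJson {
  return {
    git_job_name: gitJobName,
    artifacts: oldArtifacts,
    run_report: {},
    job_reports: [],
  };
}
```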
…the support for `c.` runners (#6403)

## Support for `c.` runners
We don't support our canary environment well with variants; this condition was not present, so variants would get the wrong naming when running in meta's canary environment: `variant.c.runner.name` instead of `c.variant.runner.name`. This fix is very low impact and should not change anything in production.

## Simplifies the code
There is a useless `if` that I noticed when reading `getRunnerTypes`. I am removing that conditional.

## Additional tests for getRunnerTypes
As we're adding empty variants as part of the ephemeral migration, I want to make sure that we have green tests and that we're covering this situation in our unit tests. The new situations to cover are:
1) empty variants
2) meta's canary environment naming convention `c.`
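The naming rule described above (canary prefix before the variant, so `c.variant.runner.name` rather than `variant.c.runner.name`) can be sketched as follows. The function and flag names are hypothetical, not the actual `getRunnerTypes` code:

```typescript
// Build a variant runner name, keeping meta's canary `c.` prefix in front.
function variantRunnerName(base: string, variant: string, isCanary: boolean): string {
  // Strip an existing canary prefix so it can be re-applied at the front
  // of the combined name instead of ending up after the variant.
  const core = base.startsWith("c.") ? base.slice(2) : base;
  const named = `${variant}.${core}`;
  return isCanary ? `c.${named}` : named;
}
```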
Doc design: https://docs.google.com/document/d/1OiPv-ku_NvMgvnaMIbtnIEHTixJY28CBaKeMKJtglEk/edit?tab=t.0
Adds the table schema for queue_time_histogram.
This PR got rebased again. I'll close it, do a cherry-pick, and pick up from here.
# Releases test-infra to version v20250317-134413
It mostly includes the new lambda `scaleUpChron`: pytorch/test-infra#6390
Addresses issue pytorch/pytorch#143041
Adds a scaleUpChron job that checks the queue for jobs that have been queued for long periods of time and directly calls scale-up for them.
Cherry-picked from Zain's original PR #6018.