
Adds scaleUpHealing cron #6390

Closed
wants to merge 28 commits into from

Conversation

jeanschmidt
Contributor

@jeanschmidt jeanschmidt commented Mar 11, 2025

Addresses issue pytorch/pytorch#143041

Adds a scaleUpChron job that checks the queue for jobs that have been queued for long periods of time, and directly calls scale up for them.

Cherry-picked from Zain's original PR #6018.
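In rough terms, the healing flow described above can be sketched as follows (a minimal illustration, not the actual lambda code; all names here are assumptions):

```typescript
// Hypothetical sketch of the scaleUpChron idea: periodically inspect the
// queue and re-trigger scale-up for jobs waiting longer than a threshold.
interface QueuedJob {
  runnerLabel: string;      // runner type the job is waiting for
  queueTimeMinutes: number; // how long the job has been queued
}

// Select the jobs that exceeded the minimum queue time and therefore
// need a direct scale-up call ("healing").
function jobsNeedingHealing(
  jobs: QueuedJob[],
  minQueueTimeMinutes: number
): QueuedJob[] {
  return jobs.filter((j) => j.queueTimeMinutes >= minQueueTimeMinutes);
}
```

The cron would then call the existing scale-up path once per returned job.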


vercel bot commented Mar 11, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment: torchci (Ignored), updated Mar 13, 2025 10:03pm UTC

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Mar 11, 2025
@jeanschmidt jeanschmidt marked this pull request as draft March 11, 2025 18:31
@Camyll Camyll marked this pull request as ready for review March 11, 2025 23:34
@jeanschmidt
Contributor Author

make plan generated from: https://github.com/pytorch-labs/pytorch-gha-infra/pull/622

P1754600843

@@ -40,7 +40,7 @@ export class Config {
readonly scaleConfigRepo: string;
readonly scaleConfigRepoPath: string;
readonly scaleUpMinQueueTimeMinutes: number;
-  readonly scaleUpRecordQueueUrl: string | undefined;
+  readonly scaleUpCronRecordQueueUrl: string | undefined;
Contributor

@jeanschmidt do you think we should be consistent in naming with scaleUpChron vs scaleUpCron? If I were trying to understand this code I'd be confused why these are different

Contributor Author

Good point, I think we do.

@@ -65,7 +65,7 @@ resource "aws_lambda_function" "scale_up" {
REDIS_LOGIN = var.redis_login
RETRY_SCALE_UP_RECORD_DELAY_S = "60"
RETRY_SCALE_UP_RECORD_JITTER_PCT = "0.5"
-      RETRY_SCALE_UP_RECORD_QUEUE_URL = var.sqs_build_queue_retry.url
+      RETRY_SCALE_UP_CRON_RECORD_QUEUE_URL = var.sqs_build_queue_retry.url
Contributor

@jeanschmidt, @ZainRizvi suggested we just add the actual URL here instead of using a variable, since it won't change. If we leave it as a var, how do we set it?
The URL will be: https://hud.pytorch.org/api/clickhouse/queued_jobs_aggregate?parameters=%5B%5D
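For reference, the `parameters` query string in that URL appears to be a URL-encoded JSON array; a client could build it like this (a sketch inferred from the example URL above, not from the HUD source):

```typescript
// Build the queued-jobs-aggregate URL. The parameter encoding (a JSON array
// in the `parameters` query string) is an assumption based on the example URL.
function queuedJobsAggregateUrl(base: string, parameters: unknown[] = []): string {
  const encoded = encodeURIComponent(JSON.stringify(parameters));
  return `${base}/api/clickhouse/queued_jobs_aggregate?parameters=${encoded}`;
}
```

With an empty parameter list this reproduces the `parameters=%5B%5D` suffix shown above.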

jeanschmidt added a commit that referenced this pull request Mar 13, 2025
…le_config.py to enforce it (#6402)

# TLDR

Adds `ephemeral` variant to all runner types, and updates
validate_scale_config.py to enforce it

# What?

Enable experiment with ephemeral runners for all workflows. The status
of the experiment can be found in the [test-infra
issue](#5132).

# Why?

These runners are ephemeral; eliminating nonephemeral runners is a follow-up to the recent security incident. Refreshable infrastructure has been something we've been trying to accomplish for a while, without success. The major blocker we face is related to stockouts and unreliability on the GH side, largely because nonephemeral runners can run other jobs and continue clearing the queue in case of a problem. This is not possible for ephemeral runners.

# How?

To remediate stockouts, the [reuse/refresh of ephemeral
instances](#6315) has been
introduced.

In order to remediate GH side issues, a [queue healing
mechanism](#6390) is being
implemented.

Also, in order to guarantee the stability of this behaviour, [additional
unit tests have been included with other
changes](#6403).

# Next steps

After merging these changes, we intend to put a small percentage of jobs
on ephemeral runners, so we can evaluate the impact on queue times,
gather statistics on reuse, and tune noflap cluster behaviour. Once we
feel comfortable, the experiment will be shifted to 100% and we'll
migrate all workflows to fully ephemeral instances. Eventually, all
runners will be ephemeral and the experiment and runner variants will be
removed, as we update other workflows like slow, trunk and nightlies.

# What if the ephemeral runners are not working properly?

Then go to the experiment linked in the `What?` section of this document and
set the experiment named `ephemeral` to 0.
yangw-dev and others added 10 commits March 13, 2025 14:35
…ct (#6371)

#6294

Details:
- print `os` in the test spec printout section
- store os, job_arn (mobile job) and job_conclusion in each artifact's
metadata. Note this is not the GitHub job conclusion; it is the mobile job
conclusion.
- wrap post-test logic into ReportProcessor:
    - pass the AWS client as a parameter for test-driven purposes
    - add a unit test for ReportProcessor
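The dependency-injection idea behind ReportProcessor can be sketched like this (interface and field names are assumptions, not the actual implementation):

```typescript
// Sketch of the ReportProcessor pattern: the storage client is injected as a
// constructor parameter so unit tests can substitute a fake for the real AWS client.
interface StorageClient {
  put(key: string, body: string): Promise<void>;
}

interface JobReport {
  os: string;
  jobArn: string;
  // Note: the mobile-job conclusion, not the GitHub job conclusion.
  jobConclusion: string;
}

class ReportProcessor {
  constructor(private readonly client: StorageClient) {}

  // Serialize the report and hand it to whatever client was injected.
  async process(key: string, report: JobReport): Promise<void> {
    await this.client.put(key, JSON.stringify(report));
  }
}
```

A unit test can then pass a fake `StorageClient` that records calls in memory.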
Show monsterized failures in the grouped view in HUD, for example:

![image](https://github.com/user-attachments/assets/87576ca8-6ad1-4005-8f2d-8879b8c274c7)

This makes it easy to see if a failure is repeating and when it started,
without having to uncollapse or deal with shifting shard columns.

- tooltip shows how many tests per failure type
- nothing changed in the non-grouped view
Some build artifacts weren't showing up under the job in the show-artifacts
button because the types for some things were incorrect: the key was actually
a number and not a string, making the later read of the map incorrect.

Old:
Missing build artifact
<img width="391" alt="image"
src="https://github.com/user-attachments/assets/a85e7420-9b0f-4efd-83e1-83d1a66a3656"
/>

New:
<img width="393" alt="image"
src="https://github.com/user-attachments/assets/4edadfd5-769c-4988-be6c-a69ae7c69db1"
/>
# Description
Issue: #6294
Prepare mobile_job yml to generate benchmark record when job fails.

## Background
When a git benchmark job fails (or some of the mobile jobs fail), we
need to generate a benchmark record to indicate that the model has failures.

For instance, take a benchmark job named `benchmark-on-device (ic3,
coreml_fp16, apple_iphone_15,
arn:aws:devicefarm:us-west-2:308535385114... / mobile-job (ios)`:
- when the whole job fails, we want to indicate that model ic3 with
backend coreml_fp16 on iOS failed for all metrics;
- when one of the devices in the job fails (iPhone 15 with OS 17.1), we
want to indicate that model ic3 with backend coreml_fp16 failed on iPhone
15 with OS 17.1, while the others succeeded.

Key point: always generate the artifact JSON with the git job name.

## Change Details
- [yaml] add logic to generate artifact.json if any previous step fails
and there is no expected artifact.json; this makes sure we always have
the artifact JSON with the git job name
- [script] add a flag `--new-json-output-format` to toggle the mobile
job to generate artifact.json in the new format.
- see example of new json result ([s3
link](https://gha-artifacts.s3.us-east-1.amazonaws.com/device_farm/13821036006/1/artifacts/ios-artifacts-38666170088.json?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=ASIAUPVRELQNEU5O2WYP%2F20250312%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250312T212644Z&X-Amz-Expires=300&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEH4aCXVzLWVhc3QtMSJHMEUCIQC7%2BkVAOsGTimttLszL6u3N4HeFdSzwmPzlOYQBh%2BU%2BzwIgNjk%2FM73TZ9YfN6W92yjuRBUevYQ1BWWf0M7rmky4IT0q0AMIx%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARAFGgwzMDg1MzUzODUxMTQiDCWs46GorlC4PkgCmCqkA7TQ41pTu7Pw2vUyPArSC95%2FUUHvRy5DCUEGOUwKmscwv%2B0D9jRdGfQ05E4dtVKliXhNnBRu2oH2u9WIPGKgR3fFjrVRvy2bzQhMYVjAqfUnG%2BhVO2hOKC6U33bMMNJ4SziagDSsAwHBRXl2YLsd9x4ToLubWcHFd4RtE5ZTFQFBHoB05KmzRJ5O00P6m%2BmzBvNh0T%2F2nj2l5c66VmBOe5xeyqEEHXsw3jD98NGrff7nQrONMDpRLjS74Hz%2Fz%2BGJL9RNwNQ2yJYSUdmkrTk4wi7ToNGrzpJm4Lh7wOprHQVwqpVnYaZjw7bJrTk4of4%2FE0%2FBsI1L3GqCxCt6kig02JKYBOy2nFNeRMR09xCSVQCvZE39zKZxrbilH%2FwBzHCS8KvqP14hhGbo%2F%2F08DWVBTZIgrQii0lNaPkB6c%2F0%2BCghTCQv1hUqhIY3avR3TquZzdZNeavNVU6is%2ByJtFpVZzCCH1AzeCRMcnJAlHdGyv9guD5q5wMpRICAihdmFnFy1LQZNAjSisMr0Z4zFfRKJzGdKSpdyL9D5O063WU0VVtmfI0U4fzCz38e%2BBjqEApAZr2cVZ87wIvVZOhcPBDmz%2F9mBgH5LSIK0bfkuZz6vhkUpJbmHbID6YjraMitF1ht1%2FgQtCQkHaejdA9y99K0KEwcT5JVEFaiJNhm5o7KvZJ1jlDqNAklD8brH63PQ705eszJeILnBAmKdOxTrqb83EEmg5Z2eSIjf7Cl04Si21S%2FZomsjHG1zlcHT4jZ9%2FzXPHNHFVmuMwqOVSTzMXx2BKHrOrtwW%2BbpQ8x8rOC5E9P85c86MSDefTk%2BC9Hoee16B45ywR%2BbH7I9fK%2FZ27v%2BCE0gHQglXCHTFVSp7mk18KQw67BJqq5nJDAQ%2BtEdezGj2O5iiG2Amto3XgUbeSRvTi7iF&X-Amz-Signature=49b1065e9246c807c434b8fd2dc510c014fb12a3ceb2605034da70ee2a64ca68&X-Amz-SignedHeaders=host&response-content-disposition=inline))
- [script] add git_job_name, run_report and job_reports to
artifacts.json
- git_job_name: used to build a benchmark record if a git job failed [a
trick to grab model info]
- job_reports & run_report: we currently don't have extra info about
mobile job conclusions; this can be used to upload to time_series or a
notification system for failure details.



## PRs that simulate failure cases for the generation logic
Mimic a step failing before the benchmark test (no JSON
generated): #6397
Mimic the benchmark test step failing but with an artifact:
#6398
ExecuTorch Sync Test: pytorch/executorch#9204


## Details
When the flag is on, artifact.json is converted from
```
[ 
   ....
]
```
to

```
{
    "git_job_name": str,
    "artifacts": [...],
    "run_report": {...},
    "job_reports": [...]
}
```
This flag is temporary, used while the logic is brought into sync between the repos.
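The conversion above amounts to wrapping the old artifact list in a new top-level object; a minimal sketch (field names from the description above, everything else assumed):

```typescript
// Wrap the old artifact list (a bare JSON array) into the new output format.
interface NewArtifactOutput {
  git_job_name: string;
  artifacts: unknown[];
  run_report: Record<string, unknown>;
  job_reports: unknown[];
}

function wrapArtifacts(gitJobName: string, oldArtifacts: unknown[]): NewArtifactOutput {
  return {
    git_job_name: gitJobName, // always present, even when the job failed
    artifacts: oldArtifacts,
    run_report: {},
    job_reports: [],
  };
}
```

Because `git_job_name` is always set, a failed job still yields enough information to build its benchmark record.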
…the support for `c.` runners (#6403)

## Support for `c.` runners

We don't really support our canary environment well with variants (not
sure why, but this condition was not present), so variants got the
wrong naming when running in Meta's canary environment:
`variant.c.runner.name` instead of `c.variant.runner.name`. This fix is
very low impact and should not change anything in production.

## Simplifies the code

There is a useless `if` that I noticed when reading `getRunnerTypes`. I
am removing that conditional.

## Additional tests for getRunnerTypes

As we're adding empty variants as part of the ephemeral migration, I
want to make sure that we have green tests and we're covering this
situation in our unit tests.

The new situations to cover are:
1) Empty variants
2) meta's canary environment naming convention `c.`
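The naming rule being fixed can be sketched as follows (function name and signature are assumptions; only the `c.`-before-variant ordering comes from the description above):

```typescript
// Compose a runner type name. In Meta's canary environment the `c.` prefix
// must come first, producing `c.variant.runner`, not `variant.c.runner`.
function runnerTypeName(runner: string, variant: string, isCanary: boolean): string {
  // Empty variants (added for the ephemeral migration) contribute no segment.
  const base = variant ? `${variant}.${runner}` : runner;
  return isCanary ? `c.${base}` : base;
}
```

This also covers the two new test situations: an empty variant and the canary `c.` prefix.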
@jeanschmidt
Contributor Author

This PR got rebased again.

I'll close it, do a cherry-pick, and pick up from there.

jeanschmidt added a commit to pytorch/ci-infra that referenced this pull request Mar 17, 2025
# Releases test-infra to version v20250317-134413

It mostly includes the new lambda `scaleUpChron`:
pytorch/test-infra#6390