
scaleup healing #6018

Draft · wants to merge 16 commits into base: main
Conversation

ZainRizvi
Contributor

No description provided.


vercel bot commented Dec 6, 2024

The latest updates on your projects.

1 Skipped Deployment: torchci ⬜️ Ignored (updated Mar 8, 2025 0:06am UTC)

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 6, 2024
@jeanschmidt
Contributor

jeanschmidt commented Mar 3, 2025

based on discussions with Camyll, there are a few things we need to change here:

  • add proper monitoring to all relevant external-system access (hud);
  • add proper monitoring to all relevant application logic (number of queued jobs found, number retried, success/failure, etc.);
  • don't send scale requests to an SQS queue; instead, process them right away;
  • to avoid getting stuck on a repeating failure and never processing other runners, randomise the order in which runners are tried;
  • make sure there is timeout logic in the lambda (if it times out, abort before AWS kills the lambda, and send metrics);
  • before scaling runners, first check whether there are stuck jobs: some jobs may be stuck and will never run even with available runners. This is a bug we saw many times on the GitHub side, and we can't fully trust that our monitoring is accurate;
  • we need to write proper unit tests;
  • gracefully handle exceptions and failures;

(edit) fix wording for stuck jobs.
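Two of the points above (randomised processing order and self-imposed lambda timeout logic) could be sketched roughly as below. This is only an illustration of the suggested behavior; `processRunnerType` and the return shape are hypothetical, not the PR's actual implementation:

```typescript
// Sketch: randomise runner order so a repeatedly failing runner type cannot
// permanently block the ones sorted after it, and enforce a deadline so the
// lambda can abort (and still emit metrics) before AWS terminates it.

function shuffle<T>(items: T[]): T[] {
  // Fisher-Yates shuffle over a copy; the input array is left untouched.
  const result = [...items];
  for (let i = result.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [result[i], result[j]] = [result[j], result[i]];
  }
  return result;
}

async function processWithDeadline(
  runnerTypes: string[],
  deadlineMs: number,
  processRunnerType: (runnerType: string) => Promise<void>, // hypothetical worker
): Promise<{ processed: number; timedOut: boolean }> {
  const start = Date.now();
  let processed = 0;
  for (const runnerType of shuffle(runnerTypes)) {
    if (Date.now() - start > deadlineMs) {
      // Abort before AWS does, leaving time to flush metrics.
      return { processed, timedOut: true };
    }
    await processRunnerType(runnerType);
    processed++;
  }
  return { processed, timedOut: false };
}
```

The remaining runner types are simply left for the next cron invocation, which is acceptable because the randomised order gives every type a chance over time.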

@Camyll Camyll force-pushed the zainr/ephemeral-scaleup-healing branch from 970406a to ef7ee1e Compare March 5, 2025 00:11
const dimensions = new Map([['error', error]]);

// should we add more information about this or do we not care since it'll be requeued?
this.countEntry('run.scaleupchron.failure', 1, dimensions);
Contributor

@jeanschmidt do we think this is enough information, or are there more details you'd expect to see added?

Contributor

yeah, we would like to monitor way more extensively.

  • for the cron execution:
    normally we have .run, .success and .failure.total.count for the lambda's execution. But we also want to track the type of error: if an error is unexpected, .failure.nonretryable.count; if expected but retriable, .failure.retryable.count. An example of a retryable error is a stockout; a non-retryable one is an invalid runner configuration. An example of an unexpected failure is an exception due to an unset value in Config.

  • for the app logic:
    found instances, aggregated in total, aggregated by label, aggregated by org, and the full [org, label] breakdown. Then the same breakdown for tried-to-scale, success, and failure. You would be surprised how useful the "tried" metric is: it should always equal "found", so it is a simple way to spot problems. The same goes for success + failure. Send the metrics for everything found; then, for each one you execute, send a metric before trying and another with its result.

  • for the external access:
    we might just follow something like: https://github.com/pytorch/test-infra/blob/main/terraform-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/gh-runners.ts#L56
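The naming scheme described above could look roughly like this; the metric names mirror the comment and the `run.scaleupchron.failure` call visible in this PR's diff, but the helper shapes here are hypothetical:

```typescript
// Sketch of the proposed metric scheme: run/success/failure counters split
// by error type, plus "found" broken down in total, by label, by org, and
// by the full [org, label] pair.

type MetricSink = (name: string, value: number) => void;

function recordCronResult(count: MetricSink, error?: Error & { retryable?: boolean }) {
  count('run.scaleupchron.run', 1);
  if (error === undefined) {
    count('run.scaleupchron.success', 1);
  } else if (error.retryable) {
    // e.g. a stockout: worth retrying on a later invocation
    count('run.scaleupchron.failure.retryable.count', 1);
  } else {
    // e.g. invalid runner configuration, or an unset Config value
    count('run.scaleupchron.failure.nonretryable.count', 1);
  }
}

function recordFound(count: MetricSink, org: string, label: string) {
  count('run.scaleupchron.found.total', 1);
  count(`run.scaleupchron.found.label.${label}`, 1);
  count(`run.scaleupchron.found.org.${org}`, 1);
  count(`run.scaleupchron.found.${org}.${label}`, 1);
}
```

Emitting "tried", "success", and "failure" with the same four-way breakdown then makes the `found === tried` and `tried === success + failure` invariants easy to check on a dashboard.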

@zxiiro
Collaborator

zxiiro commented Mar 5, 2025

Can you add a description to this PR that provides more details about its purpose?

@jeanschmidt
Contributor

TFLint and jest tests are failing. Please run `yarn commit-check` from terraform-aws-github-runner/modules/runners/lambdas/runners, and `make plan && make tflint` from terraform-aws-github-runner.

Also, please add a detailed description on the change, and reasoning for it.

readonly scaleConfigRepo: string;
readonly scaleConfigRepoPath: string;
readonly scaleUpRecordQueueUrl: string | undefined;
Contributor

what is this variable doing? I could not find any use for it in your code, except failing if it is not set...

Contributor

@ZainRizvi could you answer this? I'm also not sure

Contributor

I altered it to be the constant for the URL, which is what I thought it was meant to be? But lmk if not

import { getQueuedJobs } from './scale-up-chron';


test('getQueuedRunners should fetch queued runners', async () => {
Contributor

same as from scale-up-cron.test.ts :)

@@ -155,3 +156,35 @@ export async function scaleDown(event: ScheduledEvent, context: Context, callbac
return callback('Failed');
}
}

// eslint-disable-next-line @typescript-eslint/no-explicit-any
export async function scaleUpChron(event: ScheduledEvent, context: Context, callback: any) {
Contributor

we want to test this as well, but here tests can be more relaxed

variable "lambda_timeout_scale_up_chron" {
description = "Time out for the scale up chron lambda in seconds."
type = number
default = 60
Contributor

way too short, no way we can scale hundreds that fast :)

but it is OK if you are overriding this in the base module (didn't check).

Aim for 10 minutes.

Contributor

@jeanschmidt looks like this was just copied from the existing scaleup timeout above this. Is it an issue there as well or is the concern specific to this method, and why?

Contributor

no, just the value, 60 seconds, I would use 900 seconds (15 minutes)

@jeanschmidt
Contributor

For the stuck jobs, maybe we can consider ignoring jobs that have been stuck for a longer period. Say 24h: we don't try to deploy runners for those jobs anymore; they are probably lost/stuck at this point.
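The 24h cutoff suggested here could be sketched as a simple pre-filter over the queued jobs; `QueuedJob` and its `queuedAt` field are hypothetical shapes, not the PR's actual types:

```typescript
// Sketch: drop queued jobs older than a threshold before scaling, since
// jobs stuck that long are probably lost on the GitHub side and spinning
// up runners for them would be wasted capacity.

interface QueuedJob {
  id: number;
  queuedAt: Date;
}

const MAX_QUEUED_AGE_MS = 24 * 60 * 60 * 1000; // 24 hours

function jobsWorthScalingFor(jobs: QueuedJob[], now: Date = new Date()): QueuedJob[] {
  return jobs.filter(
    (job) => now.getTime() - job.queuedAt.getTime() <= MAX_QUEUED_AGE_MS,
  );
}
```

Jobs filtered out this way could still be counted in a dedicated metric so that a sudden spike in "too old" jobs is visible.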

Camyll and others added 9 commits March 5, 2025 17:58
Release preparation and validation scripts update.
1. Display python versions when dumping METADATA contents of pypi
package
2. Add torchao and torchdata changes
Sometimes the SSM parameter is not properly created. After investigation I
identified that the promise is not being properly awaited, which could
cause some operations to be canceled.
Follow up after:
#6242

Newer conda is having trouble installing py3.13t artifacts, please see:

https://github.com/pytorch/vision/actions/runs/13659383307/job/38186887882#step:10:98

```
Could not solve for environment specs
The following packages are incompatible
├─ libpng =* * is installable with the potential options
│  ├─ libpng 1.6.37 would require
│  │  └─ zlib >=1.2.11,<1.3.0a0 *, which can be installed;
│  └─ libpng 1.6.39 would require
│     └─ zlib >=1.2.13,<1.3.0a0 *, which can be installed;
└─ python-freethreading =* * is not installable because it requires
   ├─ cpython =3.13.2 *, which requires
   │  └─ python =3.13.2 * with the potential options
   │     ├─ python 3.13.2 would require
   │     │  └─ libzlib >=1.3.1,<2.0a0 *, which requires
   │     │     └─ zlib ==1.3.1 *_2, which conflicts with any installable versions previously reported;
   │     └─ python 3.13.2, which can be installed;
   └─ python_abi =* *_cp313t, which requires
      └─ python =3.13 *_cp313t, which conflicts with any installable versions previously reported.
```

As a consequence, vision is compiled without png support and fails
smoke tests

Test PR: #6356
Allows the viable strict promotion script to use unstable issues. Jobs
that have unstable issues open will be ignored in viable strict
promotion.

Tested with 
0ef2e938d0a9a4b90434f98b5e128d0ffacaae26 (passed, only thing failing is
libtorch debug build which has an issue right now)
96afa8a2bb78e5410a83038fd1e3f83911601700 (failed since there's something
pending)
c5d92edd5acfa56bae4f0c1057d667c6356fd6c1 (failed since lint failed)
#6363

this is a typo from refactoring the benchmark code; simply remove the []. I
think Visual Studio auto-gen triggered it
created a Makefile in `./terraform-aws-github-runner` to perform tflint
actions, and replaced the tflint calls in CI (`tflint.yml`) with this
Makefile.

This makes it much easier to test locally and to get green signals on CI,
reducing the loop time to fix small syntax bugs.
@jeanschmidt
Copy link
Contributor

PR with the build, trunk and nightlies ci: https://github.com/pytorch/pytorch-canary/pull/250

@jeanschmidt
Contributor

ohh you did rebase it :(

it is difficult to parse changes in this PR; my recommendation is to always `git pull origin main --no-rebase` so you'll get a merge commit...
