
scaleup healing #6018

Draft · wants to merge 16 commits into base: main
Conversation

ZainRizvi
Contributor

No description provided.


vercel bot commented Dec 6, 2024

The latest updates on your projects.

1 Skipped Deployment: torchci ⬜️ Ignored (updated Mar 8, 2025 0:06am UTC)

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 6, 2024
@jeanschmidt
Contributor

jeanschmidt commented Mar 3, 2025

based on discussions with Camyll, there are a few things we need to change here:

  • add proper monitoring to all relevant external-system access (hud);
  • add proper monitoring to all relevant application logic (number of queued jobs found, number retried, success/failure, etc.);
  • don't send scale requests to an SQS queue; instead, process them right away;
  • to avoid getting stuck on a repeating failure and never processing other runners, randomise the order in which runners are tried;
  • make sure there is timeout logic in the lambda (if it times out, abort before AWS kills the lambda, and send metrics);
  • before scaling runners, first check whether there are stuck jobs: some jobs may be stuck and will never run even with available runners. This is a bug we saw many times on the GitHub side, and we can't fully trust that our monitoring is accurate;
  • we need to write proper unit tests;
  • gracefully handle exceptions and failures;

(edit) fix wording for stuck jobs.
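Two of the points above (randomised processing order and self-imposed lambda timeout logic) could be sketched roughly as below. This is only an illustration of the suggested behavior; `processRunnerType` and the return shape are hypothetical, not the PR's actual implementation:

```typescript
// Sketch: randomise runner order so a repeatedly failing runner type cannot
// permanently block the ones sorted after it, and enforce a deadline so the
// lambda can abort (and still emit metrics) before AWS terminates it.

function shuffle<T>(items: T[]): T[] {
  // Fisher-Yates shuffle over a copy; the input array is left untouched.
  const result = [...items];
  for (let i = result.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [result[i], result[j]] = [result[j], result[i]];
  }
  return result;
}

async function processWithDeadline(
  runnerTypes: string[],
  deadlineMs: number,
  processRunnerType: (runnerType: string) => Promise<void>, // hypothetical worker
): Promise<{ processed: number; timedOut: boolean }> {
  const start = Date.now();
  let processed = 0;
  for (const runnerType of shuffle(runnerTypes)) {
    if (Date.now() - start > deadlineMs) {
      // Abort before AWS does, leaving time to flush metrics.
      return { processed, timedOut: true };
    }
    await processRunnerType(runnerType);
    processed++;
  }
  return { processed, timedOut: false };
}
```

The remaining runner types are simply left for the next cron invocation, which is acceptable because the randomised order gives every type a chance over time.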

@Camyll Camyll force-pushed the zainr/ephemeral-scaleup-healing branch from 970406a to ef7ee1e Compare March 5, 2025 00:11
const dimensions = new Map([['error', error]]);

// should we add more information about this or do we not care since it'll be requeued?
this.countEntry('run.scaleupchron.failure', 1, dimensions);
Contributor

@jeanschmidt do we think this is enough information, or are there more details you'd expect to see added?

Contributor

yeah, we would like to monitor way more extensively.

  • for the cron execution:
    normally we have .run, .success and .failure.total.count for the lambda's execution. But we also want to track the type of error: if an error is unexpected, .failure.nonretryable.count; if expected but retriable, .failure.retryable.count. An example of a retryable error is a stockout; a non-retryable one is an invalid runner configuration. An example of an unexpected failure is an exception due to an unset value in Config.

  • for the app logic:
    found instances, aggregated in total, aggregated by label, aggregated by org, and the full [org, label] breakdown. Then the same breakdown for tried-to-scale, success, and failure. You would be surprised how useful the "tried" metric is: it should always equal "found", so it is a simple way to spot problems. The same goes for success + failure. Send the metrics for everything found; then, for each one you execute, send a metric before trying and another with its result.

  • for the external access:
    we might just follow something like: https://github.com/pytorch/test-infra/blob/main/terraform-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/gh-runners.ts#L56
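The naming scheme described above could look roughly like this; the metric names mirror the comment and the `run.scaleupchron.failure` call visible in this PR's diff, but the helper shapes here are hypothetical:

```typescript
// Sketch of the proposed metric scheme: run/success/failure counters split
// by error type, plus "found" broken down in total, by label, by org, and
// by the full [org, label] pair.

type MetricSink = (name: string, value: number) => void;

function recordCronResult(count: MetricSink, error?: Error & { retryable?: boolean }) {
  count('run.scaleupchron.run', 1);
  if (error === undefined) {
    count('run.scaleupchron.success', 1);
  } else if (error.retryable) {
    // e.g. a stockout: worth retrying on a later invocation
    count('run.scaleupchron.failure.retryable.count', 1);
  } else {
    // e.g. invalid runner configuration, or an unset Config value
    count('run.scaleupchron.failure.nonretryable.count', 1);
  }
}

function recordFound(count: MetricSink, org: string, label: string) {
  count('run.scaleupchron.found.total', 1);
  count(`run.scaleupchron.found.label.${label}`, 1);
  count(`run.scaleupchron.found.org.${org}`, 1);
  count(`run.scaleupchron.found.${org}.${label}`, 1);
}
```

Emitting "tried", "success", and "failure" with the same four-way breakdown then makes the `found === tried` and `tried === success + failure` invariants easy to check on a dashboard.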

@zxiiro
Collaborator

zxiiro commented Mar 5, 2025

Can you add a description to this PR that provides more details about its purpose?

@jeanschmidt
Contributor

TFLint and jest tests are failing. Please run `yarn commit-check` from terraform-aws-github-runner/modules/runners/lambdas/runners, and `make plan && make tflint` from terraform-aws-github-runner.

Also, please add a detailed description on the change, and reasoning for it.

readonly scaleConfigRepo: string;
readonly scaleConfigRepoPath: string;
readonly scaleUpRecordQueueUrl: string | undefined;
Contributor

what is this variable doing? I could not find any use for it in your code, except failing if it is not set...

Contributor

@ZainRizvi could you answer this? I'm also not sure

Contributor

I altered it to be the constant for the URL, which is what I thought it was meant to be? But lmk if not

import { getQueuedJobs } from './scale-up-chron';


test('getQueuedRunners should fetch queued runners', async () => {
Contributor

same as from scale-up-cron.test.ts :)

@@ -155,3 +156,35 @@ export async function scaleDown(event: ScheduledEvent, context: Context, callbac
return callback('Failed');
}
}

// eslint-disable-next-line @typescript-eslint/no-explicit-any
export async function scaleUpChron(event: ScheduledEvent, context: Context, callback: any) {
Contributor

we want to test this as well, but here tests can be more relaxed

variable "lambda_timeout_scale_up_chron" {
description = "Time out for the scale up chron lambda in seconds."
type = number
default = 60
Contributor

way too short, no way we can scale hundreds that fast :)

but it is OK if you are overriding this in the base module (didn't check).

Aim for 10 minutes.

Contributor

@jeanschmidt looks like this was just copied from the existing scaleup timeout above this. Is it an issue there as well or is the concern specific to this method, and why?

Contributor

no, just the value, 60 seconds, I would use 900 seconds (15 minutes)

@jeanschmidt
Contributor

For the stuck jobs, maybe we can consider ignoring jobs that have been stuck for a longer period. Say 24h: we don't try to deploy runners for those jobs anymore; they are probably lost/stuck at this point.
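The 24h cutoff suggested here could be sketched as a simple pre-filter over the queued jobs; `QueuedJob` and its `queuedAt` field are hypothetical shapes, not the PR's actual types:

```typescript
// Sketch: drop queued jobs older than a threshold before scaling, since
// jobs stuck that long are probably lost on the GitHub side and spinning
// up runners for them would be wasted capacity.

interface QueuedJob {
  id: number;
  queuedAt: Date;
}

const MAX_QUEUED_AGE_MS = 24 * 60 * 60 * 1000; // 24 hours

function jobsWorthScalingFor(jobs: QueuedJob[], now: Date = new Date()): QueuedJob[] {
  return jobs.filter(
    (job) => now.getTime() - job.queuedAt.getTime() <= MAX_QUEUED_AGE_MS,
  );
}
```

Jobs filtered out this way could still be counted in a dedicated metric so that a sudden spike in "too old" jobs is visible.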

Camyll and others added 9 commits March 5, 2025 17:58
Release preparation and validation scripts update.
1. Display python versions when dumping METADATA contents of pypi
package
2. Add torchao and torchdata changes
Sometimes the SSM parameter is not properly created. After investigation I
identified that the promise is not being properly awaited, which could
cause some operations to be canceled.
Follow up after:
#6242

Newer conda is having trouble installing py3.13t artifacts, please see:

https://github.com/pytorch/vision/actions/runs/13659383307/job/38186887882#step:10:98

```
Could not solve for environment specs
The following packages are incompatible
├─ libpng =* * is installable with the potential options
│  ├─ libpng 1.6.37 would require
│  │  └─ zlib >=1.2.11,<1.3.0a0 *, which can be installed;
│  └─ libpng 1.6.39 would require
│     └─ zlib >=1.2.13,<1.3.0a0 *, which can be installed;
└─ python-freethreading =* * is not installable because it requires
   ├─ cpython =3.13.2 *, which requires
   │  └─ python =3.13.2 * with the potential options
   │     ├─ python 3.13.2 would require
   │     │  └─ libzlib >=1.3.1,<2.0a0 *, which requires
   │     │     └─ zlib ==1.3.1 *_2, which conflicts with any installable versions previously reported;
   │     └─ python 3.13.2, which can be installed;
   └─ python_abi =* *_cp313t, which requires
      └─ python =3.13 *_cp313t, which conflicts with any installable versions previously reported.
```

As a consequence, vision is compiled without png support and fails
smoke tests

Test PR: #6356
Allows the viable strict promotion script to use unstable issues. Jobs
that have unstable issues open will be ignored in viable strict
promotion.

Tested with 
0ef2e938d0a9a4b90434f98b5e128d0ffacaae26 (passed, only thing failing is
libtorch debug build which has an issue right now)
96afa8a2bb78e5410a83038fd1e3f83911601700 (failed since there's something
pending)
c5d92edd5acfa56bae4f0c1057d667c6356fd6c1 (failed since lint failed)
#6363

this is a typo from refactoring the benchmark code; simply remove the []. I
think Visual Studio auto-gen triggered it
created a Makefile in `./terraform-aws-github-runner` to perform tflint
actions, and replaced the tflint calls in CI (`tflint.yml`) with this
Makefile.

This makes it much easier to test locally and to get green signals on CI,
reducing the loop time to fix small syntax bugs.
@jeanschmidt
Copy link
Contributor

PR with the build, trunk and nightlies ci: https://github.com/pytorch/pytorch-canary/pull/250

@jeanschmidt
Contributor

ohh you did rebase it :(

it is difficult to parse changes in this PR; my recommendation is to always `git pull origin main --no-rebase` so you'll get a merge commit...
