-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
scaleup healing #6018
base: main
Are you sure you want to change the base?
scaleup healing #6018
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎ 1 Skipped Deployment
|
based on discussions with Camyll, there are a few things we need to change here:
(edit) fix wording for stuck jobs. |
970406a
to
ef7ee1e
Compare
...m-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/scale-up-chron.test.ts
Fixed
Show fixed
Hide fixed
const dimensions = new Map([['error', error]]); | ||
|
||
// should we add more information about this or do we not care since it'll be requeued? | ||
this.countEntry('run.scaleupchron.failure', 1, dimensions); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jeanschmidt do we think this enough information or are there more details you'd expect to see added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, we would like to monitor way more extensively.
-
for the cron execution:
normally we have the.run
,.success
and.failure.total.count
for the lambdas execution. But we also want to track the type of errors, like if a error is unexpected.failure.nonretryable.count
or expected but could be retriedfailure.retryable.count
. An example of a retryable error is stockouts, an non-retriable is invalid runner configuration. A example of a unexpected failure is a exception due to a not set value in Config. -
for the app logic:
Found instances, aggregated in total, aggregated by label, aggregated by org, full breakdown [org, label]. Then the same breakdown for tried to scale, success, and failure. You would be surprised how useful the try metrics is. As it should always be the same as found, and it is a simple way to spot problems. Same for the success + failure. Send the metrics for all found, then for each one you execute, send metric before trying, then with its results. -
for the external access:
we might just follow something like: https://github.com/pytorch/test-infra/blob/main/terraform-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/gh-runners.ts#L56
Can you add a description to this PR that provides more details about the purpose of this PR? |
TFLint and jest tests are failing. Please run Also, please add a detailed description on the change, and reasoning for it. |
readonly scaleConfigRepo: string; | ||
readonly scaleConfigRepoPath: string; | ||
readonly scaleUpRecordQueueUrl: string | undefined; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this variable doing? I could not find any use in your code except fail if it is not set....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ZainRizvi could you answer this? I'm also not sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I altered it. to be the constant for the URL which is what I thought it was meant to be? But lmk if not
terraform-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/scale-up-chron.ts
Outdated
Show resolved
Hide resolved
terraform-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/scale-up-chron.ts
Outdated
Show resolved
Hide resolved
terraform-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/scale-up-chron.ts
Outdated
Show resolved
Hide resolved
terraform-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/scale-up-chron.ts
Outdated
Show resolved
Hide resolved
...m-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/scale-up-chron.test.ts
Outdated
Show resolved
Hide resolved
import { getQueuedJobs } from './scale-up-chron'; | ||
|
||
|
||
test('getQueuedRunners should fetch queued runners', async () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as from scale-up-cron.test.ts :)
terraform-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/scale-up-chron.ts
Outdated
Show resolved
Hide resolved
terraform-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/scale-up-chron.ts
Outdated
Show resolved
Hide resolved
@@ -155,3 +156,35 @@ export async function scaleDown(event: ScheduledEvent, context: Context, callbac | |||
return callback('Failed'); | |||
} | |||
} | |||
|
|||
// eslint-disable-next-line @typescript-eslint/no-explicit-any | |||
export async function scaleUpChron(event: ScheduledEvent, context: Context, callback: any) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we want to test this as well, but here tests can be more relaxed
terraform-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/scale-up-chron.ts
Outdated
Show resolved
Hide resolved
terraform-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/scale-up-chron.ts
Outdated
Show resolved
Hide resolved
terraform-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/scale-up-chron.ts
Outdated
Show resolved
Hide resolved
terraform-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/scale-up-chron.ts
Outdated
Show resolved
Hide resolved
variable "lambda_timeout_scale_up_chron" { | ||
description = "Time out for the scale up chron lambda in seconds." | ||
type = number | ||
default = 60 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
way to short, no way we can scale hundreds that fast :)
but it is OK if you are overwriting this in the base module (didn't check).
Aim for 10 minutes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jeanschmidt looks like this was just copied from the existing scaleup timeout above this. Is it an issue there as well or is the concern specific to this method, and why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, just the value, 60 seconds, I would use 900 seconds (15 minutes)
For the stuck jobs, maybe we can consider ignoring jobs that are stuck for longer period. Say: 24h, we don't try to deploy runners for those jobs anymore, they are probably lost / stuck at this point. |
Release preparation and validation scripts update. 1. Display python versions when dumping METADATA contents of pypi package 2. Add torchao and torchdata changes
Sometimes SSM parameter is not properly created. After investigation I identified that the promise is not being properly awaited. What could cause some operations to be canceled.
Follow up after: #6242 Newer conda is having trouble installing py3.13t artifacts, please see: https://github.com/pytorch/vision/actions/runs/13659383307/job/38186887882#step:10:98 ``` Could not solve for environment specs The following packages are incompatible ├─ libpng =* * is installable with the potential options │ ├─ libpng 1.6.37 would require │ │ └─ zlib >=1.2.11,<1.3.0a0 *, which can be installed; │ └─ libpng 1.6.39 would require │ └─ zlib >=1.2.13,<1.3.0a0 *, which can be installed; └─ python-freethreading =* * is not installable because it requires ├─ cpython =3.13.2 *, which requires │ └─ python =3.13.2 * with the potential options │ ├─ python 3.13.2 would require │ │ └─ libzlib >=1.3.1,<2.0a0 *, which requires │ │ └─ zlib ==1.3.1 *_2, which conflicts with any installable versions previously reported; │ └─ python 3.13.2, which can be installed; └─ python_abi =* *_cp313t, which requires └─ python =3.13 *_cp313t, which conflicts with any installable versions previously reported. ``` As a consequence vision is compiled without png support and failing smoke tests Test PR: #6356
Allows the viable strict promotion script to use unstable issues. Jobs that have unstable issues open will be ignored in viable strict promotion. Tested with 0ef2e938d0a9a4b90434f98b5e128d0ffacaae26 (passed, only thing failing is libtorch debug build which has an issue right now) 96afa8a2bb78e5410a83038fd1e3f83911601700 (failed since there's something pending) c5d92edd5acfa56bae4f0c1057d667c6356fd6c1 (failed since lint failed)
#6363 this is a typo when refactor the benchmark code, simply remove the [], I think visual studio auto-gen triggered it
created a Makefile on `./terraform-aws-github-runner` to perform tflint actions, and replaced the tflint calls on CI (`tflint.yml`) with this makefile. This makes much easier to test locally and make sure to get green signals on CI. Reducing the loop time to fix small syntax bugs.
PR with the build, trunk and nightlies ci: https://github.com/pytorch/pytorch-canary/pull/250 |
run https://github.com/pytorch/test-infra/actions/runs/13731557397/job/38409446874#step:4:16 |
ohh you did rebase it :( it is difficult to parse changes in this PR, my recommendation is to always |
No description provided.