
Change GenAI OSS runner to fix OOM #4082


Open
wants to merge 1 commit into main

Conversation

spcyppt
Contributor

@spcyppt spcyppt commented May 6, 2025


Summary:
X-link: facebookresearch/FBGEMM#1166

Switch to runners with larger memory to fix OOM for GenAI OSS CI.

The GenAI OSS CI build jobs fail with:

```
The self-hosted runner lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.
```

The runners for CUDA 12.8 with Python 3.10, 3.12, and 3.13 consistently failed.
For example:
https://github.com/pytorch/FBGEMM/actions/runs/14731902981/job/41348012076
https://github.com/pytorch/FBGEMM/actions/runs/14727090721/job/41332229097
https://github.com/pytorch/FBGEMM/actions/runs/14722058017/job/41317562384

Root cause from huydhn:
- The error happens when the runner runs out of memory.
- This is a common bottleneck for the build job.
- linux.24.large probably spawns too many build processes given its higher number of CPU cores, leading to OOM (see the sketch below).
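
For illustration only, here is a minimal sketch of the kind of GitHub Actions change being described, assuming a hypothetical job layout and a placeholder runner label (the actual FBGEMM workflow files and the label chosen in this PR are not shown on this page):

```
# Hypothetical workflow excerpt -- real FBGEMM workflow files, job names,
# and runner labels may differ.
jobs:
  build-genai:
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.10", "3.12", "3.13"]
    # Before: a CPU-heavy runner; its high core count lets the build spawn
    # many parallel compile jobs, which can exhaust memory and OOM.
    # runs-on: linux.24.large
    # After: a larger-memory runner (placeholder label).
    runs-on: linux.24xlarge.memory
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - name: Build wheel
        run: python -m pip wheel . --no-deps -w dist/
```

The relevant change is only the `runs-on` label: more CPU cores without proportionally more memory raises the chance that a heavily parallel C++/CUDA build exhausts RAM, so pointing the job at a label that maps to a higher-memory machine avoids the OOM.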

Differential Revision: D74221851
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D74221851


netlify bot commented May 6, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

| Name | Link |
|------|------|
| 🔨 Latest commit | b7699b3 |
| 🔍 Latest deploy log | https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/681975a31c5db50008818df1 |
| 😎 Deploy Preview | https://deploy-preview-4082--pytorch-fbgemm-docs.netlify.app |
