Commit 6e9f1e0
Change GenAI OSS runner to fix OOM
Summary:
Switch to runners with more memory to fix OOM failures in the GenAI OSS CI.
The GenAI OSS CI build jobs fail with:
```
The self-hosted runner lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.
```
The CUDA 12.8 build jobs for Python 3.10, 3.12, and 3.13 failed consistently.
For example:
https://github.com/pytorch/FBGEMM/actions/runs/14731902981/job/41348012076
https://github.com/pytorch/FBGEMM/actions/runs/14727090721/job/41332229097
https://github.com/pytorch/FBGEMM/actions/runs/14722058017/job/41317562384
Cause, per huydhn:
- The error happens when the runner runs out of memory.
- This is a common bottleneck for the build job.
- linux.24.large probably spawns too many processes because of its higher CPU core count, and so runs out of memory.
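To illustrate the cause described above (this is not what the commit changes; the commit only switches runners), here is a minimal sketch of how build parallelism that follows the core count can exceed available memory. The function name `capped_build_jobs` and the 4 GiB-per-job figure are hypothetical and are not measured values for the FBGEMM build or for any particular runner.

```python
def capped_build_jobs(cpu_count: int, total_mem_gib: float,
                      mem_per_job_gib: float = 4.0) -> int:
    """Return a parallel job count bounded by both CPU cores and memory.

    mem_per_job_gib is an assumed peak memory per compile process; the
    real figure depends on the build and would have to be measured.
    """
    # How many jobs fit in memory if each one peaks at mem_per_job_gib.
    by_memory = int(total_mem_gib // mem_per_job_gib)
    # Never exceed the core count, and always allow at least one job.
    return max(1, min(cpu_count, by_memory))


if __name__ == "__main__":
    # Hypothetical machine: many cores, but memory only fits half as many jobs.
    print(capped_build_jobs(cpu_count=48, total_mem_gib=96))
```

With one job per core and no memory bound, the hypothetical machine above would start 48 compile processes where only 24 fit in memory, which matches the failure mode described. Moving to a runner with more memory raises the memory bound instead of lowering the job count.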
Reviewed By: huydhn
Differential Revision: D74221851
fbshipit-source-id: 71ce45cd0e204ee8055cec154a02ae53fb581c96
1 parent 75d1a1f commit 6e9f1e0
2 files changed in .github/workflows (+2, -2): one line replaced at line 73 of the first file and one at line 72 of the second. [Diff content not preserved.]