Commit 6e9f1e0

spcyppt authored and facebook-github-bot committed
Change GenAI OSS runner to fix OOM
Summary: Switch to runners with more memory to fix OOM failures in GenAI OSS CI.

The GenAI OSS CI build jobs fail with
```
The self-hosted runner lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.
```

The CUDA 12.8 jobs for Python 3.10, 3.12, and 3.13 failed consistently. For example:
https://github.com/pytorch/FBGEMM/actions/runs/14731902981/job/41348012076
https://github.com/pytorch/FBGEMM/actions/runs/14727090721/job/41332229097
https://github.com/pytorch/FBGEMM/actions/runs/14722058017/job/41317562384

Cause from huydhn:
- The error happens when the runner runs out of memory.
- This is a common bottleneck for the build job.
- linux.24xlarge probably spawns too many processes given its higher number of CPU cores, and so runs out of memory.

Reviewed By: huydhn

Differential Revision: D74221851

fbshipit-source-id: 71ce45cd0e204ee8055cec154a02ae53fb581c96
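The cause given above (more CPU cores lead to more concurrent build processes, and so to higher peak memory) can be illustrated with a minimal sketch. The function name `max_parallel_jobs` and all the numbers are hypothetical, chosen only for illustration; they are not taken from the FBGEMM build scripts or from the actual runner specifications.

```python
def max_parallel_jobs(cpu_cores: int, total_mem_gb: float, mem_per_job_gb: float) -> int:
    """Return a parallel job count bounded by both CPU cores and memory.

    Hypothetical helper: a build that simply uses one job per core ignores
    the memory bound computed here, which is how a many-core machine can
    run out of memory.
    """
    if cpu_cores < 1 or mem_per_job_gb <= 0:
        raise ValueError("cpu_cores must be >= 1 and mem_per_job_gb must be > 0")
    # Number of jobs that fit in memory at the assumed per-job footprint.
    mem_bound = int(total_mem_gb // mem_per_job_gb)
    # Always allow at least one job so the build can make progress.
    return max(1, min(cpu_cores, mem_bound))


# Illustrative numbers only: 96 cores, 192 GB RAM, 4 GB per compile job.
print(max_parallel_jobs(cpu_cores=96, total_mem_gb=192, mem_per_job_gb=4))  # 48
```

With these assumed figures, one job per core would start 96 processes while only 48 fit in memory, which matches the failure mode described in the summary.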
1 parent 75d1a1f commit 6e9f1e0

2 files changed, +2 -2 lines changed

.github/workflows/fbgemm_gpu_ci_genai.yml

+1 -1
@@ -70,7 +70,7 @@ jobs:
       fail-fast: false
       matrix:
         host-machine: [
-          { arch: x86, instance: "linux.24xlarge" },
+          { arch: x86, instance: "linux.8xlarge.memory" },
         ]
         python-version: [ "3.9", "3.10", "3.11", "3.12", "3.13" ]
         cuda-version: [ "11.8.0", "12.6.3", "12.8.0" ]
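
To show how many build jobs the changed `instance` value affects, here is a minimal sketch that expands the matrix as a Cartesian product. It considers only the three keys visible in this hunk; the real workflow may define further keys or exclusions outside the hunk, and the helper `expand_matrix` is a hypothetical name, not part of GitHub Actions.

```python
from itertools import product

# Values copied from the hunk above (CI workflow, after the change).
host_machines = [{"arch": "x86", "instance": "linux.8xlarge.memory"}]
python_versions = ["3.9", "3.10", "3.11", "3.12", "3.13"]
cuda_versions = ["11.8.0", "12.6.3", "12.8.0"]


def expand_matrix(hosts, pythons, cudas):
    """Return one dict per combination, mimicking a matrix expansion."""
    return [
        {"host-machine": h, "python-version": p, "cuda-version": c}
        for h, p, c in product(hosts, pythons, cudas)
    ]


jobs = expand_matrix(host_machines, python_versions, cuda_versions)
print(len(jobs))  # 15
```

Under that assumption, all 15 combinations run on the new runner type, not only the CUDA 12.8 ones that were failing.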

.github/workflows/fbgemm_gpu_release_genai.yml

+1 -1
@@ -69,7 +69,7 @@ jobs:
       fail-fast: false
       matrix:
         host-machine: [
-          { arch: x86, instance: "linux.24xlarge" },
+          { arch: x86, instance: "linux.12xlarge.memory" },
         ]
         python-version: [ "3.9", "3.10", "3.11", "3.12", "3.13" ]
         cuda-version: [ "11.8.0", "12.6.3", "12.8.0" ]
