From b7699b34bd3667538b3e8d957f7c21bfeb8c8991 Mon Sep 17 00:00:00 2001
From: Supadchaya Puangpontip
Date: Mon, 5 May 2025 19:36:12 -0700
Subject: [PATCH] Change GenAI OSS runner to fix OOM

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/1166

Switch to runners with larger memory to fix OOM in the GenAI OSS CI.

The GenAI OSS CI build jobs fail with

```
The self-hosted runner lost communication with the server. Verify the
machine is running and has a healthy network connection. Anything in your
workflow that terminates the runner process, starves it for CPU/Memory, or
blocks its network access can cause this error.
```

The runners for CUDA 12.8 with Python 3.10, 3.12, and 3.13 consistently
failed. For example:
- https://github.com/pytorch/FBGEMM/actions/runs/14731902981/job/41348012076
- https://github.com/pytorch/FBGEMM/actions/runs/14727090721/job/41332229097
- https://github.com/pytorch/FBGEMM/actions/runs/14722058017/job/41317562384

Cause, from huydhn:
- The error happens when the runner runs out of memory.
- This is a common bottleneck for the build job.
- linux.24xlarge probably spawns too many processes given its higher number
  of CPU cores, causing the OOM.

Differential Revision: D74221851
---
 .github/workflows/fbgemm_gpu_ci_genai.yml      | 2 +-
 .github/workflows/fbgemm_gpu_release_genai.yml | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/fbgemm_gpu_ci_genai.yml b/.github/workflows/fbgemm_gpu_ci_genai.yml
index c7f79afa9b..8017374b3b 100644
--- a/.github/workflows/fbgemm_gpu_ci_genai.yml
+++ b/.github/workflows/fbgemm_gpu_ci_genai.yml
@@ -70,7 +70,7 @@ jobs:
       fail-fast: false
       matrix:
         host-machine: [
-          { arch: x86, instance: "linux.24xlarge" },
+          { arch: x86, instance: "linux.8xlarge.memory" },
         ]
         python-version: [ "3.9", "3.10", "3.11", "3.12", "3.13" ]
         cuda-version: [ "11.8.0", "12.6.3", "12.8.0" ]
diff --git a/.github/workflows/fbgemm_gpu_release_genai.yml b/.github/workflows/fbgemm_gpu_release_genai.yml
index e81b081e36..6a0d9627b3 100644
--- a/.github/workflows/fbgemm_gpu_release_genai.yml
+++ b/.github/workflows/fbgemm_gpu_release_genai.yml
@@ -69,7 +69,7 @@ jobs:
       fail-fast: false
       matrix:
         host-machine: [
-          { arch: x86, instance: "linux.24xlarge" },
+          { arch: x86, instance: "linux.12xlarge.memory" },
         ]
        python-version: [ "3.9", "3.10", "3.11", "3.12", "3.13" ]
        cuda-version: [ "11.8.0", "12.6.3", "12.8.0" ]