From b7699b34bd3667538b3e8d957f7c21bfeb8c8991 Mon Sep 17 00:00:00 2001
From: Supadchaya Puangpontip
Date: Mon, 5 May 2025 19:36:12 -0700
Subject: [PATCH] Change GenAI OSS runner to fix OOM

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/1166

Switch to runners with larger memory to fix OOM in the GenAI OSS CI.

The GenAI OSS CI build jobs fail with

```
The self-hosted runner lost communication with the server. Verify the
machine is running and has a healthy network connection. Anything in your
workflow that terminates the runner process, starves it for CPU/Memory, or
blocks its network access can cause this error.
```

The runners for CUDA 12.8 with Python 3.10, 3.12, and 3.13 consistently
failed. For example:
- https://github.com/pytorch/FBGEMM/actions/runs/14731902981/job/41348012076
- https://github.com/pytorch/FBGEMM/actions/runs/14727090721/job/41332229097
- https://github.com/pytorch/FBGEMM/actions/runs/14722058017/job/41317562384

Cause, from huydhn:
- The error happens when the runner runs out of memory.
- This is a common bottleneck for the build job.
- linux.24xlarge probably spawns too many processes given its higher number
  of CPU cores, causing the OOM.

Differential Revision: D74221851
---
 .github/workflows/fbgemm_gpu_ci_genai.yml      | 2 +-
 .github/workflows/fbgemm_gpu_release_genai.yml | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/fbgemm_gpu_ci_genai.yml b/.github/workflows/fbgemm_gpu_ci_genai.yml
index c7f79afa9b..8017374b3b 100644
--- a/.github/workflows/fbgemm_gpu_ci_genai.yml
+++ b/.github/workflows/fbgemm_gpu_ci_genai.yml
@@ -70,7 +70,7 @@ jobs:
       fail-fast: false
       matrix:
         host-machine: [
-          { arch: x86, instance: "linux.24xlarge" },
+          { arch: x86, instance: "linux.8xlarge.memory" },
         ]
         python-version: [ "3.9", "3.10", "3.11", "3.12", "3.13" ]
         cuda-version: [ "11.8.0", "12.6.3", "12.8.0" ]
diff --git a/.github/workflows/fbgemm_gpu_release_genai.yml b/.github/workflows/fbgemm_gpu_release_genai.yml
index e81b081e36..6a0d9627b3 100644
--- a/.github/workflows/fbgemm_gpu_release_genai.yml
+++ b/.github/workflows/fbgemm_gpu_release_genai.yml
@@ -69,7 +69,7 @@ jobs:
       fail-fast: false
       matrix:
         host-machine: [
-          { arch: x86, instance: "linux.24xlarge" },
+          { arch: x86, instance: "linux.12xlarge.memory" },
         ]
        python-version: [ "3.9", "3.10", "3.11", "3.12", "3.13" ]
        cuda-version: [ "11.8.0", "12.6.3", "12.8.0" ]