# GCE COS NVIDIA Bug Report Collector

A universal tool to simplify the generation of NVIDIA bug reports on Google
Compute Engine (GCE) VMs that use the Container-Optimized OS (COS) guest
operating system.

This script provides a simple and reliable one-command experience to collect
standard `nvidia-bug-report` logs. For GPUs with the **Blackwell** architecture
and newer, it automatically installs the
[NVIDIA MFT (NVIDIA Firmware Tools)](https://docs.nvidia.com/networking/display/mftv4320)
to generate a more comprehensive report with deep hardware diagnostics.

--------------------------------------------------------------------------------

## 🤔 The Challenge: Getting a GPU Bug Report on GCE with COS

When troubleshooting GPU issues, the first step is often to generate an
`nvidia-bug-report`. However, doing so on a **Google Compute Engine (GCE)
virtual machine that uses Container-Optimized OS (COS) as its guest operating
system** is not trivial.

COS is a minimal, security-hardened operating system from Google, designed
specifically for running containers. By design, it does not include many
standard packages or libraries that general-purpose debug tools often rely on,
and it is designed to run userspace programs mainly through containers. This
stripped-down nature requires additional effort to collect and export a
comprehensive GPU bug report on COS systems.

### 🔬 Enhanced Bug Report for Blackwell & Newer GPUs

For newer GPU architectures like **NVIDIA Blackwell**, a standard NVIDIA bug
report, while useful, may not be sufficient for diagnosing complex
hardware-level issues, especially those related to
[NVLink](https://www.nvidia.com/en-us/data-center/nvlink/). A truly
comprehensive report requires deeper diagnostic data.

This is where the NVIDIA MFT suite becomes essential. You do not need to
interact with MFT directly; instead, the `nvidia-bug-report.sh` script is
designed to automatically leverage the MFT utilities if they are present on the
system. By doing so, it can generate a far more comprehensive GPU bug report
for diagnostics.

When available, MFT allows the bug report to include critical, low-level
hardware data such as:

* The physical layer status of NVLink connections.
* Internal GPU register values and configuration data.
* Raw diagnostic segments generated directly by the firmware.

However, setting up MFT is a cumbersome process on COS:

1. **Kernel Module Handling**: A user must first locate and download the
   specific, **COS-signed** MFT kernel module that exactly corresponds to
   their COS image version. Only a signed, version-matched module can be
   loaded into the COS kernel.
2. **Userspace Program and Containerization**: Following the COS design
   philosophy, all applications should run in containers. This means the user
   must create a custom container that includes the MFT userspace programs,
   which must also be compatible with the kernel module.
3. **Execution and Export**: The bug report generation must be triggered from
   within this custom container. Afterward, a mechanism is needed to export the
   final log file from the container out to the host VM or a GCS bucket.
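
A rough sketch of that manual flow is shown below. Everything in it is
hypothetical: the module path, image name, and output path are placeholders,
not real artifacts; this tool performs the equivalent steps for you.

```bash
# 1. Load a COS-signed MST kernel module matching the exact COS build (placeholder path).
sudo insmod /path/to/cos-signed-mst-module.ko

# 2-3. Run the bug report inside a custom container that bundles the MFT userspace
#      tools (placeholder image name), landing the log on the host via the /tmp mount.
docker run --privileged --volume /tmp:/tmp my-custom-mft-image \
  nvidia-bug-report.sh --output-file /tmp/nvidia-bug-report.log.gz

# ...then copy /tmp/nvidia-bug-report.log.gz off the VM (for example, to a GCS bucket).
```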

## 💡 Our Solution: A Smart, All-in-One Collector

This script eliminates all of the aforementioned complexity. It acts as a
universal collector that simplifies bug report generation for all users on GCE
with COS.

* **For all supported GPUs**, it automates the steps needed to generate a
  standard `nvidia-bug-report`.
* **For Blackwell and newer GPU architectures**, it automatically detects the
  hardware, handles the entire MFT setup process in the background, and then
  generates a more comprehensive bug report.

This turns NVIDIA GPU bug report generation on COS into a single `docker run`
command.

### ✨ Key Features

* **Universal Collector for GCE COS**: A single, simple command to generate an
  `nvidia-bug-report` on any supported GCE VM with the COS guest OS.
* **Automatic MFT Enhancement**: For Blackwell and newer GPUs, the script
  automatically installs and configures the NVIDIA MFT suite to unlock deeper,
  more comprehensive hardware diagnostics.
* **Optional GCS Upload**: Directly uploads the final report to a Google Cloud
  Storage bucket for easy sharing and analysis.

## 📋 Prerequisites

Before running the script, ensure you have:

1. A Google Compute Engine (GCE) GPU VM instance with **Container-Optimized OS
   (COS)** as its guest operating system.

2. The NVIDIA GPU driver installed on the VM instance.

    * Please refer to
      [COS's official documentation page](https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus#install)
      for more detail.
    * Sample commands to install the GPU driver and verify the installation:

    ```bash
    # Install NVIDIA GPU Driver
    sudo cos-extensions install gpu -- --version=latest

    # Make the driver installation path executable by re-mounting it.
    sudo mount --bind /var/lib/nvidia /var/lib/nvidia
    sudo mount -o remount,exec /var/lib/nvidia

    # Display all GPUs
    /var/lib/nvidia/bin/nvidia-smi
    ```

3. Docker configured to use your Artifact Registry credentials when pulling the
   image from Artifact Registry.

    * Please refer to
      [Artifact Registry's authentication page](https://cloud.google.com/artifact-registry/docs/docker/authentication)
      for more detail.
    * Sample commands to configure the Docker credential helper:

    ```bash
    ARTIFACT_REGISTRIES="us-central1-docker.pkg.dev"
    docker-credential-gcr configure-docker --registries=${ARTIFACT_REGISTRIES?}
    ```
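
    Alternatively, where the `gcloud` CLI is available (for example, when
    pulling the image from a workstation rather than the COS VM, which already
    provides `docker-credential-gcr`), the same configuration can be done with
    the standard Artifact Registry helper. This is a sketch, not a requirement
    on COS:

    ```bash
    gcloud auth configure-docker us-central1-docker.pkg.dev
    ```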

4. [Optional] If you would like to export the bug report to GCS, the VM's
   service account must have *at least* Storage Object Creator
   (`roles/storage.objectCreator`) permissions on the target bucket.

    * The script will attempt to create the specified GCS bucket if it does not
      already exist in the project. If you would like to use this feature, the
      service account needs the **Storage Admin (`roles/storage.admin`)** role.

    * Sample commands to grant the Storage Admin role to your project's
      default compute service account:

    ```bash
    PROJECT=... # your project id
    gcloud projects add-iam-policy-binding ${PROJECT?} \
      --member="serviceAccount:$(gcloud iam service-accounts list --project=${PROJECT?} \
        --filter="email~'-compute@developer.gserviceaccount.com'" --format="value(email)")" \
      --role='roles/storage.admin'
    ```
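
    If the bucket already exists and you prefer the minimal permission, here is
    a sketch of granting only Storage Object Creator on that bucket (assumes
    the `gcloud storage` command group is available and that `SERVICE_ACCOUNT`
    is your VM's service account email):

    ```bash
    GCS_BUCKET=...       # target bucket name (without the gs:// prefix)
    SERVICE_ACCOUNT=...  # the VM's service account email
    gcloud storage buckets add-iam-policy-binding gs://${GCS_BUCKET?} \
      --member="serviceAccount:${SERVICE_ACCOUNT?}" \
      --role="roles/storage.objectCreator"
    ```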

## 🚀 Quick Start

This tool is designed to be run as a Docker container. The primary method of
use is a single `docker run` command.

Sample command to run on a VM with 8 GPUs:

Note: If you have a different number of GPUs on your system, adjust the
`--device /dev/nvidia<gpu_num>:/dev/nvidia<gpu_num>` flags in the docker
command accordingly.

Note: Exporting the final bug report to a GCS bucket is optional. If you do not
intend to export it elsewhere, you may remove the `--gcs_bucket=${GCS_BUCKET}`
flag at the end.

```bash
docker run \
  --name gce-cos-bug-report \
  --pull=always \
  --privileged \
  --volume /etc:/etc_host \
  --volume /tmp:/tmp \
  --volume /var/lib/nvidia:/usr/local/nvidia \
  --device /dev/nvidia0:/dev/nvidia0 \
  --device /dev/nvidia1:/dev/nvidia1 \
  --device /dev/nvidia2:/dev/nvidia2 \
  --device /dev/nvidia3:/dev/nvidia3 \
  --device /dev/nvidia4:/dev/nvidia4 \
  --device /dev/nvidia5:/dev/nvidia5 \
  --device /dev/nvidia6:/dev/nvidia6 \
  --device /dev/nvidia7:/dev/nvidia7 \
  --device /dev/nvidia-uvm:/dev/nvidia-uvm \
  --device /dev/nvidiactl:/dev/nvidiactl \
  us-central1-docker.pkg.dev/gce-ai-infra/gce-cos-nvidia-bug-report-repo/gce-cos-nvidia-bug-report:latest \
  --gcs_bucket=${GCS_BUCKET}
```
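
If your VM has a different number of GPUs, one way to generate the `--device`
flags instead of typing them by hand is a small shell loop such as the
following (a sketch; `NUM_GPUS` is a placeholder you set to your GPU count):

```bash
NUM_GPUS=8   # set to the number of GPUs on your VM
DEVICE_FLAGS=""
for i in $(seq 0 $((NUM_GPUS - 1))); do
  DEVICE_FLAGS="${DEVICE_FLAGS} --device /dev/nvidia${i}:/dev/nvidia${i}"
done
echo "${DEVICE_FLAGS}"   # paste in place of the --device /dev/nvidia<gpu_num> lines above
```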

### 📝 Example Output

```bash
I0624 21:37:25.424124 137683091463040 gce-cos-nvidia-bug-report.py:817] Bug report logs are available locally at: /tmp/nvidia_bug_reports/utc_2025_06_24_21_35_44/vm_id_2858600067712410553
I0624 21:37:25.794605 137683091463040 gce-cos-nvidia-bug-report.py:312] Bucket [my-nv-bug-reports] already exists in the project.
I0624 21:37:26.308939 137683091463040 gce-cos-nvidia-bug-report.py:834] Bug report logs are available at: https://pantheon.corp.google.com/storage/browser/my-nv-bug-reports/bug_report/utc_2025_06_24_21_35_44/vm_id_2858600067712410553
```
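
If you exported the report to GCS, one way to copy it to a workstation is shown
below (a sketch; the bucket and path are taken from the example log lines above
and will differ on your system, and it assumes the `gcloud` CLI is installed
locally):

```bash
gcloud storage cp -r \
  gs://my-nv-bug-reports/bug_report/utc_2025_06_24_21_35_44/vm_id_2858600067712410553 .
```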

Two files are generated as the final outputs: `instance_info.txt` and
`nvidia-bug-report.log.gz`.

```bash
$ ls -la /tmp/nvidia_bug_reports/utc_2025_06_24_21_35_44/vm_id_2858600067712410553
total 19576
drwxr-xr-x 2 root root 80 Jun 24 21:37 .
drwxr-xr-x 3 root root 60 Jun 24 21:35 ..
-rw-r--r-- 1 root root 293 Jun 24 21:37 instance_info.txt
-rw-r--r-- 1 root root 20038087 Jun 24 21:37 nvidia-bug-report.log.gz
```

The first file holds basic information about the GCE instance, e.g. the
project ID, instance ID, machine type, etc.

```bash
GCE Instance Info:
  Project ID: gpu-test-project-staging
  Instance ID: 8203180632673949960
  Image: cos-121-18867-90-38
  Zone: us-central1-staginga
  Machine Type: a4-highgpu-8g
  Architecture: Architecture.X86
  MST Version: mst, mft 4.32.0-120, built on Apr 30 2025, 09:17:51. Git SHA Hash: N/A
```

The second file, generated by `nvidia-bug-report.sh`, contains more information
on the GPU devices and the system in general, including the GPU states, PCI
tree topology, system dmesg, etc.

* If you are running with a GPU architecture supported by the NVIDIA MFT (e.g.
  B200), you can also confirm that the GPU NVLink information is recorded. You
  can verify this by searching for the keyword `Starting GPU MST dump..` in the
  unzipped log file:

```bash
$ sudo gunzip /tmp/nvidia_bug_reports/utc_2025_06_25_21_21_23/vm_id_8220847375056493254/nvidia-bug-report.log.gz
$ grep -m 1 -A 30 "Starting GPU MST dump.." /tmp/nvidia_bug_reports/utc_2025_06_25_21_21_23/vm_id_8220847375056493254/nvidia-bug-report.log
Starting GPU MST dump.../dev/mst/netir10497_00.cc.00_gpu7
____________________________________________
/usr/bin/mlxlink -d /dev/mst/netir10497_00.cc.00_gpu7 --amber_collect /tmp/mlx96.csv > /tmp/mlx96.info 2>&1

Operational Info
----------------
State : Active
Physical state : N/A
Speed : NVLink-XDR
Width : 2x
FEC : Interleaved_Standard_RS_FEC_PLR - (544,514)
Loopback Mode : No Loopback
Auto Negotiation : ON

Supported Info
--------------
Enabled Link Speed : 0x00000100 (XDR)
Supported Cable Speed : N/A

Troubleshooting Info
--------------------
Status Opcode : 0
Group Opcode : N/A
Recommendation : No issue was observed

Tool Information
----------------
Firmware Version : 36.2014.1676
amBER Version : 4.8
MFT Version : mft 4.32.0-120

```

## 🛠️ Developer Guide: Modifying and Releasing Your Own Image

This section is for developers who wish to customize the script's behavior or
release their own version of the container image to a private Google Artifact
Registry.

### 1. Modifying the Code

* The core logic for generating the bug report is located in the `app/`
  directory, with the main entry point being `gce-cos-nvidia-bug-report.py`.
* The image and all dependencies needed to run that Python entry point are
  defined in the `Dockerfile`.

### 2. Building and Pushing to Artifact Registry

We provide a convenient shell script to build and push your customized image to
your own Artifact Registry.

You can do so by invoking the `build-and-push-gce-cos-nvidia-bug-report.sh`
script with the following parameters:

| Flag | Description                                                       | Required |
| :--- | :---------------------------------------------------------------- | :------- |
| `-p` | Your Google Cloud Project ID.                                     | **Yes**  |
| `-r` | The name of your Artifact Registry repository.                    | **Yes**  |
| `-i` | The name for your image.                                          | **Yes**  |
| `-l` | The region of your Artifact Registry. Defaults to `us-central1`.  | No       |
| `-h` | Display the help message.                                         | No       |

Sample command:

```bash
bash build-and-push-gce-cos-nvidia-bug-report.sh \
  -p ${PROJECT?} \
  -r ${ARTIFACT_REPO?} \
  -i "custom-bug-report-collector" \
  -l "us-east1"
```
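
After the push succeeds, you can run the collector with your own build by
replacing the image path in the Quick Start command with your image's Artifact
Registry URI. A sketch assuming the parameters above and that the script tags
the image as `latest`:

```bash
docker pull us-east1-docker.pkg.dev/${PROJECT?}/${ARTIFACT_REPO?}/custom-bug-report-collector:latest
```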