
Commit 6e8b14b

Merge pull request #4317 from ljqg/mft_on_cos
Feat: add script to automate nvidia-bug-report collection on GCE COS VM.
2 parents d733fad + 9e956da commit 6e8b14b

6 files changed: 1360 additions & 0 deletions
Lines changed: 54 additions & 0 deletions
# Copyright 2025 "Google LLC"
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Use Debian 12.10 as the base image.
FROM debian:12.10

# Set the working directory in the container.
WORKDIR /app

# Copy the local app directory into the container.
COPY app/* /app/

ENV PATH $PATH:/var/lib/nvidia/bin:/var/lib/nvidia/lib64:/usr/local/nvidia/bin:/usr/local/nvidia/lib64/
ENV LD_LIBRARY_PATH $LD_LIBRARY_PATH:/var/lib/nvidia/bin:/var/lib/nvidia/lib64:/usr/local/nvidia/bin:/usr/local/nvidia/lib64

# Install system dependencies.
RUN apt-get update && \
    apt-get install -y util-linux pciutils \
    build-essential zlib1g-dev libncurses5-dev libgdbm-dev libnss3-dev \
    libssl-dev libreadline-dev libffi-dev libsqlite3-dev wget libbz2-dev \
    acpica-tools kmod usbutils procps nano python3-pip python3.11-venv

# Build and install Python 3.13 from source.
RUN wget https://www.python.org/ftp/python/3.13.1/Python-3.13.1.tgz

RUN tar -xvf Python-3.13.1.tgz

RUN cd Python-3.13.1 && ./configure --enable-optimizations && \
    make -j $(nproc) && make altinstall

RUN ln -s /usr/local/bin/pip3.13 /usr/local/bin/pip3 && \
    ln -s /usr/local/bin/python3.13 /usr/local/bin/python3

# Create a virtual environment. (Activation does not persist across RUN layers,
# so subsequent steps invoke the venv binaries by path.)
RUN python3 -m venv .venv

# Install Python dependencies.
RUN .venv/bin/pip install --no-cache-dir -r requirements.txt

ENTRYPOINT [".venv/bin/python", "gce-cos-nvidia-bug-report.py"]
Lines changed: 302 additions & 0 deletions
# GCE COS NVIDIA Bug Report Collector

![Platform: GCE/COS](https://img.shields.io/badge/Platform-GCE%2FCOS-green.svg)

A universal tool to simplify the generation of NVIDIA bug reports on Google
Compute Engine (GCE) VMs that use the Container-Optimized OS (COS) guest
operating system.

This script provides a simple and reliable one-command experience to collect
standard `nvidia-bug-report` logs. For GPUs with **Blackwell** architectures and
newer, it automatically installs the
[NVIDIA MFT (Nvidia Firmware Tools)](https://docs.nvidia.com/networking/display/mftv4320)
to generate a more comprehensive report with deep hardware diagnostics.

--------------------------------------------------------------------------------
## 🤔 The Challenge: Getting a GPU Bug Report on GCE with COS

When troubleshooting GPU issues, the first step is often to generate an
`nvidia-bug-report`. However, doing so on a **Google Compute Engine (GCE)
virtual machine that uses Container-Optimized OS (COS) as its guest operating
system** can be less than trivial.

COS is a minimal, security-hardened operating system from Google, designed
specifically for running containers. By design, it does not include many
standard packages or libraries that general-purpose debug tools often rely on,
and it is designed to run userspace programs mainly through containers. This
stripped-down nature therefore requires some additional effort to collect and
export a comprehensive GPU bug report on COS systems.
### 🔬 Enhanced Bug Report for Blackwell & Newer GPUs

For newer GPU architectures like **NVIDIA Blackwell**, a standard NVIDIA bug
report, while useful, may not be sufficient for diagnosing complex
hardware-level issues, especially those related to
[NVLink](https://www.nvidia.com/en-us/data-center/nvlink/). A truly
comprehensive report requires deeper diagnostic data.

This is where the NVIDIA MFT suite becomes essential. You do not need to
interact with MFT directly; instead, the `nvidia-bug-report` script is designed
to automatically leverage the MFT utilities if they are present on the system.
By doing so, it can generate a far more comprehensive GPU bug report for
diagnostics.

When available, MFT allows the bug report to include critical, low-level
hardware data such as:

*   The physical layer status of NVLink connections.
*   Internal GPU register values and configuration data.
*   Raw diagnostic segments generated directly by the firmware.

However, setting up MFT is a cumbersome process on COS:

1.  **Kernel Module Handling**: A user must first locate and download the
    specific, **COS-signed** MFT kernel module that exactly matches their COS
    image version. Only a signed, version-matched module can be loaded into the
    COS kernel.
2.  **Userspace Program and Containerization**: Following the COS design
    philosophy, all applications should run in containers. This means the user
    must create a custom container that includes the MFT userspace programs,
    which must also be compatible with the kernel module.
3.  **Execution and Export**: The bug report generation must be triggered from
    within this custom container. Afterward, a mechanism is needed to export
    the final log file from the container to the host VM or a GCS bucket.
## 💡 Our Solution: A Smart, All-in-One Collector

This script eliminates all of the aforementioned complexity. It acts as a
universal collector that simplifies bug report generation for all users on GCE
with COS.

*   **For all supported GPUs**, it automates the steps needed to generate a
    standard `nvidia-bug-report`.
*   **For Blackwell and newer GPU architectures**, it automatically detects the
    hardware, handles the entire MFT setup process in the background, and then
    generates a more comprehensive bug report.

This transforms NVIDIA GPU bug report generation on COS into a single `docker`
command.

### ✨ Key Features

*   **Universal Collector for GCE COS**: A single, simple command to generate an
    `nvidia-bug-report` on any supported GCE VM with the COS guest OS.
*   **Automatic MFT Enhancement**: For Blackwell and newer GPUs, the script
    automatically installs and configures the NVIDIA MFT suite to unlock
    deeper, more comprehensive hardware diagnostics.
*   **Optional GCS Upload**: Directly uploads the final report to a Google
    Cloud Storage bucket for easy sharing and analysis.
## 📋 Prerequisites

Before running the script, ensure you have:

1.  A Google Compute Engine (GCE) GPU VM instance with **Container-Optimized OS
    (COS)** as its guest operating system.

2.  The GPU driver installed on the VM instance.

    *   Please refer to
        [COS's official documentation page](https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus#install)
        for more detail.
    *   Sample commands to install the GPU driver and verify the installation:

    ```bash
    # Install NVIDIA GPU Driver
    sudo cos-extensions install gpu -- --version=latest

    # Make the driver installation path executable by re-mounting it.
    sudo mount --bind /var/lib/nvidia /var/lib/nvidia
    sudo mount -o remount,exec /var/lib/nvidia

    # Display all GPUs
    /var/lib/nvidia/bin/nvidia-smi
    ```
3.  Docker configured to use your Artifact Registry credentials when
    interacting with Artifact Registry.

    *   Please refer to
        [Artifact Registry's authentication page](https://cloud.google.com/artifact-registry/docs/docker/authentication)
        for more detail.
    *   Sample commands to configure the Docker credential helper:

    ```bash
    ARTIFACT_REGISTRIES="us-central1-docker.pkg.dev"
    docker-credential-gcr configure-docker --registries=${ARTIFACT_REGISTRIES?}
    ```

4.  [Optional] If you would like to export the bug report to GCS, the VM's
    service account must have *at least* Storage Object Creator
    (`roles/storage.objectCreator`) permissions for the target bucket.

    *   The script attempts to create the specified GCS bucket when it does not
        exist in the project. If you would like to use this feature, your
        service account needs the **Storage Admin (`roles/storage.admin`)**
        role.

    *   Sample commands to grant the Storage Admin role to your project's
        default compute service account (a narrower, bucket-scoped alternative
        is sketched right after this list):

    ```bash
    PROJECT=... # your project id
    gcloud projects add-iam-policy-binding ${PROJECT?} \
      --member="serviceAccount:$(gcloud iam service-accounts list --project=${PROJECT?} \
        --filter="email~'-compute@developer.gserviceaccount.com'" --format="value(email)")" \
      --role='roles/storage.admin'
    ```
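If you only need the minimum bucket-level permission mentioned in step 4 (and
the bucket already exists), a narrower grant is possible. The following is a
sketch using `gcloud storage buckets add-iam-policy-binding`; `GCS_BUCKET` and
`SA_EMAIL` are placeholders you must fill in yourself:

```bash
GCS_BUCKET=...   # name of an existing target bucket
SA_EMAIL=...     # the VM's service account email

# Grant only Storage Object Creator on the target bucket instead of
# project-wide Storage Admin.
gcloud storage buckets add-iam-policy-binding gs://${GCS_BUCKET?} \
  --member="serviceAccount:${SA_EMAIL?}" \
  --role='roles/storage.objectCreator'
```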
## 🚀 Quick Start

This tool is designed to be run as a Docker container. The primary method of
use is a single `docker run` command.

Sample command to run on a VM with 8 GPUs:

Note: If you have a different number of GPUs on your system, adjust the
`--device /dev/nvidia<gpu_num>:/dev/nvidia<gpu_num>` flags in the docker
command accordingly (a quick way to check which device nodes exist is sketched
below).
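One quick way to see which NVIDIA device nodes actually exist on the host, so
the `--device` flags can be matched to your VM (a generic shell check, not part
of the tool):

```bash
# List the NVIDIA device nodes exposed on the COS host.
ls -l /dev/nvidia*
```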
Note: Exporting the final bug reports to a GCS bucket is optional. If you do
not intend to export them elsewhere, you may remove the
`--gcs_bucket=${GCS_BUCKET}` flag at the end.

```bash
docker run \
  --name gce-cos-bug-report \
  --pull=always \
  --privileged \
  --volume /etc:/etc_host \
  --volume /tmp:/tmp \
  --volume /var/lib/nvidia:/usr/local/nvidia \
  --device /dev/nvidia0:/dev/nvidia0 \
  --device /dev/nvidia1:/dev/nvidia1 \
  --device /dev/nvidia2:/dev/nvidia2 \
  --device /dev/nvidia3:/dev/nvidia3 \
  --device /dev/nvidia4:/dev/nvidia4 \
  --device /dev/nvidia5:/dev/nvidia5 \
  --device /dev/nvidia6:/dev/nvidia6 \
  --device /dev/nvidia7:/dev/nvidia7 \
  --device /dev/nvidia-uvm:/dev/nvidia-uvm \
  --device /dev/nvidiactl:/dev/nvidiactl \
  us-central1-docker.pkg.dev/gce-ai-infra/gce-cos-nvidia-bug-report-repo/gce-cos-nvidia-bug-report:latest \
  --gcs_bucket=${GCS_BUCKET}
```
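The collector prints its progress to stdout. If you detach from the terminal or
run it in the background, you can follow it with standard Docker commands; the
container name below matches the `--name` flag used above:

```bash
# Follow the collector's output until it finishes.
docker logs -f gce-cos-bug-report

# Check the container's exit code after it has stopped (0 means success).
docker inspect --format='{{.State.ExitCode}}' gce-cos-bug-report
```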
### 📝 Example Output

```bash
I0624 21:37:25.424124 137683091463040 gce-cos-nvidia-bug-report.py:817] Bug report logs are available locally at: /tmp/nvidia_bug_reports/utc_2025_06_24_21_35_44/vm_id_2858600067712410553
I0624 21:37:25.794605 137683091463040 gce-cos-nvidia-bug-report.py:312] Bucket [my-nv-bug-reports] already exists in the project.
I0624 21:37:26.308939 137683091463040 gce-cos-nvidia-bug-report.py:834] Bug report logs are available at: https://pantheon.corp.google.com/storage/browser/my-nv-bug-reports/bug_report/utc_2025_06_24_21_35_44/vm_id_2858600067712410553
```
Two files are generated as the final outputs: `instance_info.txt` and
`nvidia-bug-report.log.gz`.

```bash
$ ls -la /tmp/nvidia_bug_reports/utc_2025_06_24_21_35_44/vm_id_2858600067712410553
total 19576
drwxr-xr-x 2 root root 80 Jun 24 21:37 .
drwxr-xr-x 3 root root 60 Jun 24 21:35 ..
-rw-r--r-- 1 root root 293 Jun 24 21:37 instance_info.txt
-rw-r--r-- 1 root root 20038087 Jun 24 21:37 nvidia-bug-report.log.gz
```

The first file holds basic information about the GCE instance, e.g. the
project ID, instance ID, machine type, etc.

```bash
GCE Instance Info:
Project ID: gpu-test-project-staging
Instance ID: 8203180632673949960
Image: cos-121-18867-90-38
Zone: us-central1-staginga
Machine Type: a4-highgpu-8g
Architecture: Architecture.X86
MST Version: mst, mft 4.32.0-120, built on Apr 30 2025, 09:17:51. Git SHA Hash: N/A
```

The second file, generated by `nvidia-bug-report.sh`, contains more information
on the GPU devices and the system in general, including the GPU states, PCI
tree topology, system dmesg, etc.
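Because the report is a gzip archive, you can also skim it without unpacking
it; for example, a quick generic check for driver Xid errors captured in the
dmesg section (a sketch, not something the tool requires; the path is the one
from the example above):

```bash
# Search the compressed report for NVIDIA Xid error lines, if any were logged.
zgrep -i "xid" /tmp/nvidia_bug_reports/utc_2025_06_24_21_35_44/vm_id_2858600067712410553/nvidia-bug-report.log.gz
```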
*   If you are running with a GPU architecture supported by the NVIDIA MFT
    (e.g. B200), you can also validate that the GPU NVLink information is being
    recorded by searching for the keyword `Starting GPU MST dump..` in the
    unzipped log file:

```bash
$ sudo gunzip /tmp/nvidia_bug_reports/utc_2025_06_25_21_21_23/vm_id_8220847375056493254/nvidia-bug-report.log.gz
$ grep -m 1 -A 30 "Starting GPU MST dump.." /tmp/nvidia_bug_reports/utc_2025_06_25_21_21_23/vm_id_8220847375056493254/nvidia-bug-report.log
Starting GPU MST dump.../dev/mst/netir10497_00.cc.00_gpu7
____________________________________________
/usr/bin/mlxlink -d /dev/mst/netir10497_00.cc.00_gpu7 --amber_collect /tmp/mlx96.csv > /tmp/mlx96.info 2>&1

Operational Info
----------------
State : Active
Physical state : N/A
Speed : NVLink-XDR
Width : 2x
FEC : Interleaved_Standard_RS_FEC_PLR - (544,514)
Loopback Mode : No Loopback
Auto Negotiation : ON

Supported Info
--------------
Enabled Link Speed : 0x00000100 (XDR)
Supported Cable Speed : N/A

Troubleshooting Info
--------------------
Status Opcode : 0
Group Opcode : N/A
Recommendation : No issue was observed

Tool Information
----------------
Firmware Version : 36.2014.1676
amBER Version : 4.8
MFT Version : mft 4.32.0-120

```
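If you passed `--gcs_bucket`, the same files are also uploaded under the bucket
path shown in the example output at the top of this section. A sketch for
copying them to another machine with the `gcloud storage` CLI (bucket and path
taken from that example; substitute your own):

```bash
# Download the uploaded report directory from GCS to the current directory.
gcloud storage cp -r \
  gs://my-nv-bug-reports/bug_report/utc_2025_06_24_21_35_44/vm_id_2858600067712410553 .
```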
## 🛠️ Developer Guide: Modifying and Releasing Your Own Image

This section is for developers who wish to customize the script's behavior or
release their own version of the container image to a private Google Artifact
Registry.

### 1. Modifying the Code

*   The core logic for generating the bug report is located in the `app/`
    directory, with the main entry point being `gce-cos-nvidia-bug-report.py`.
*   The image and all relevant dependencies to run the Python file above are
    defined in the `Dockerfile`.
### 2. Building and Pushing to Artifact Registry

We provide a convenient shell script to build and push your customized image to
your own Artifact Registry.

You can do so by invoking the `build-and-push-gce-cos-nvidia-bug-report.sh`
script with the following parameters:

| Flag | Description                                                       | Required |
| :--- | :---------------------------------------------------------------- | :------- |
| `-p` | Your Google Cloud Project ID.                                     | **Yes**  |
| `-r` | The name of your Artifact Registry repository.                    | **Yes**  |
| `-i` | The name for your image.                                          | **Yes**  |
| `-l` | The region of your Artifact Registry. Defaults to `us-central1`.  | No       |
| `-h` | Display the help message.                                         | No       |

Sample command:

```bash
bash build-and-push-gce-cos-nvidia-bug-report.sh \
  -p ${PROJECT?} \
  -r ${ARTIFACT_REPO?} \
  -i "custom-bug-report-collector" \
  -l "us-east1"
```
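If you prefer not to use the helper script, roughly the same result can be
obtained with plain Docker commands. The sketch below mirrors the flags above;
the image URL follows the standard Artifact Registry naming convention, and the
exact tag is an assumption:

```bash
REGION="us-east1"                          # matches -l above
IMAGE_NAME="custom-bug-report-collector"   # matches -i above
IMAGE_URL="${REGION}-docker.pkg.dev/${PROJECT?}/${ARTIFACT_REPO?}/${IMAGE_NAME}:latest"

# Build the image from the directory containing the Dockerfile, then push it.
docker build -t "${IMAGE_URL}" .
docker push "${IMAGE_URL}"
```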
