Issue after GPU memory corruption #827

Open
@iconoclasts

Description

Hello! I was working with Sionna v1.0.1 and ran into a problem after seeing the error "GPU memory is corrupted" (unfortunately, I didn't capture that message). Although I am going to reflash the server and reconfigure the NVIDIA driver and CUDA, I would like to report the issue.

After I saw the message "GPU memory is corrupted", I rebooted the server and tried `nvidia-smi`. It turned out that the drivers were not working, so I reinstalled them. After that, both `nvidia-smi` and `nvcc --version` worked fine (TensorFlow also had no problem detecting the two GPUs).

[Screenshots of the `nvidia-smi` and `nvcc --version` output]

However, the kernel always dies as soon as the code calls the `load_scene` function. I also created another virtual environment, but the kernel still dies at the same cell. When I run the first cell, which imports the packages, it shows this message.

jit_find_library(): Unable to load "/usr/lib/x86_64-linux-gnu": /usr/lib/x86_64-linux-gnu: cannot read file data: Is a directory!
2025-04-08 19:17:54.809230: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-04-08 19:17:55.093570: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1744139875.233963   25330 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1744139875.277925   25330 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1744139875.556433   25330 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1744139875.556445   25330 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1744139875.556448   25330 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1744139875.556450   25330 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-04-08 19:17:55.584220: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Num GPUs Available 2

The second cell only calls the `load_scene` function, but the kernel always dies in this second cell (a minimal sketch of the two cells follows the log below). This is what I get in the Jupyter output.

19:25:05.399 [error] Disposing session as kernel process died ExitCode: undefined, Reason: 2025-04-08 19:24:54.516322: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-04-08 19:24:54.530610: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1744140294.547673   26032 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1744140294.553177   26032 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1744140294.566128   26032 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1744140294.566140   26032 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1744140294.566142   26032 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1744140294.566143   26032 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-04-08 19:24:54.570009: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
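
For reference, the notebook boils down to the following two cells. This is a minimal sketch: my notebook loads an actual scene file, and `load_scene()` without an argument (which loads an empty scene) is only used here for illustration.

```python
# Cell 1: imports -- this cell runs, but prints the warnings shown above
import tensorflow as tf
from sionna.rt import load_scene

print("Num GPUs Available", len(tf.config.list_physical_devices('GPU')))
```

```python
# Cell 2: the kernel dies as soon as this executes
scene = load_scene()  # in my notebook this points to a scene file
```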

It seems like both TensorFlow and Mitsuba (or Dr.Jit) are trying to acquire the GPU and clash.
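
One thing I plan to test is asking TensorFlow to allocate GPU memory on demand before Sionna RT (Mitsuba/Dr.Jit) initializes its own CUDA context, so the two frameworks are less likely to fight over the device. This is only a sketch of the idea, using the standard `tf.config` memory-growth API; I have not verified that it avoids the crash:

```python
import tensorflow as tf

# Must run before any TensorFlow op touches the GPU: make TensorFlow grow
# its GPU allocation on demand instead of reserving all memory up front.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

from sionna.rt import load_scene
scene = load_scene()  # empty scene, just to check whether the kernel survives
```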

Will there be a solution to this problem?
