Issue after GPU memory corruption #827

Open
@iconoclasts

Description

Hello! I was working with Sionna v1.0.1 and ran into a problem after seeing the error "GPU memory is corrupted" (unfortunately, I didn't capture that message). Although I am going to reflash the server and reconfigure the NVIDIA driver and CUDA, I would like to report the issue.

After I saw the message "GPU memory is corrupted", I rebooted the server and tried `nvidia-smi`. It turned out that the drivers were not working, so I reinstalled them. After that, both `nvidia-smi` and `nvcc --version` worked fine (TensorFlow also had no problem detecting the two GPUs).

[Screenshots of the `nvidia-smi` and `nvcc --version` output]

However, the kernel always dies as soon as the code calls the `load_scene` function. I also created another virtual environment, but the kernel still dies at the same cell. When I run the first cell, which imports the packages, it shows this message.

jit_find_library(): Unable to load "/usr/lib/x86_64-linux-gnu": /usr/lib/x86_64-linux-gnu: cannot read file data: Is a directory!
2025-04-08 19:17:54.809230: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-04-08 19:17:55.093570: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1744139875.233963   25330 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1744139875.277925   25330 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1744139875.556433   25330 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1744139875.556445   25330 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1744139875.556448   25330 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1744139875.556450   25330 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-04-08 19:17:55.584220: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Num GPUs Available 2

The second cell only calls the `load_scene` function, but the kernel always dies in this second cell (a minimal sketch of the two cells follows the log below). This is what I get in the Jupyter output.

19:25:05.399 [error] Disposing session as kernel process died ExitCode: undefined, Reason: 2025-04-08 19:24:54.516322: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-04-08 19:24:54.530610: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1744140294.547673   26032 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1744140294.553177   26032 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1744140294.566128   26032 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1744140294.566140   26032 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1744140294.566142   26032 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1744140294.566143   26032 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-04-08 19:24:54.570009: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
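
For reference, the notebook boils down to the following two cells. This is a minimal sketch: my notebook loads an actual scene file, and `load_scene()` without an argument (which loads an empty scene) is only used here for illustration.

```python
# Cell 1: imports -- this cell runs, but prints the warnings shown above
import tensorflow as tf
from sionna.rt import load_scene

print("Num GPUs Available", len(tf.config.list_physical_devices('GPU')))
```

```python
# Cell 2: the kernel dies as soon as this executes
scene = load_scene()  # in my notebook this points to a scene file
```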

It seems like both TensorFlow and Mitsuba (or Dr.Jit) are trying to acquire the GPU and clash.
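
One thing I plan to test is asking TensorFlow to allocate GPU memory on demand before Sionna RT (Mitsuba/Dr.Jit) initializes its own CUDA context, so the two frameworks are less likely to fight over the device. This is only a sketch of the idea, using the standard `tf.config` memory-growth API; I have not verified that it avoids the crash:

```python
import tensorflow as tf

# Must run before any TensorFlow op touches the GPU: make TensorFlow grow
# its GPU allocation on demand instead of reserving all memory up front.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

from sionna.rt import load_scene
scene = load_scene()  # empty scene, just to check whether the kernel survives
```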

Will there be a solution to this problem?
