
fix: query GPU memory with torch first #2349

Open
he-yufeng wants to merge 1 commit into recommenders-team:main from he-yufeng:fix/gpu-info-torch-main

Conversation

@he-yufeng

Contributor

Summary

  • prefer torch.cuda.mem_get_info() for get_gpu_info() when PyTorch can see CUDA devices
  • keep the existing numba path as a fallback for environments without torch
  • add a regression test that fails if the torch-visible CUDA path still touches numba's CUDA context
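The torch-first ordering summarized above can be sketched as follows. This is a minimal illustration, not the code in this PR: the function names `gpu_info_sketch` and `_gpu_info_from_numba`, the optional `torch_mod` parameter, and the dictionary keys and MB units are assumptions made for the example. The numba fallback uses `cuda.gpus` and `current_context().get_memory_info()`, which may differ from what the repository actually does.

```python
BYTES_PER_MB = 1024 * 1024  # assumed unit for reported memory


def _load_torch():
    """Return the torch module if it is importable, else None."""
    try:
        import torch
        return torch
    except ImportError:
        return None


def _gpu_info_from_numba():
    """Fallback path: query memory through numba's CUDA context (assumed shape)."""
    try:
        from numba import cuda
        from numba.cuda.cudadrv.error import CudaSupportError
    except ImportError:
        return []
    gpus = []
    try:
        for gpu in cuda.gpus:
            with gpu:  # activates a numba CUDA context for this device
                free_b, total_b = cuda.current_context().get_memory_info()
                gpus.append({
                    "device_name": gpu.name.decode("ASCII"),
                    "total_memory": total_b / BYTES_PER_MB,
                    "free_memory": free_b / BYTES_PER_MB,
                })
    except CudaSupportError:
        pass  # no usable CUDA driver or device
    return gpus


def gpu_info_sketch(torch_mod=None):
    """Prefer torch when it can see CUDA devices; otherwise fall back to numba."""
    torch_mod = torch_mod if torch_mod is not None else _load_torch()
    if torch_mod is not None and torch_mod.cuda.is_available():
        gpus = []
        for idx in range(torch_mod.cuda.device_count()):
            # mem_get_info returns (free_bytes, total_bytes) for the device
            free_b, total_b = torch_mod.cuda.mem_get_info(idx)
            gpus.append({
                "device_name": torch_mod.cuda.get_device_name(idx),
                "total_memory": total_b / BYTES_PER_MB,
                "free_memory": free_b / BYTES_PER_MB,
            })
        return gpus
    return _gpu_info_from_numba()
```

Passing `torch_mod` explicitly is only there to make the sketch easy to exercise with a stand-in object; calling `gpu_info_sketch()` with no arguments uses whatever torch and numba are installed.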

Why

Issue #2344 shows get_gpu_info() can trip a numba CUDA context error while collecting GPU memory. get_number_gpus() already tries torch first, so this makes get_gpu_info() follow the same safer path and avoids creating a numba context when torch already has the data.

To verify

  • python -m py_compile recommenders\utils\gpu_utils.py tests\unit\recommenders\utils\test_gpu_utils.py
  • .\.venv312\Scripts\python.exe -m pytest tests\unit\recommenders\utils\test_gpu_utils.py::test_get_gpu_info_uses_torch_cuda tests\unit\recommenders\utils\test_gpu_utils.py::test_get_number_gpus_without_torch -q
  • git diff --cached --check before commit
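For readers without the test file at hand, the idea behind a check like `test_get_gpu_info_uses_torch_cuda` might look roughly like the sketch below. It is an assumption-laden stand-in rather than the PR's test: `get_gpu_info_sketch` and its `numba_query` parameter are hypothetical, introduced so the example runs without torch, numba, or a GPU. The numba path is replaced by a mock that raises if called, so the test fails whenever the torch-visible path still reaches it.

```python
from unittest import mock


def get_gpu_info_sketch(torch_mod, numba_query):
    """Hypothetical stand-in for get_gpu_info(): torch first, numba as fallback."""
    if torch_mod is not None and torch_mod.cuda.is_available():
        mb = 1024 * 1024
        gpus = []
        for i in range(torch_mod.cuda.device_count()):
            free_b, total_b = torch_mod.cuda.mem_get_info(i)
            gpus.append({
                "device_name": torch_mod.cuda.get_device_name(i),
                "total_memory": total_b / mb,
                "free_memory": free_b / mb,
            })
        return gpus
    return numba_query()


def test_torch_path_does_not_touch_numba():
    # Fake torch that reports one CUDA device with 2 MB free of 8 MB total.
    fake_torch = mock.MagicMock()
    fake_torch.cuda.is_available.return_value = True
    fake_torch.cuda.device_count.return_value = 1
    fake_torch.cuda.get_device_name.return_value = "Fake GPU"
    fake_torch.cuda.mem_get_info.return_value = (2 * 1024**2, 8 * 1024**2)

    # Any call into the numba path raises, which would fail the test.
    numba_query = mock.Mock(
        side_effect=AssertionError("numba CUDA context was touched")
    )

    info = get_gpu_info_sketch(fake_torch, numba_query)

    assert info == [
        {"device_name": "Fake GPU", "total_memory": 8.0, "free_memory": 2.0}
    ]
    numba_query.assert_not_called()
```

In the real repository the same effect would more likely be achieved by patching the module's torch and numba references with pytest's `monkeypatch`; the explicit parameters here only keep the example self-contained.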

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
@he-yufeng he-yufeng force-pushed the fix/gpu-info-torch-main branch from f149503 to c063d44 Compare June 12, 2026 06:48
@he-yufeng

Contributor Author

I checked the red GPU shards. The CPU/Spark groups passed, and the GPU jobs failed before any tests ran because the external GPU worker setup could not provide a usable device: repeated cloud allocation errors are followed by "No devices were found" and exit code 6. This does not appear to be caused by the code change. Could the failed GPU jobs be rerun when capacity is available?
