fix: query GPU memory with torch first by he-yufeng · Pull Request #2349 · recommenders-team/recommenders

he-yufeng · 2026-06-12T06:43:39Z

Summary

- prefer torch.cuda.mem_get_info() for get_gpu_info() when PyTorch can see CUDA devices
- keep the existing numba path as a fallback for environments without torch
- add a regression test that fails if the torch-visible CUDA path still touches numba's CUDA context

Why

Issue #2344 shows get_gpu_info() can trip a numba CUDA context error while collecting GPU memory. get_number_gpus() already tries torch first, so this makes get_gpu_info() follow the same safer path and avoids creating a numba context when torch already has the data.
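
For readers skimming the thread, the torch-first ordering described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the function name gpu_info_torch_first, the injectable torch_module and numba_fallback parameters, and the dictionary keys are hypothetical and exist only to keep the sketch self-contained. The only real library calls are torch.cuda.is_available(), torch.cuda.device_count(), torch.cuda.get_device_name() and torch.cuda.mem_get_info(), which returns (free, total) in bytes.

```python
BYTES_PER_MB = 1024 * 1024


def gpu_info_torch_first(torch_module=None, numba_fallback=None):
    """Hypothetical sketch: ask torch for GPU memory, fall back only if needed.

    torch_module: an already-imported torch (or a stand-in for tests).
    numba_fallback: zero-argument callable wrapping the numba-based path.
    Both parameters are assumptions made for this illustration.
    """
    if torch_module is None:
        try:
            import torch as torch_module  # optional dependency
        except ImportError:
            torch_module = None

    if torch_module is not None and torch_module.cuda.is_available():
        gpus = []
        for idx in range(torch_module.cuda.device_count()):
            # mem_get_info returns (free_bytes, total_bytes) for the device
            free_b, total_b = torch_module.cuda.mem_get_info(idx)
            gpus.append(
                {
                    "device_name": torch_module.cuda.get_device_name(idx),
                    "total_memory": total_b / BYTES_PER_MB,
                    "free_memory": free_b / BYTES_PER_MB,
                }
            )
        # Return here so the numba CUDA context is never created
        return gpus

    # torch is missing or sees no CUDA device: use the numba path if given
    if numba_fallback is not None:
        return numba_fallback()
    return []
```

Because the torch branch returns early, the fallback callable is never invoked when torch already has the data, which is the behavior the regression test is meant to pin down.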

To verify

python -m py_compile recommenders\utils\gpu_utils.py tests\unit\recommenders\utils\test_gpu_utils.py
.\.venv312\Scripts\python.exe -m pytest tests\unit\recommenders\utils\test_gpu_utils.py::test_get_gpu_info_uses_torch_cuda tests\unit\recommenders\utils\test_gpu_utils.py::test_get_number_gpus_without_torch -q
git diff --cached --check (run before committing)
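
The idea behind the regression test can be illustrated with unittest.mock. The sketch below is not the test added in this PR: get_gpu_info_sketch is a hypothetical stand-in for the function under test, and both modules are passed in as mocks so the example runs without torch, numba, or a GPU.

```python
from unittest import mock

BYTES_PER_MB = 1024 * 1024


def get_gpu_info_sketch(torch_mod, numba_cuda):
    """Hypothetical stand-in for the function under test."""
    if torch_mod.cuda.is_available():
        free_b, total_b = torch_mod.cuda.mem_get_info(0)
        return [
            {
                "free_memory": free_b / BYTES_PER_MB,
                "total_memory": total_b / BYTES_PER_MB,
            }
        ]
    # Fallback path: this is the call the test must prove is skipped
    free_b, total_b = numba_cuda.current_context().get_memory_info()
    return [
        {
            "free_memory": free_b / BYTES_PER_MB,
            "total_memory": total_b / BYTES_PER_MB,
        }
    ]


def test_torch_path_never_touches_numba():
    torch_mod = mock.MagicMock()
    torch_mod.cuda.is_available.return_value = True
    torch_mod.cuda.mem_get_info.return_value = (BYTES_PER_MB, 4 * BYTES_PER_MB)

    numba_cuda = mock.MagicMock()
    # Any attempt to create a numba context fails the test immediately
    numba_cuda.current_context.side_effect = AssertionError("numba context touched")

    result = get_gpu_info_sketch(torch_mod, numba_cuda)

    assert result == [{"free_memory": 1.0, "total_memory": 4.0}]
    numba_cuda.current_context.assert_not_called()
```

Making the numba call raise, in addition to asserting it was not called, means the test fails loudly even if a later refactor swallows the call's return value.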

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>

he-yufeng · 2026-06-12T09:22:40Z

I checked the red GPU shards. The CPU/Spark groups passed, and the GPU jobs fail before tests run because the external GPU worker setup cannot provide a usable device: repeated cloud allocation errors are followed by "No devices were found" and exit code 6. This does not appear to be caused by the code change. Could the failed GPU jobs be rerun when capacity is available?

he-yufeng requested review from SimonYansenZhao, anargyri, loomlike, miguelgfierro and wav8k as code owners June 12, 2026 06:43

fix: query GPU memory with torch first

c063d44

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>

he-yufeng force-pushed the fix/gpu-info-torch-main branch from f149503 to c063d44 on June 12, 2026 06:48

he-yufeng wants to merge 1 commit into recommenders-team:main from he-yufeng:fix/gpu-info-torch-main
