
[fix] replacing torch.cuda.set_device with CUDA_VISIBLE_DEVICES #85


Merged · 2 commits merged into main on Apr 2, 2025

Conversation

@oandreeva-nv (Contributor) commented on Mar 21, 2025

In my experiments with loading multiple models onto different GPUs, torch.cuda.set_device wasn't behaving as a GPU "router", so this change uses CUDA_VISIBLE_DEVICES instead (a sketch of the approach follows after the test results below).
Before the fix, GPU utilization for the tp=1, count=2 test:

Test Matrix: model='vllm_opt_KIND_GPU_tp1_count2', kind='KIND_GPU', tp='1', instance_count='2'


=============== Before Loading vLLM Model ===============
GPU 0 Memory Utilization: 1426849792 bytes
GPU 1 Memory Utilization: 1426849792 bytes
=============== After Loading vLLM Model ===============
GPU 0 Memory Utilization: 44708986880 bytes
GPU 1 Memory Utilization: 1866727424 bytes

With this fix:

Test Matrix: model='vllm_opt_KIND_GPU_tp1_count2', kind='KIND_GPU', tp='1', instance_count='2'


=============== Before Loading vLLM Model ===============
GPU 0 Memory Utilization: 1426849792 bytes
GPU 1 Memory Utilization: 1426849792 bytes
=============== After Loading vLLM Model ===============
GPU 0 Memory Utilization: 43918819328 bytes
GPU 1 Memory Utilization: 43916722176 bytes
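
For reference, per-GPU used-memory numbers like the ones above can be gathered with pynvml; this is only an illustrative sketch, not necessarily how the test itself measures utilization:

```python
import pynvml


def gpu_memory_used_bytes():
    """Return the used-memory byte count for each visible GPU."""
    pynvml.nvmlInit()
    try:
        return [
            pynvml.nvmlDeviceGetMemoryInfo(
                pynvml.nvmlDeviceGetHandleByIndex(i)
            ).used
            for i in range(pynvml.nvmlDeviceGetCount())
        ]
    finally:
        pynvml.nvmlShutdown()


for i, used in enumerate(gpu_memory_used_bytes()):
    print(f"GPU {i} Memory Utilization: {used} bytes")
```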
  • Also tested with multiple TP=1 models and different GPU ids set in config.pbtxt on a cluster.
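
For readers outside the thread, a minimal sketch of what the change amounts to in the backend's model.py. The args key names (model_instance_kind, model_instance_device_id) and the surrounding structure are assumptions for illustration, not the exact diff:

```python
import os

import triton_python_backend_utils as pb_utils  # provided by Triton's Python backend


class TritonPythonModel:
    def initialize(self, args):
        # Sketch only. For a KIND_GPU instance, Triton's Python backend reports
        # which GPU the instance was assigned to. Restricting visibility to that
        # GPU must happen before torch / vLLM create a CUDA context in this
        # process; calling torch.cuda.set_device() later does not re-route
        # allocations a library has already pinned to GPU 0.
        device_id = args.get("model_instance_device_id")  # assumed key name
        if args.get("model_instance_kind") == "GPU" and device_id is not None:
            os.environ["CUDA_VISIBLE_DEVICES"] = str(device_id)
        # ... build the vLLM engine here; it now only sees the assigned GPU,
        # so instance 0 lands on GPU 0 and instance 1 on GPU 1 ...
```

Because each instance restricts its own process to a single device, every instance sees its assigned GPU as device 0, which is what produces the balanced memory numbers in the second test run above.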

@oandreeva-nv requested a review from kthui on Mar 21, 2025
@oandreeva-nv merged commit b40fca4 into main on Apr 2, 2025
3 checks passed
@oandreeva-nv deleted the oandreeva_torch_fix branch on Apr 2, 2025