🐛 Bug
The `DeviceRequest` capabilities in the Docker scheduler should be set to "gpu", not "compute":
https://github.com/pytorch/torchx/blob/main/torchx/schedulers/docker_scheduler.py#L308
c.kwargs["device_requests"] = [
    DeviceRequest(
        count=resource.gpu,
        capabilities=[["compute"]],
    )
]
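A minimal sketch of the proposed change, assuming the Docker SDK for Python (`docker.types.DeviceRequest`) is what the scheduler uses here. The helper name `gpu_device_requests` is hypothetical and not part of torchx; the fallback stand-in exists only so the snippet runs without the `docker` package installed, and it mirrors the dictionary shape the SDK produces.

```python
from typing import Any, Dict, List

try:
    # Real type from the Docker SDK for Python (pip install docker).
    from docker.types import DeviceRequest
except ImportError:
    # Stand-in (assumption) used only when the SDK is missing; it mimics
    # the dict that docker.types.DeviceRequest builds for the Engine API.
    def DeviceRequest(*, count: int, capabilities: List[List[str]]) -> Dict[str, Any]:
        return {
            "Driver": "",
            "Count": count,
            "DeviceIDs": [],
            "Capabilities": capabilities,
            "Options": {},
        }


def gpu_device_requests(gpu_count: int) -> list:
    """Hypothetical helper: build device requests for `gpu_count` GPUs."""
    if gpu_count <= 0:
        # No GPUs requested, so no device request is needed.
        return []
    return [
        DeviceRequest(
            count=gpu_count,
            # "gpu" instead of "compute", as proposed in this issue.
            capabilities=[["gpu"]],
        )
    ]
```

In the scheduler this would correspond to `c.kwargs["device_requests"] = gpu_device_requests(resource.gpu)`, or simply to replacing `"compute"` with `"gpu"` in the existing call.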
Module (check all that applies):
- [ ] torchx.spec
- [ ] torchx.component
- [ ] torchx.apps
- [ ] torchx.runtime
- [ ] torchx.cli
- [x] torchx.schedulers
- [ ] torchx.pipelines
- [ ] torchx.aws
- [ ] torchx.examples
- [ ] other
To Reproduce
Steps to reproduce the behavior:
- Start any container with the local_docker scheduler on a machine with an NVIDIA GPU.
- Run nvidia-smi inside the container. The container does not detect the GPU, and the startup log ends with "Failed to detect NVIDIA driver version.":
pretrain/0
pretrain/0 =============
pretrain/0 == PyTorch ==
pretrain/0 =============
pretrain/0
pretrain/0 NVIDIA Release 23.12 (build 76438008)
pretrain/0 PyTorch Version 2.2.0a0+81ea7a4
pretrain/0
pretrain/0 Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
pretrain/0
pretrain/0 Copyright (c) 2014-2023 Facebook Inc.
pretrain/0 Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
pretrain/0 Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
pretrain/0 Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
pretrain/0 Copyright (c) 2011-2013 NYU (Clement Farabet)
pretrain/0 Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
pretrain/0 Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
pretrain/0 Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
pretrain/0 Copyright (c) 2015 Google Inc.
pretrain/0 Copyright (c) 2015 Yangqing Jia
pretrain/0 Copyright (c) 2013-2016 The Caffe contributors
pretrain/0 All rights reserved.
pretrain/0
pretrain/0 Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
pretrain/0
pretrain/0 This container image and its contents are governed by the NVIDIA Deep Learning Container License.
pretrain/0 By pulling and using the container, you accept the terms and conditions of this license:
pretrain/0 https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
pretrain/0
pretrain/0 Failed to detect NVIDIA driver version.
Expected behavior
If the device capability is set to "gpu", the GPU devices should be visible inside the container and the NVIDIA driver should be detected.
After changing "compute" to "gpu", the container works as expected:
pretrain/0
pretrain/0 =============
pretrain/0 == PyTorch ==
pretrain/0 =============
pretrain/0
pretrain/0 NVIDIA Release 23.12 (build 76438008)
pretrain/0 PyTorch Version 2.2.0a0+81ea7a4
pretrain/0
pretrain/0 Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
pretrain/0
pretrain/0 Copyright (c) 2014-2023 Facebook Inc.
pretrain/0 Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
pretrain/0 Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
pretrain/0 Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
pretrain/0 Copyright (c) 2011-2013 NYU (Clement Farabet)
pretrain/0 Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
pretrain/0 Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
pretrain/0 Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
pretrain/0 Copyright (c) 2015 Google Inc.
pretrain/0 Copyright (c) 2015 Yangqing Jia
pretrain/0 Copyright (c) 2013-2016 The Caffe contributors
pretrain/0 All rights reserved.
pretrain/0
pretrain/0 Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
pretrain/0
pretrain/0 This container image and its contents are governed by the NVIDIA Deep Learning Container License.
pretrain/0 By pulling and using the container, you accept the terms and conditions of this license:
pretrain/0 https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
pretrain/0
pretrain/0 NOTE: CUDA Forward Compatibility mode ENABLED.
pretrain/0 Using CUDA 12.3 driver version 545.23.08 with kernel driver version 535.129.03.
pretrain/0 See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
pretrain/0
Environment
- torchx version (e.g. 0.1.0rc1): 0.6.0
- Python version: 3.10
- OS (e.g., Linux): AL2
- How you installed torchx (conda, pip, source, docker): pip
- Docker image and tag (if using docker):
- Git commit (if installed from source):
- Execution environment (on-prem, AWS, GCP, Azure etc):
- Any other relevant information:
Additional context