
Fail loudly for NeMo Curator Dask-Cuda cluster creation CUDA context issues #675


Open
VibhuJawa wants to merge 9 commits into base: main

Conversation


@VibhuJawa commented Apr 18, 2025

Description

This PR fixes https://github.com/NVIDIA/NeMo-Curator/pull/61/files by ensuring we always have CUDA contexts spread across multiple GPUs.

Local Test to verify this:

#!/usr/bin/env python3
"""
Test to ensure that we fail if we oversubscribe GPUs and deviate from the one-GPU-per-worker model
"""
import time
from nemo_curator.pii.algorithm import PiiDeidentifier
from nemo_curator import get_client, __version__ as curator_version

def main() -> None:
    print(f"NeMo Curator version: {curator_version}")
    client = get_client(cluster_type="gpu", rmm_pool_size="1GB")
    time.sleep(3)  # give workers time to register

if __name__ == "__main__":
    main()

With PR:

NeMo Curator version: 0.9.0rc0.dev0
cuDF Spilling is enabled
Traceback (most recent call last):
  File "/home/nfs/vjawa/NeMo-Curator/tests/test_cluster_cuda.py", line 15, in <module>
    main()
  File "/home/nfs/vjawa/NeMo-Curator/tests/test_cluster_cuda.py", line 11, in main
    client = get_client(cluster_type="gpu", rmm_pool_size="1GB")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nfs/vjawa/NeMo-Curator/nemo_curator/utils/distributed_utils.py", line 344, in get_client
    _assert_unique_gpu_per_host(client)
  File "/home/nfs/vjawa/NeMo-Curator/nemo_curator/utils/distributed_utils.py", line 147, in _assert_unique_gpu_per_host
    raise RuntimeError(report)
RuntimeError: Duplicate GPU assignment detected!

Host: dgx11  (total workers: 8)
  GPU 0: 8 workers
Each worker on a host must own a distinct GPU.

Without PR (No error/warnings are raised):

NeMo Curator version: 0.9.0rc0.dev0
cuDF Spilling is enabled

But nvidia-smi shows all eight worker processes attached to GPU 0:

vjawa@dgx11:~/NeMo-Curator$ nvidia-smi
Fri Apr 25 10:55:19 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-32GB           On  | 00000000:06:00.0 Off |                    0 |
| N/A   32C    P0              57W / 300W |  10133MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2-32GB           On  | 00000000:07:00.0 Off |                    0 |
| N/A   33C    P0              43W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2-32GB           On  | 00000000:0A:00.0 Off |                    0 |
| N/A   31C    P0              42W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2-32GB           On  | 00000000:0B:00.0 Off |                    0 |
| N/A   29C    P0              41W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2-32GB           On  | 00000000:85:00.0 Off |                    0 |
| N/A   31C    P0              42W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2-32GB           On  | 00000000:86:00.0 Off |                    0 |
| N/A   31C    P0              42W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2-32GB           On  | 00000000:89:00.0 Off |                    0 |
| N/A   34C    P0              43W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2-32GB           On  | 00000000:8A:00.0 Off |                    0 |
| N/A   30C    P0              43W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   2093974      C   ...o_curator_nightly_25_04/bin/python3     1266MiB |
|    0   N/A  N/A   2093976      C   ...o_curator_nightly_25_04/bin/python3     1266MiB |
|    0   N/A  N/A   2093982      C   ...o_curator_nightly_25_04/bin/python3     1266MiB |
|    0   N/A  N/A   2093984      C   ...o_curator_nightly_25_04/bin/python3     1266MiB |
|    0   N/A  N/A   2093989      C   ...o_curator_nightly_25_04/bin/python3     1266MiB |
|    0   N/A  N/A   2093995      C   ...o_curator_nightly_25_04/bin/python3     1266MiB |
|    0   N/A  N/A   2093997      C   ...o_curator_nightly_25_04/bin/python3     1266MiB |
|    0   N/A  N/A   2094001      C   ...o_curator_nightly_25_04/bin/python3     1266MiB |

@VibhuJawa requested a review from Copilot April 18, 2025 20:51
@VibhuJawa marked this pull request as ready for review April 18, 2025 20:51

Copilot AI left a comment

Pull Request Overview

This PR reverts prior changes and refactors GPU cluster client creation to ensure proper CUDA context usage across multiple GPUs. The key changes include:

  • Introducing functions (_worker_gpu_tuple and _assert_unique_gpu_per_host) to verify unique GPU assignment on each host.
  • Adjusting the client initialization flow to perform GPU uniqueness checks for "gpu" clusters.
  • Tidying up import comments in the modules initializer and modifying the PII deidentifier import for clarity.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

  • nemo_curator/utils/distributed_utils.py: Adds GPU worker functions and integrates a uniqueness assertion check.
  • nemo_curator/modules/__init__.py: Removes duplicate PyTorch-related import comments and clarifies order.
  • nemo_curator/modifiers/pii_modifier.py: Changes the PiiDeidentifier import to a forward reference with a linter comment.
Comments suppressed due to low confidence (1)

nemo_curator/modifiers/pii_modifier.py:88

  • [nitpick] If not hindered by circular dependency issues, consider importing 'PiiDeidentifier' directly to avoid the need for a forward reference and linter suppression.
def load_deidentifier(self) -> "PiiDeidentifier":  # noqa: F821
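
For readers skimming the thread, here is a rough sketch of what a per-host GPU uniqueness check of this kind can look like. The function name and the error text follow the traceback in the PR description, but the body is an illustrative reconstruction rather than the merged code, and it assumes the companion helper _worker_gpu_tuple returns a (hostname, gpu_index) pair per worker.

from collections import defaultdict

from dask.distributed import Client


def _assert_unique_gpu_per_host(client: Client) -> None:
    """Raise if two Dask workers on the same host report the same GPU."""
    # _worker_gpu_tuple is the helper added alongside this check; it is
    # assumed here to return (hostname, gpu_index) after touching the GPU.
    worker_info = client.run(_worker_gpu_tuple)  # {worker_address: (host, gpu)}

    gpus_per_host: dict[str, list[int]] = defaultdict(list)
    for host, gpu_index in worker_info.values():
        gpus_per_host[host].append(gpu_index)

    report_lines = []
    for host, gpus in gpus_per_host.items():
        duplicates = sorted({g for g in gpus if gpus.count(g) > 1})
        if duplicates:
            report_lines.append(f"Host: {host}  (total workers: {len(gpus)})")
            report_lines.extend(f"  GPU {g}: {gpus.count(g)} workers" for g in duplicates)

    if report_lines:
        report = "Duplicate GPU assignment detected!\n\n" + "\n".join(report_lines)
        report += "\nEach worker on a host must own a distinct GPU."
        raise RuntimeError(report)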

@VibhuJawa added the gpuci (Run GPU CI/CD on PR) label Apr 18, 2025
@VibhuJawa requested a review from Copilot April 21, 2025 20:40

Copilot AI left a comment

Pull Request Overview

This PR reverts previous changes and fixes CUDA context issues by ensuring unique GPU allocation across workers in the Dask-CUDA cluster while also cleaning up duplicate comments in the module initializer.

  • Added helper functions _worker_gpu_tuple and _assert_unique_gpu_per_host to verify unique GPU assignments per host.
  • Updated get_client to invoke the uniqueness check for GPU clusters.
  • Removed duplicate comments in nemo_curator/modules/__init__.py.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

  • nemo_curator/utils/distributed_utils.py: Added functions to retrieve worker GPU info and enforce GPU uniqueness; updated client return logic.
  • nemo_curator/modules/__init__.py: Removed duplicate comments regarding PyTorch and cuGraph import ordering.
Comments suppressed due to low confidence (1)

nemo_curator/utils/distributed_utils.py:103

  • The newly added code calls warnings.warn but there is no explicit import of the warnings module in this file. Please ensure that 'import warnings' is added to avoid runtime errors.
warnings.warn(f"NVML error occurred: {e} while verifying GPU index", stacklevel=2)

@VibhuJawa changed the title from "Fix NeMo Curator Cluster Creation Cuda context issues" to "Fail loudly for NeMo Curator Dask-Cuda cluster creation CUDA context issues" Apr 25, 2025
@VibhuJawa requested a review from Copilot April 25, 2025 17:52

Copilot AI left a comment

Pull Request Overview

This PR reverts previous changes to regain the ruff formatting while introducing a check to ensure that each Dask worker on a given host owns a unique GPU. Key changes include:

  • Adding a new helper (_worker_gpu_tuple) to initialize the CUDA context on workers.
  • Implementing _assert_unique_gpu_per_host to verify unique GPU assignment on a host.
  • Minor reordering of import comments in __init__.py.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

  • nemo_curator/utils/distributed_utils.py: Introduces GPU context initialization and duplicate GPU assignment checks.
  • nemo_curator/modules/__init__.py: Removes redundant PyTorch import comments.
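
Based on the diff lines quoted in the review thread below and the NVML warning flagged in the previous review, the worker-side helper plausibly looks something like the sketch here. Everything beyond the cp.cuda.runtime.getDevice() call (the PCI bus ID lookup, the hostname, the return shape) is an assumption for illustration, not the merged code.

import socket
import warnings

import cupy as cp
import pynvml


def _worker_gpu_tuple() -> tuple[str, int]:
    """Return (hostname, host-global GPU index) for this worker's GPU."""
    # Touch the GPU so a context is created (idempotent if one already exists)
    cp.cuda.runtime.getDevice()

    gpu_index = -1
    try:
        # Map the worker-local device to a host-global index via its PCI bus ID,
        # so CUDA_VISIBLE_DEVICES masking cannot hide two workers sharing one GPU.
        pci_bus_id = cp.cuda.Device().pci_bus_id
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByPciBusId(pci_bus_id.encode())
        gpu_index = pynvml.nvmlDeviceGetIndex(handle)
        pynvml.nvmlShutdown()
    except pynvml.NVMLError as e:
        warnings.warn(f"NVML error occurred: {e} while verifying GPU index", stacklevel=2)

    return socket.gethostname(), gpu_index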

@praateekmahajan left a comment

LGTM!

"""

# Touch the GPU so a context is created (idempotent if one already exists)
cp.cuda.runtime.getDevice()


This does not create a CUDA context; you can verify that by watching nvidia-smi while you launch a process that calls this. We canonically use numba.cuda.current_context() in Dask, and I would suggest the same because it is already tested and known to work without any side effects.
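
For reference, here is a minimal, self-contained illustration of the suggested alternative (not code from this PR): numba.cuda.current_context() returns the context for the currently selected device and creates one if none exists yet, so the calling process shows up in nvidia-smi.

from numba import cuda

# Unlike a plain runtime query, this creates a CUDA context on the current
# device if one does not already exist (and is a no-op otherwise), making
# the process visible in nvidia-smi.
cuda.current_context()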

Labels
gpuci Run GPU CI/CD on PR
3 participants