Fix NeMo Curator Cluster Creation Cuda context issues #675
base: main
Conversation
Signed-off-by: Vibhu Jawa <[email protected]>
Pull Request Overview
This PR reverts prior changes and refactors GPU cluster client creation to ensure proper CUDA context usage across multiple GPUs. The key changes include:
- Introducing functions (_worker_gpu_tuple and _assert_unique_gpu_per_host) to verify unique GPU assignment on each host.
- Adjusting the client initialization flow to perform GPU uniqueness checks for "gpu" clusters.
- Tidying up import comments in the modules initializer and modifying the PII deidentifier import for clarity.
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
File | Description
---|---
nemo_curator/utils/distributed_utils.py | Adds GPU worker functions and integrates a uniqueness assertion check
nemo_curator/modules/__init__.py | Removes duplicate PyTorch-related import comments and clarifies order
nemo_curator/modifiers/pii_modifier.py | Changes the PiiDeidentifier import to a forward reference with a linter comment
Comments suppressed due to low confidence (1)
nemo_curator/modifiers/pii_modifier.py:88
- [nitpick] If not hindered by circular dependency issues, consider importing 'PiiDeidentifier' directly to avoid the need for a forward reference and linter suppression.
def load_deidentifier(self) -> "PiiDeidentifier": # noqa: F821
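For reference, one common way to keep the import lazy while dropping the forward-reference suppression is a TYPE_CHECKING-guarded import. This is a minimal sketch, not Curator's actual layout; the module path and the surrounding class are assumptions:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by static type checkers, so there is no import cycle at runtime.
    # The module path below is assumed for illustration.
    from nemo_curator.pii.algorithm import PiiDeidentifier


class PiiModifier:  # stand-in for the real modifier class
    def load_deidentifier(self) -> "PiiDeidentifier":
        # The string annotation resolves for type checkers via the guarded import,
        # so the undefined-name suppression (# noqa: F821) is no longer needed.
        ...
```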
Signed-off-by: Vibhu Jawa <[email protected]>
info = client.run(_worker_gpu_tuple)  # {worker_addr: (host, gpu)}
per_host = defaultdict(list)
for host, gpu in info.values():
    per_host[host].append(gpu)

dups = {h: [g for g in set(gs) if gs.count(g) > 1] for h, gs in per_host.items() if len(gs) != len(set(gs))}
if dups:
    raise RuntimeError(
        "Duplicate GPU assignment detected on host(s): " + ", ".join(f"{h}: {dups[h]}" for h in dups)
    )
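For context, a minimal sketch of the worker-side helper this snippet calls; the real implementation in the PR may differ, and the reliance on `CUDA_VISIBLE_DEVICES` and `socket.gethostname()` here is an assumption:

```python
import os
import socket


def _worker_gpu_tuple() -> tuple[str, str]:
    """Report (hostname, first visible GPU) from inside a Dask worker.

    Sketch only: assumes each dask-cuda worker pins its device through
    CUDA_VISIBLE_DEVICES, which is dask-cuda's default behaviour.
    """
    host = socket.gethostname()
    gpu = os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")[0]
    return host, gpu


# From the client process, client.run(_worker_gpu_tuple) yields a mapping like
# {worker_addr: ("node-01", "0"), ...}, which the duplicate check above folds
# into per-host GPU lists.
```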
Hi @pentschev,
This is our current approach to check whether the CUDA context is unique across GPUs. It’s a recurring issue in Curator, as third-party libraries often interfere, and pinpointing the root cause is tricky.
Would love to hear your thoughts on whether a fix like this seems reasonable.
What do you mean by "whether the CUDA context is unique across GPUs"? Are you trying to check whether there's only one CUDA context per GPU? If so, "fixing" is not possible: you cannot guarantee that some library won't create another context you were not expecting; what you can do is warn about it. In Dask, we have a function to check whether a context exists, and we use it like this:
1. Get the expected device of the current worker;
2. Check whether a context has already been created;
   a. If a context already exists, it means some other library created a context we didn't expect, so we raise a warning;
3. Create the CUDA context and check where (on which GPU) the context was created;
   a. If the context was created on the wrong GPU, raise a warning.
I would suggest simply replicating the above; it is not straightforward to get all the corner cases right. I would also simply copy the relevant parts from https://github.com/dask/distributed/blob/main/distributed/diagnostics/nvml.py instead of rewriting them.
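For illustration, a rough sketch of that warn-only pattern using pynvml directly. It is loosely modeled on the Dask approach rather than copied from it; the helper name and the expected-device handling are assumptions:

```python
import os
import warnings

import pynvml


def _warn_on_unexpected_cuda_context(expected_device: int) -> None:
    """Warn if this process already holds a CUDA context, or holds one on the wrong GPU.

    Sketch only, loosely modeled on distributed/diagnostics/nvml.py; this is not
    the actual Dask implementation.
    """
    pynvml.nvmlInit()
    pid = os.getpid()
    try:
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            try:
                procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
            except pynvml.NVMLError:
                # Some devices/drivers do not support this query; skip them.
                continue
            if any(p.pid == pid for p in procs):
                if index != expected_device:
                    warnings.warn(
                        f"CUDA context for PID {pid} found on GPU {index}, "
                        f"but this worker expected GPU {expected_device}.",
                        stacklevel=2,
                    )
                else:
                    warnings.warn(
                        f"A CUDA context already exists on GPU {index} before worker "
                        "initialization; another library may have created it.",
                        stacklevel=2,
                    )
    finally:
        pynvml.nvmlShutdown()
```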
Thanks for replying.
> If so, "fixing" is not possible, you cannot guarantee some library won't create another context that you were not expecting, what you can do is warn about that.
Just for clarity, we have no intention of fixing it. All we want is a strict 1‑GPU‑per‑worker model and to catch any accidental over‑subscription early.
The other interesting thing is that we don't get a warning for spaCy/Hugging Face-related CUDA context problems. I think @ayushdg had done some investigation into this too.
> I would also simply copy the relevant parts from https://github.com/dask/distributed/blob/main/distributed/diagnostics/nvml.py
This makes sense; let me try to fetch the device IDs etc. using the code linked above. Thanks for that, it's helpful.
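A rough sketch of what fetching per-worker device identifiers could look like with pynvml; the helper name is hypothetical, and the assumption that `CUDA_VISIBLE_DEVICES` holds a single integer index may not hold on every setup:

```python
import os
import socket

import pynvml


def _worker_device_uuid() -> tuple[str, str]:
    """Return (hostname, GPU UUID) for the device this worker is pinned to.

    Hypothetical helper: assumes CUDA_VISIBLE_DEVICES contains a single
    integer index, as dask-cuda sets by default.
    """
    pynvml.nvmlInit()
    try:
        index = int(os.environ.get("CUDA_VISIBLE_DEVICES", "0").split(",")[0])
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        uuid = pynvml.nvmlDeviceGetUUID(handle)
        # Older pynvml versions return bytes, newer ones return str.
        if isinstance(uuid, bytes):
            uuid = uuid.decode()
        return socket.gethostname(), uuid
    finally:
        pynvml.nvmlShutdown()


# From the client process:
# uuids = client.run(_worker_device_uuid)  # {worker_addr: (host, uuid), ...}
```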
Signed-off-by: Vibhu Jawa <[email protected]>
Pull Request Overview
This PR reverts previous changes and fixes CUDA context issues by ensuring unique GPU allocation across workers in the Dask-CUDA cluster while also cleaning up duplicate comments in the module initializer.
- Added helper functions _worker_gpu_tuple and _assert_unique_gpu_per_host to verify unique GPU assignments per host.
- Updated get_client to invoke the uniqueness check for GPU clusters.
- Removed duplicate comments in nemo_curator/modules/__init__.py.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
File | Description
---|---
nemo_curator/utils/distributed_utils.py | Added functions to retrieve worker GPU info and enforce GPU uniqueness; updated client return logic.
nemo_curator/modules/__init__.py | Removed duplicate comments regarding PyTorch and cuGraph import ordering.
Comments suppressed due to low confidence (1)
nemo_curator/utils/distributed_utils.py:103
- The newly added code calls warnings.warn but there is no explicit import of the warnings module in this file. Please ensure that 'import warnings' is added to avoid runtime errors.
warnings.warn(f"NVML error occurred: {e} while verifying GPU index", stacklevel=2)
Description
This PR reverts to the changes made by @ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/61/files, which were later undone as part of the ruff changes.
We also fix https://github.com/NVIDIA/NeMo-Curator/pull/61/files by ensuring the CUDA contexts are always spread across multiple GPUs.
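For reference, a usage sketch of how the check would surface when creating a GPU client; the `get_client` parameters shown are assumptions based on the file this PR touches:

```python
import os

from nemo_curator.utils.distributed_utils import get_client


def _visible_devices() -> str | None:
    return os.environ.get("CUDA_VISIBLE_DEVICES")


# Hypothetical usage: with this PR, creating a "gpu" client also triggers the
# per-host GPU uniqueness assertion, failing fast on accidental over-subscription
# instead of silently stacking CUDA contexts on one device.
client = get_client(cluster_type="gpu")
print(client.run(_visible_devices))  # e.g. {worker_addr: "0", ...} -- one GPU per worker
```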