Fix NeMo Curator Cluster Creation Cuda context issues #675


Open · wants to merge 4 commits into main

Conversation

@VibhuJawa (Collaborator) commented on Apr 18, 2025

Description

This PR restores the changes made by @ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/61/files, which had been undone during the ruff cleanup.

It also fixes the issue addressed in https://github.com/NVIDIA/NeMo-Curator/pull/61/files by ensuring that CUDA contexts are always spread across multiple GPUs (one per worker).
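
For illustration only (this snippet is not part of the PR), the one-CUDA-context-per-GPU setup described above corresponds to the usual Dask-CUDA pattern of one worker per visible GPU:

# Illustrative sketch, not code from this PR: one Dask-CUDA worker per
# visible GPU, so each worker creates its CUDA context on a distinct device.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster()  # one worker per GPU in CUDA_VISIBLE_DEVICES
client = Client(cluster)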

@VibhuJawa VibhuJawa requested a review from Copilot April 18, 2025 20:51
@VibhuJawa VibhuJawa marked this pull request as ready for review April 18, 2025 20:51
@Copilot (Copilot AI, Contributor) left a comment

Pull Request Overview

This PR reverts prior changes and refactors GPU cluster client creation to ensure proper CUDA context usage across multiple GPUs. The key changes include:

  • Introducing helper functions (_worker_gpu_tuple and _assert_unique_gpu_per_host) to verify unique GPU assignment on each host (see the sketch below).
  • Adjusting the client initialization flow to perform the GPU uniqueness check for "gpu" clusters.
  • Tidying up import comments in the modules initializer and modifying the PII deidentifier import for clarity.
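
As a rough, hypothetical sketch (not the PR's actual implementation in nemo_curator/utils/distributed_utils.py), a _worker_gpu_tuple-style helper can run on each worker and report the host name plus the GPU that worker believes it owns:

# Hypothetical sketch of a _worker_gpu_tuple-style helper. It assumes the
# Dask-CUDA convention that each worker's assigned GPU is the first entry
# in its CUDA_VISIBLE_DEVICES.
import os
import socket


def _worker_gpu_tuple() -> tuple[str, str]:
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    gpu = visible.split(",")[0] if visible else "unknown"
    return socket.gethostname(), gpu

Collecting these tuples via client.run and grouping them per host is what the uniqueness check in the diff excerpt below builds on.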

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File | Description
nemo_curator/utils/distributed_utils.py | Adds GPU worker functions and integrates a uniqueness assertion check
nemo_curator/modules/__init__.py | Removes duplicate PyTorch-related import comments and clarifies ordering
nemo_curator/modifiers/pii_modifier.py | Changes the PiiDeidentifier import to a forward reference with a linter comment
Comments suppressed due to low confidence (1)

nemo_curator/modifiers/pii_modifier.py:88

  • [nitpick] If not hindered by circular dependency issues, consider importing 'PiiDeidentifier' directly to avoid the need for a forward reference and linter suppression.
def load_deidentifier(self) -> "PiiDeidentifier":  # noqa: F821
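
To make the nitpick concrete, a hedged sketch of the suggested alternative; the module path and class name are assumptions for illustration, not taken from this PR:

# With a direct import, the return annotation no longer needs a forward
# reference or a noqa suppression. Module path assumed for illustration.
from nemo_curator.pii.algorithm import PiiDeidentifier


class PiiModifier:  # simplified stand-in for the real modifier class
    def load_deidentifier(self) -> PiiDeidentifier:
        ...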

Comment on lines 113 to 122
info = client.run(_worker_gpu_tuple)  # {worker_addr: (host, gpu)}
per_host = defaultdict(list)
for host, gpu in info.values():
    per_host[host].append(gpu)

dups = {h: [g for g in set(gs) if gs.count(g) > 1] for h, gs in per_host.items() if len(gs) != len(set(gs))}
if dups:
    raise RuntimeError(
        "Duplicate GPU assignment detected on host(s): " + ", ".join(f"{h}: {dups[h]}" for h in dups)
    )
@VibhuJawa (Collaborator, PR author) commented:

Hi @pentschev,
This is our current approach to check whether the CUDA context is unique across GPUs. It’s a recurring issue in Curator, as third-party libraries often interfere, and pinpointing the root cause is tricky.

Would love to hear your thoughts on whether a fix like this seems reasonable.

@pentschev replied:

What do you mean by "whether the CUDA context is unique across GPUs"? Are you trying to check whether there's only one CUDA context per GPU? If so, "fixing" is not possible: you cannot guarantee that some library won't create another context you were not expecting; what you can do is warn about it. In Dask, we have a function to check whether a context exists, and we use it like this:

  1. Get the expected device of the current worker;
  2. Check whether a context has already been created;
    a. If a context already exists, some other library created it unexpectedly, so we raise a warning;
  3. Create the CUDA context and check where (on which GPU) it was created;
    a. If the context was created on the wrong GPU, raise a warning.

I would suggest simply replicating the above; it is not straightforward to get all the corner cases right. I would also simply copy the relevant parts from https://github.com/dask/distributed/blob/main/distributed/diagnostics/nvml.py instead of rewriting them.
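
For concreteness, here is a rough sketch of that warn-rather-than-fix flow using pynvml directly; this is an illustration, not the code from distributed/diagnostics/nvml.py:

# Illustrative sketch of the warn-don't-fix approach described above;
# not the actual Dask nvml.py implementation.
import os
import warnings

import pynvml


def warn_on_unexpected_context(expected_device: int) -> None:
    # Before the worker creates its own context, check whether this process
    # already holds a CUDA context and, if so, on which physical GPU.
    pynvml.nvmlInit()
    pid = os.getpid()
    for idx in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(idx)
        procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
        if any(p.pid == pid for p in procs):
            if idx != expected_device:
                warnings.warn(
                    f"Existing CUDA context found on device {idx}, but this "
                    f"worker expected device {expected_device}",
                    stacklevel=2,
                )
            else:
                warnings.warn(
                    f"A CUDA context already existed on device {idx}; another "
                    "library may have initialized CUDA early",
                    stacklevel=2,
                )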

@VibhuJawa (Collaborator, PR author) replied:

Thanks for replying.

If so, "fixing" is not possible, you cannot guarantee some library won't create another context that you were not expecting, what you can do is warn about that.

Just for clarity, we have no intention of fixing it. All we want is a strict 1‑GPU‑per‑worker model and to catch any accidental over‑subscription early.

The other interesting thing is that we don't get a warning for the spaCy/Hugging Face-related CUDA context problems. I think @ayushdg had done some investigation into this too.

> I would also simply copy the relevant parts from https://github.com/dask/distributed/blob/main/distributed/diagnostics/nvml.py

This makes sense; let me try to fetch the device IDs, etc., using the code linked above. Thanks for that, it's helpful.

@VibhuJawa VibhuJawa added the gpuci Run GPU CI/CD on PR label Apr 18, 2025
@VibhuJawa VibhuJawa requested a review from Copilot April 21, 2025 20:40
@Copilot (Copilot AI, Contributor) left a comment

Pull Request Overview

This PR reverts previous changes and fixes CUDA context issues by ensuring unique GPU allocation across workers in the Dask-CUDA cluster while also cleaning up duplicate comments in the module initializer.

  • Added helper functions _worker_gpu_tuple and _assert_unique_gpu_per_host to verify unique GPU assignments per host.
  • Updated get_client to invoke the uniqueness check for GPU clusters (a sketch of this wiring follows below).
  • Removed duplicate comments in nemo_curator/modules/__init__.py.
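
A hypothetical sketch of that wiring; the real get_client in nemo_curator/utils/distributed_utils.py has a richer signature, and _assert_unique_gpu_per_host is the uniqueness helper whose core logic appears in the diff excerpt above:

# Hypothetical sketch only; signatures and cluster setup are simplified.
from dask.distributed import Client, LocalCluster


def get_client(cluster_type: str = "cpu", **cluster_kwargs) -> Client:
    if cluster_type == "gpu":
        from dask_cuda import LocalCUDACluster

        client = Client(LocalCUDACluster(**cluster_kwargs))
        # Fail fast if two workers on the same host ended up on the same GPU.
        _assert_unique_gpu_per_host(client)
    else:
        client = Client(LocalCluster(**cluster_kwargs))
    return client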

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File | Description
nemo_curator/utils/distributed_utils.py | Added functions to retrieve worker GPU info and enforce GPU uniqueness; updated client return logic.
nemo_curator/modules/__init__.py | Removed duplicate comments regarding PyTorch and cuGraph import ordering.
Comments suppressed due to low confidence (1)

nemo_curator/utils/distributed_utils.py:103

  • The newly added code calls warnings.warn but there is no explicit import of the warnings module in this file. Please ensure that 'import warnings' is added to avoid runtime errors.
warnings.warn(f"NVML error occurred: {e} while verifying GPU index", stacklevel=2)
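
A minimal fix would be adding the standard-library import near the top of nemo_curator/utils/distributed_utils.py:

import warnings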
