Fix NeMo Curator Cluster Creation Cuda context issues #675
base: main
Conversation
Signed-off-by: Vibhu Jawa <[email protected]>
Pull Request Overview
This PR reverts prior changes and refactors GPU cluster client creation to ensure proper CUDA context usage across multiple GPUs. The key changes include:
- Introducing functions (_worker_gpu_tuple and _assert_unique_gpu_per_host) to verify unique GPU assignment on each host.
- Adjusting the client initialization flow to perform GPU uniqueness checks for "gpu" clusters.
- Tidying up import comments in the modules initializer and modifying the PII deidentifier import for clarity.
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
File | Description
---|---
nemo_curator/utils/distributed_utils.py | Adds GPU worker functions and integrates a uniqueness assertion check
nemo_curator/modules/__init__.py | Removes duplicate PyTorch-related import comments and clarifies order
nemo_curator/modifiers/pii_modifier.py | Changes the PiiDeidentifier import to a forward reference with a linter comment
Comments suppressed due to low confidence (1)
nemo_curator/modifiers/pii_modifier.py:88
- [nitpick] If not hindered by circular dependency issues, consider importing 'PiiDeidentifier' directly to avoid the need for a forward reference and linter suppression.
def load_deidentifier(self) -> "PiiDeidentifier": # noqa: F821
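For reference, one common way to keep the import lazy while dropping the forward-reference suppression is a TYPE_CHECKING-guarded import. This is a minimal sketch, not Curator's actual layout; the module path and the surrounding class are assumptions:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by static type checkers, so there is no import cycle at runtime.
    # The module path below is assumed for illustration.
    from nemo_curator.pii.algorithm import PiiDeidentifier


class PiiModifier:  # stand-in for the real modifier class
    def load_deidentifier(self) -> "PiiDeidentifier":
        # The string annotation resolves for type checkers via the guarded import,
        # so the undefined-name suppression (# noqa: F821) is no longer needed.
        ...
```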
Signed-off-by: Vibhu Jawa <[email protected]>
info = client.run(_worker_gpu_tuple)  # {worker_addr: (host, gpu)}
per_host = defaultdict(list)
for host, gpu in info.values():
    per_host[host].append(gpu)

dups = {h: [g for g in set(gs) if gs.count(g) > 1] for h, gs in per_host.items() if len(gs) != len(set(gs))}
if dups:
    raise RuntimeError(
        "Duplicate GPU assignment detected on host(s): " + ", ".join(f"{h}: {dups[h]}" for h in dups)
    )
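For context, a minimal sketch of the worker-side helper this snippet calls; the real implementation in the PR may differ, and the reliance on `CUDA_VISIBLE_DEVICES` and `socket.gethostname()` here is an assumption:

```python
import os
import socket


def _worker_gpu_tuple() -> tuple[str, str]:
    """Report (hostname, first visible GPU) from inside a Dask worker.

    Sketch only: assumes each dask-cuda worker pins its device through
    CUDA_VISIBLE_DEVICES, which is dask-cuda's default behaviour.
    """
    host = socket.gethostname()
    gpu = os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")[0]
    return host, gpu


# From the client process, client.run(_worker_gpu_tuple) yields a mapping like
# {worker_addr: ("node-01", "0"), ...}, which the duplicate check above folds
# into per-host GPU lists.
```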
Hi @pentschev,
This is our current approach to check whether the CUDA context is unique across GPUs. It’s a recurring issue in Curator, as third-party libraries often interfere, and pinpointing the root cause is tricky.
Would love to hear your thoughts on whether a fix like this seems reasonable.
What do you mean by "whether the CUDA context is unique across GPUs"? Are you trying to check whether there's only one CUDA context per GPU? If so, "fixing" is not possible: you cannot guarantee that some library won't create another context you were not expecting; what you can do is warn about it. In Dask, we have a function to check whether a context exists, and we use it like this:
1. Get the expected device of the current worker;
2. Check whether a context has already been created;
   a. If a context already exists, it means some other library created a context we didn't expect, so we raise a warning;
3. Create the CUDA context and check where (on which GPU) the context was created;
   a. If the context was created on the wrong GPU, raise a warning.
I would suggest simply replicating the above; it is not straightforward to get all the corner cases right. I would also simply copy the relevant parts from https://github.com/dask/distributed/blob/main/distributed/diagnostics/nvml.py instead of rewriting them.
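For illustration, a rough sketch of that warn-only pattern using pynvml directly. It is loosely modeled on the Dask approach rather than copied from it; the helper name and the expected-device handling are assumptions:

```python
import os
import warnings

import pynvml


def _warn_on_unexpected_cuda_context(expected_device: int) -> None:
    """Warn if this process already holds a CUDA context, or holds one on the wrong GPU.

    Sketch only, loosely modeled on distributed/diagnostics/nvml.py; this is not
    the actual Dask implementation.
    """
    pynvml.nvmlInit()
    pid = os.getpid()
    try:
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            try:
                procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
            except pynvml.NVMLError:
                # Some devices/drivers do not support this query; skip them.
                continue
            if any(p.pid == pid for p in procs):
                if index != expected_device:
                    warnings.warn(
                        f"CUDA context for PID {pid} found on GPU {index}, "
                        f"but this worker expected GPU {expected_device}.",
                        stacklevel=2,
                    )
                else:
                    warnings.warn(
                        f"A CUDA context already exists on GPU {index} before worker "
                        "initialization; another library may have created it.",
                        stacklevel=2,
                    )
    finally:
        pynvml.nvmlShutdown()
```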
Thanks for replying.
> If so, "fixing" is not possible, you cannot guarantee some library won't create another context that you were not expecting, what you can do is warn about that.
Just for clarity, we have no intention of fixing it. All we want is a strict 1‑GPU‑per‑worker model and to catch any accidental over‑subscription early.
The other interesting thing is that we don't get a warning for spaCy/Hugging Face-related CUDA context problems. I think @ayushdg had done some investigation into this too.
> I would also simply copy the relevant parts from https://github.com/dask/distributed/blob/main/distributed/diagnostics/nvml.py
This makes sense; let me try to fetch the device IDs etc. using the code linked above. Thanks for that, it's helpful.
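A rough sketch of what fetching per-worker device identifiers could look like with pynvml; the helper name is hypothetical, and the assumption that `CUDA_VISIBLE_DEVICES` holds a single integer index may not hold on every setup:

```python
import os
import socket

import pynvml


def _worker_device_uuid() -> tuple[str, str]:
    """Return (hostname, GPU UUID) for the device this worker is pinned to.

    Hypothetical helper: assumes CUDA_VISIBLE_DEVICES contains a single
    integer index, as dask-cuda sets by default.
    """
    pynvml.nvmlInit()
    try:
        index = int(os.environ.get("CUDA_VISIBLE_DEVICES", "0").split(",")[0])
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        uuid = pynvml.nvmlDeviceGetUUID(handle)
        # Older pynvml versions return bytes, newer ones return str.
        if isinstance(uuid, bytes):
            uuid = uuid.decode()
        return socket.gethostname(), uuid
    finally:
        pynvml.nvmlShutdown()


# From the client process:
# uuids = client.run(_worker_device_uuid)  # {worker_addr: (host, uuid), ...}
```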
Signed-off-by: Vibhu Jawa <[email protected]>
Pull Request Overview
This PR reverts previous changes and fixes CUDA context issues by ensuring unique GPU allocation across workers in the Dask-CUDA cluster while also cleaning up duplicate comments in the module initializer.
- Added helper functions _worker_gpu_tuple and _assert_unique_gpu_per_host to verify unique GPU assignments per host.
- Updated get_client to invoke the uniqueness check for GPU clusters.
- Removed duplicate comments in nemo_curator/modules/__init__.py.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
File | Description
---|---
nemo_curator/utils/distributed_utils.py | Added functions to retrieve worker GPU info and enforce GPU uniqueness; updated client return logic.
nemo_curator/modules/__init__.py | Removed duplicate comments regarding PyTorch and cuGraph import ordering.
Comments suppressed due to low confidence (1)
nemo_curator/utils/distributed_utils.py:103
- The newly added code calls warnings.warn but there is no explicit import of the warnings module in this file. Please ensure that 'import warnings' is added to avoid runtime errors.
warnings.warn(f"NVML error occurred: {e} while verifying GPU index", stacklevel=2)
Description
This PR reverts to the changes made by @ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/61/files, which were later undone as part of the ruff changes.
We also fix https://github.com/NVIDIA/NeMo-Curator/pull/61/files by ensuring the CUDA contexts are always spread across multiple GPUs.
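For reference, a usage sketch of how the check would surface when creating a GPU client; the `get_client` parameters shown are assumptions based on the file this PR touches:

```python
import os

from nemo_curator.utils.distributed_utils import get_client


def _visible_devices() -> str | None:
    return os.environ.get("CUDA_VISIBLE_DEVICES")


# Hypothetical usage: with this PR, creating a "gpu" client also triggers the
# per-host GPU uniqueness assertion, failing fast on accidental over-subscription
# instead of silently stacking CUDA contexts on one device.
client = get_client(cluster_type="gpu")
print(client.run(_visible_devices))  # e.g. {worker_addr: "0", ...} -- one GPU per worker
```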