Fail loudly for NeMo Curator Dask-Cuda cluster creation CUDA context issues #675
base: main
Conversation
Signed-off-by: Vibhu Jawa <[email protected]>
Pull Request Overview
This PR reverts prior changes and refactors GPU cluster client creation to ensure proper CUDA context usage across multiple GPUs. The key changes include:
- Introducing functions (_worker_gpu_tuple and _assert_unique_gpu_per_host) to verify unique GPU assignment on each host (see the sketch after this list).
- Adjusting the client initialization flow to perform GPU uniqueness checks for "gpu" clusters.
- Tidying up import comments in the modules initializer and modifying the PII deidentifier import for clarity.
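The two helper names above come from the PR description; their bodies are not shown in this thread, so the following is only a rough sketch of how such a per-host uniqueness check could look, assuming a Dask-CUDA setup and using the physical PCI bus ID reported by CuPy as the per-GPU identity. The actual implementation in the PR may differ.

```python
import socket
from collections import defaultdict

import cupy as cp
from dask.distributed import Client


def _worker_gpu_tuple() -> tuple[str, str]:
    """Runs on each worker: report (hostname, PCI bus ID of the current GPU).

    The PCI bus ID identifies the physical device even though dask-cuda
    remaps per-worker device indices through CUDA_VISIBLE_DEVICES.
    """
    return socket.gethostname(), cp.cuda.Device().pci_bus_id


def _assert_unique_gpu_per_host(client: Client) -> None:
    """Raise if two Dask workers on the same host report the same physical GPU."""
    per_worker = client.run(_worker_gpu_tuple)  # {worker_address: (host, bus_id)}
    gpus_per_host: dict[str, list[str]] = defaultdict(list)
    for host, bus_id in per_worker.values():
        gpus_per_host[host].append(bus_id)
    for host, bus_ids in gpus_per_host.items():
        if len(bus_ids) != len(set(bus_ids)):
            msg = (
                f"Workers on host {host} share a GPU ({sorted(bus_ids)}); "
                "each Dask-CUDA worker is expected to own a unique GPU."
            )
            raise RuntimeError(msg)
```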
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
File | Description
--- | ---
nemo_curator/utils/distributed_utils.py | Adds GPU worker functions and integrates a uniqueness assertion check
nemo_curator/modules/__init__.py | Removes duplicate PyTorch-related import comments and clarifies ordering
nemo_curator/modifiers/pii_modifier.py | Changes the PiiDeidentifier import to a forward reference with a linter comment
Comments suppressed due to low confidence (1)
nemo_curator/modifiers/pii_modifier.py:88
- [nitpick] If not hindered by circular dependency issues, consider importing 'PiiDeidentifier' directly to avoid the need for a forward reference and linter suppression.
def load_deidentifier(self) -> "PiiDeidentifier": # noqa: F821
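If circular imports are the blocker, one common alternative to the forward reference plus noqa is a TYPE_CHECKING-only import. A minimal sketch, assuming PiiDeidentifier lives under nemo_curator.pii.algorithm (the module path is not verified here):

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by static type checkers, so a potential circular import
    # is avoided at runtime. The module path is an assumption for illustration.
    from nemo_curator.pii.algorithm import PiiDeidentifier


class PiiModifier:
    def load_deidentifier(self) -> PiiDeidentifier:
        # Construct and return the deidentifier here (e.g. via a lazy,
        # call-time import); the noqa: F821 suppression is no longer needed.
        ...
```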
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Pull Request Overview
This PR reverts previous changes and fixes CUDA context issues by ensuring unique GPU allocation across workers in the Dask-CUDA cluster while also cleaning up duplicate comments in the module initializer.
- Added helper functions _worker_gpu_tuple and _assert_unique_gpu_per_host to verify unique GPU assignments per host.
- Updated get_client to invoke the uniqueness check for GPU clusters (sketched after this list).
- Removed duplicate comments in nemo_curator/modules/__init__.py.
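As a rough illustration of where that check would sit, here is a heavily simplified get_client (not the real signature), reusing the _assert_unique_gpu_per_host sketch from the first review above:

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster


def get_client(cluster_type: str = "cpu", **cluster_kwargs) -> Client:
    """Simplified flow: only the GPU branch runs the uniqueness check."""
    if cluster_type == "gpu":
        client = Client(LocalCUDACluster(**cluster_kwargs))
        # Fail loudly at startup instead of silently oversubscribing one GPU.
        _assert_unique_gpu_per_host(client)
        return client
    return Client(**cluster_kwargs)
```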
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
File | Description
--- | ---
nemo_curator/utils/distributed_utils.py | Added functions to retrieve worker GPU info and enforce GPU uniqueness; updated client return logic
nemo_curator/modules/__init__.py | Removed duplicate comments regarding PyTorch and cuGraph import ordering
Comments suppressed due to low confidence (1)
nemo_curator/utils/distributed_utils.py:103
- The newly added code calls warnings.warn but there is no explicit import of the warnings module in this file. Please ensure that 'import warnings' is added to avoid runtime errors.
warnings.warn(f"NVML error occurred: {e} while verifying GPU index", stacklevel=2)
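For illustration only, the pattern Copilot is asking for would look roughly like the hypothetical helper below (the function name and the specific NVML lookup are assumptions, not the PR's actual code); the key point is the module-level import warnings:

```python
import warnings

import pynvml


def _verify_gpu_index(index: int):
    """Best-effort NVML lookup; warn instead of failing hard if NVML is unavailable."""
    try:
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        return pynvml.nvmlDeviceGetUUID(handle)
    except pynvml.NVMLError as e:
        # This call is why `import warnings` must exist at module level.
        warnings.warn(f"NVML error occurred: {e} while verifying GPU index", stacklevel=2)
        return None
    finally:
        try:
            pynvml.nvmlShutdown()
        except pynvml.NVMLError:
            pass
```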
Signed-off-by: Vibhu Jawa <[email protected]>
Pull Request Overview
This PR reverts previous changes to restore ruff formatting and introduces a check ensuring that each Dask worker on a given host owns a unique GPU. Key changes include:
- Adding a new helper (_worker_gpu_tuple) to initialize the CUDA context on workers.
- Implementing _assert_unique_gpu_per_host to verify unique GPU assignment on a host.
- Minor reordering of import comments in __init__.py.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
File | Description
--- | ---
nemo_curator/utils/distributed_utils.py | Introduces GPU context initialization and duplicate GPU assignment checks
nemo_curator/modules/__init__.py | Removes redundant PyTorch import comments
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
LGTM!
Signed-off-by: Vibhu Jawa <[email protected]>
# Touch the GPU so a context is created (idempotent if one already exists)
cp.cuda.runtime.getDevice()
This does not create a CUDA context; you can verify that by watching nvidia-smi while launching a process that calls this. In Dask we canonically use numba.cuda.current_context(); I would suggest the same here because it is already tested and known to work without any side effects.
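A minimal sketch of the suggested alternative, assuming numba is available on the GPU workers (the helper name here is hypothetical, not from the PR):

```python
import numba.cuda


def _touch_cuda_context() -> int:
    """Create (or reuse) the CUDA context on the worker's assigned device.

    numba.cuda.current_context() initializes a context if none exists and is a
    no-op otherwise; a newly created context shows up as a process in nvidia-smi.
    Returns the device id seen inside that context.
    """
    ctx = numba.cuda.current_context()
    return ctx.device.id
```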
Description
This PR fixes https://github.com/NVIDIA/NeMo-Curator/pull/61/files by ensuring we always have CUDA contexts spread across multiple GPUs.
Local Test to verify this:
With PR:
Without PR (No error/warnings are raised):
But nvidia-smi looks like below: