Add warning for small n_clusters in SemanticDeduplicationWorkflow #1376

KunalSachdev2005 · 2026-01-15T04:59:04Z

Description

Warn users when n_clusters < 1000. Each cluster must fit in memory, so a small cluster count can cause OOM errors on large datasets.
Fixes: #1224
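To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It assumes rows are spread evenly across clusters, float32 embeddings, and an embedding dimension of 1024; none of these figures come from the PR, and `estimate_cluster_bytes` is a hypothetical helper, not part of the library.

```python
def estimate_cluster_bytes(
    n_rows: int,
    n_clusters: int,
    embedding_dim: int,
    bytes_per_value: int = 4,  # assumption: float32 embeddings
) -> float:
    """Approximate bytes needed to hold the embeddings of one average-sized cluster."""
    rows_per_cluster = n_rows / n_clusters  # assumption: evenly sized clusters
    return rows_per_cluster * embedding_dim * bytes_per_value


GIB = 1024**3

# Illustrative dataset: one billion rows, 1024-dimensional embeddings.
for n_clusters in (100, 500, 1000, 10_000):
    size = estimate_cluster_bytes(1_000_000_000, n_clusters, 1024)
    print(f"n_clusters={n_clusters:>6}: ~{size / GIB:.1f} GiB per average cluster")
```

Real clusters are rarely uniform, so the largest cluster can be considerably bigger than this average. Halving n_clusters doubles the estimate.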

Usage

# Add snippet demonstrating usage

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

Warn users when n_clusters < 1000 as this may cause OOM errors for large datasets since each cluster must fit in memory. Fixes: NVIDIA-NeMo#1224 Signed-off-by: Kunal Sachdev <[email protected]>

copy-pr-bot · 2026-01-15T04:59:08Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-01-15T05:01:12Z

Greptile Summary

Added a warning when n_clusters < 1000 to alert users to possible out-of-memory errors on large datasets, addressing issue #1224.

Changes made:

Added MIN_RECOMMENDED_N_CLUSTERS = 1000 constant at the module level
Added validation check in _validate_config() method that logs a warning when n_clusters is below the threshold
Used logger.warning (consistent with project patterns) to inform users about potential OOM risks
Warning message clearly explains the issue and recommends using n_clusters >= 1000 for large datasets

Note: A previous reviewer suggested adding a test to verify the warning is raised. This would be a good addition to prevent regression, but the core functionality is correct as-is.

Confidence Score: 5/5

This PR is safe to merge with minimal risk
The change is a simple, non-breaking addition of a warning message that follows the existing codebase patterns. It uses logger.warning consistently with other warnings in the project, adds a well-documented constant, and addresses the issue requirements. The warning logic is straightforward with no edge cases or potential bugs.
No files require special attention

Important Files Changed

Filename	Overview
nemo_curator/stages/deduplication/semantic/workflow.py	Added warning when `n_clusters < 1000` to prevent OOM errors on large datasets - implementation is correct and follows project patterns

Sequence Diagram

sequenceDiagram
    participant User
    participant Workflow as SemanticDeduplicationWorkflow
    participant Logger
    
    User->>Workflow: __init__(n_clusters=500, ...)
    Workflow->>Workflow: Store configuration parameters
    Workflow->>Workflow: _validate_config()
    
    alt n_clusters < MIN_RECOMMENDED_N_CLUSTERS (1000)
        Workflow->>Logger: logger.warning("n_clusters=500 is less than 1000...")
        Logger-->>User: Display warning message
    end
    
    Workflow->>Workflow: Continue validation (distance_metric, etc.)
    Workflow-->>User: Workflow initialized

greptile-apps

1 file reviewed, 1 comment

greptile-apps · 2026-01-15T05:01:12Z

nemo_curator/stages/deduplication/semantic/workflow.py

+        if self.n_clusters < MIN_RECOMMENDED_N_CLUSTERS:
+            warnings.warn(
+                f"n_clusters={self.n_clusters} is less than {MIN_RECOMMENDED_N_CLUSTERS}. "
+                "For large datasets, this may result in out-of-memory errors since "
+                f"each cluster must fit in memory. Consider using n_clusters >= {MIN_RECOMMENDED_N_CLUSTERS} for large datasets.",
+                UserWarning,
+                stacklevel=2,
+            )


style: Consider adding a test to verify this warning is raised when n_clusters < 1000. This ensures the feature works as expected and prevents regression.

    # In test_workflow.py
    def test_warning_for_small_n_clusters():
        with pytest.warns(UserWarning, match="n_clusters=.*is less than 1000"):
            SemanticDeduplicationWorkflow(
                input_path="/tmp/test",
                output_path="/tmp/output",
                n_clusters=500
            )

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
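Note that pytest.warns only intercepts warnings issued through the warnings module. Since the summary above describes the final change as using logger.warning, a test would need to capture log output instead. The following is a minimal sketch of that idea, assuming a standard-library logging.Logger. If the project's logger comes from a different library, the capture mechanism would differ. `validate_n_clusters`, `captured_messages`, and the logger name are hypothetical stand-ins, not the project's actual code.

```python
import logging
from contextlib import contextmanager

logger = logging.getLogger("workflow_sketch")  # hypothetical logger name
MIN_RECOMMENDED_N_CLUSTERS = 1000


def validate_n_clusters(n_clusters: int) -> None:
    """Hypothetical stand-in for the n_clusters check in _validate_config()."""
    if n_clusters < MIN_RECOMMENDED_N_CLUSTERS:
        logger.warning(
            "n_clusters=%d is less than %d. For large datasets, this may result "
            "in out-of-memory errors since each cluster must fit in memory.",
            n_clusters,
            MIN_RECOMMENDED_N_CLUSTERS,
        )


class _ListHandler(logging.Handler):
    """Collects formatted log messages in a list for assertions."""

    def __init__(self) -> None:
        super().__init__(level=logging.WARNING)
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


@contextmanager
def captured_messages(target: logging.Logger):
    """Temporarily attach a list-collecting handler to the given logger."""
    handler = _ListHandler()
    target.addHandler(handler)
    try:
        yield handler.messages
    finally:
        target.removeHandler(handler)


def test_warning_for_small_n_clusters() -> None:
    with captured_messages(logger) as messages:
        validate_n_clusters(500)
    assert any("n_clusters=500 is less than 1000" in m for m in messages)


def test_no_warning_at_threshold() -> None:
    with captured_messages(logger) as messages:
        validate_n_clusters(1000)
    assert messages == []
```

With pytest and standard-library logging, the built-in caplog fixture could replace the hand-written handler; the explicit handler is used here only to keep the sketch free of fixtures.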

sarahyurick

Thanks @KunalSachdev2005!

greptile-apps · 2026-01-15T17:20:59Z

Greptile found no issues!

From now on, if a review finishes and we haven't found any issues, we will not post anything, but you can confirm that we reviewed your changes in the status check section.

This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".

sarahyurick · 2026-01-15T17:25:13Z

/ok to test aa6cab8

praateekmahajan · 2026-01-15T17:30:44Z

nemo_curator/stages/deduplication/semantic/workflow.py


+        # Warn if n_clusters is too small for large datasets
+        if self.n_clusters < MIN_RECOMMENDED_N_CLUSTERS:
+            warnings.warn(


Let's use logger.warning instead of warnings.warn...
wdyt @sarahyurick / @ayushdg
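For context on what this switch changes in practice, the sketch below shows that the two mechanisms are separate channels in the standard library: a message sent through warnings.warn is governed by warnings filters, while one sent through a logger goes to that logger's handlers. This assumes a standard-library logger; `warn_both_ways`, `ListHandler`, and the logger name are hypothetical and exist only for illustration.

```python
import logging
import warnings

logger = logging.getLogger("channel_demo")  # hypothetical logger name


class ListHandler(logging.Handler):
    """Collects formatted log messages in a list."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def warn_both_ways(n_clusters: int) -> None:
    # Channel 1: the warnings module (controlled by warnings filters).
    warnings.warn(f"warnings.warn: n_clusters={n_clusters}", UserWarning, stacklevel=2)
    # Channel 2: the logging system (controlled by logger levels and handlers).
    logger.warning("logger.warning: n_clusters=%d", n_clusters)


handler = ListHandler()
logger.addHandler(handler)
logger.propagate = False  # keep the demo output confined to our handler

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning inside this block
    warn_both_ways(500)

log_messages = handler.messages
warning_messages = [str(w.message) for w in caught]

print("seen by logging handler:", log_messages)
print("seen by warnings machinery:", warning_messages)
```

Each message shows up only on its own channel. The standard library's logging.captureWarnings(True) can route warnings into logging, but that is an opt-in the caller has to enable.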

Replaced warnings.warn with logger.warning as requested in PR review. Signed-off-by: Kunal Sachdev <[email protected]>

Add warning for small n_clusters in SemanticDeduplicationWorkflow

64fa471

github-actions bot added the community-request label Jan 15, 2026

greptile-apps bot reviewed Jan 15, 2026

sarahyurick approved these changes Jan 15, 2026

Merge branch 'main' into warn-small-n-clusters

aa6cab8

sarahyurick requested a review from praateekmahajan January 15, 2026 17:19

copy-pr-bot bot temporarily deployed to test January 15, 2026 17:25 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci January 15, 2026 17:25 Inactive

praateekmahajan reviewed Jan 15, 2026

Used logger.warning instead of warnings.warn

7a89747
