@KunalSachdev2005

Description

Warn users when n_clusters < 1000 as this may cause OOM errors for large datasets since each cluster must fit in memory.
Fixes: #1224
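
To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It assumes equally sized clusters and dense float32 embeddings, neither of which is guaranteed in practice (real k-means clusters are usually unbalanced, so the largest cluster can be far bigger than the average). The function name and the dataset figures are hypothetical and are not part of this PR.

```python
def estimated_bytes_per_cluster(num_rows, embedding_dim, n_clusters, bytes_per_value=4):
    """Average embedding bytes per cluster, assuming equally sized clusters.

    bytes_per_value=4 corresponds to float32 (an assumption for this sketch).
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    total_bytes = num_rows * embedding_dim * bytes_per_value
    return total_bytes / n_clusters


# Hypothetical dataset: 1 billion rows of 1024-dimensional float32 embeddings.
for n_clusters in (500, 1000, 10_000):
    gb = estimated_bytes_per_cluster(1_000_000_000, 1024, n_clusters) / 1e9
    print(f"n_clusters={n_clusters}: ~{gb:.1f} GB per cluster on average")
```

Under these assumptions, halving n_clusters doubles the average cluster size, which is why a small n_clusters becomes risky as the dataset grows.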

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Warn users when n_clusters < 1000 as this may cause OOM errors
for large datasets since each cluster must fit in memory.

Fixes: NVIDIA-NeMo#1224
Signed-off-by: Kunal Sachdev <[email protected]>
@copy-pr-bot

copy-pr-bot bot commented Jan 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Contributor

greptile-apps bot commented Jan 15, 2026

Greptile Summary

Added a warning when n_clusters < 1000 to prevent out-of-memory errors on large datasets, addressing issue #1224.

Changes made:

  • Added MIN_RECOMMENDED_N_CLUSTERS = 1000 constant at the module level
  • Added validation check in _validate_config() method that logs a warning when n_clusters is below the threshold
  • Used logger.warning (consistent with project patterns) to inform users about potential OOM risks
  • Warning message clearly explains the issue and recommends using n_clusters >= 1000 for large datasets

Note: A previous reviewer suggested adding a test to verify the warning is raised. This would be a good addition to prevent regression, but the core functionality is correct as-is.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The change is a simple, non-breaking addition of a warning message that follows the existing codebase patterns. It uses logger.warning consistently with other warnings in the project, adds a well-documented constant, and addresses the issue requirements. The warning logic is straightforward with no edge cases or potential bugs.
  • No files require special attention

Important Files Changed

Filename | Overview
nemo_curator/stages/deduplication/semantic/workflow.py | Added warning when n_clusters < 1000 to prevent OOM errors on large datasets - implementation is correct and follows project patterns

Sequence Diagram

sequenceDiagram
    participant User
    participant Workflow as SemanticDeduplicationWorkflow
    participant Logger
    
    User->>Workflow: __init__(n_clusters=500, ...)
    Workflow->>Workflow: Store configuration parameters
    Workflow->>Workflow: _validate_config()
    
    alt n_clusters < MIN_RECOMMENDED_N_CLUSTERS (1000)
        Workflow->>Logger: logger.warning("n_clusters=500 is less than 1000...")
        Logger-->>User: Display warning message
    end
    
    Workflow->>Workflow: Continue validation (distance_metric, etc.)
    Workflow-->>User: Workflow initialized

Contributor

@greptile-apps greptile-apps bot left a comment

1 file reviewed, 1 comment

Comment on lines 203 to 210
    if self.n_clusters < MIN_RECOMMENDED_N_CLUSTERS:
        warnings.warn(
            f"n_clusters={self.n_clusters} is less than {MIN_RECOMMENDED_N_CLUSTERS}. "
            "For large datasets, this may result in out-of-memory errors since "
            f"each cluster must fit in memory. Consider using n_clusters >= {MIN_RECOMMENDED_N_CLUSTERS} for large datasets.",
            UserWarning,
            stacklevel=2,
        )

style: Consider adding a test to verify this warning is raised when n_clusters < 1000. This ensures the feature works as expected and prevents regression.

# In test_workflow.py
def test_warning_for_small_n_clusters():
    with pytest.warns(UserWarning, match="n_clusters=.*is less than 1000"):
        SemanticDeduplicationWorkflow(
            input_path="/tmp/test",
            output_path="/tmp/output", 
            n_clusters=500
        )

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Contributor

@sarahyurick sarahyurick left a comment

@greptile-apps
Contributor

greptile-apps bot commented Jan 15, 2026

Greptile found no issues!

From now on, if a review finishes and we haven't found any issues, we will not post anything, but you can confirm that we reviewed your changes in the status check section.

This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".

@sarahyurick
Contributor

/ok to test aa6cab8


    # Warn if n_clusters is too small for large datasets
    if self.n_clusters < MIN_RECOMMENDED_N_CLUSTERS:
        warnings.warn(

Let's use logger.warning instead of warnings.warn...
wdyt @sarahyurick / @ayushdg

Replaced warnings.warn with logger.warning as requested in PR review.

Signed-off-by: Kunal Sachdev <[email protected]>
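
Once the check goes through logger.warning, a test built on pytest.warns(UserWarning) would no longer see anything, because no Python warning is issued. Below is a minimal sketch of how a regression test could capture the log message instead. It assumes a standard-library logging logger; if the project's logger is a different library, the capture mechanism would need to change. validate_n_clusters, captured_warnings, and the logger name are hypothetical stand-ins, not the workflow's actual code.

```python
import logging
from contextlib import contextmanager

MIN_RECOMMENDED_N_CLUSTERS = 1000

# Assumption: a stdlib logger standing in for the project's logger.
logger = logging.getLogger("semdedup_sketch")


def validate_n_clusters(n_clusters):
    """Hypothetical stand-in for the n_clusters check in _validate_config()."""
    if n_clusters < MIN_RECOMMENDED_N_CLUSTERS:
        logger.warning(
            "n_clusters=%s is less than %s. For large datasets, this may result in "
            "out-of-memory errors since each cluster must fit in memory.",
            n_clusters,
            MIN_RECOMMENDED_N_CLUSTERS,
        )


class _ListHandler(logging.Handler):
    """Collects formatted WARNING-or-higher messages in a list."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


@contextmanager
def captured_warnings(target_logger):
    """Temporarily attach a list handler and yield the captured messages."""
    handler = _ListHandler()
    target_logger.addHandler(handler)
    try:
        yield handler.messages
    finally:
        target_logger.removeHandler(handler)


def test_warning_for_small_n_clusters():
    with captured_warnings(logger) as messages:
        validate_n_clusters(500)
    assert len(messages) == 1
    assert "n_clusters=500 is less than 1000" in messages[0]


def test_no_warning_at_threshold():
    # The check is a strict "<", so exactly 1000 should stay silent.
    with captured_warnings(logger) as messages:
        validate_n_clusters(1000)
    assert messages == []
```

With pytest and standard-library logging, the built-in caplog fixture could replace the hand-written handler; the explicit handler is used here only to keep the sketch dependency-free.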
Development

Successfully merging this pull request may close these issues.

Warn if n_clusters < 1000 as a default in SemDedup

3 participants