-
Notifications
You must be signed in to change notification settings - Fork 210
Add warning for small n_clusters in SemanticDeduplicationWorkflow #1376
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add warning for small n_clusters in SemanticDeduplicationWorkflow #1376
Conversation
Warn users when n_clusters < 1000 as this may cause OOM errors for large datasets since each cluster must fit in memory. Fixes: NVIDIA-NeMo#1224 Signed-off-by: Kunal Sachdev <[email protected]>
Greptile SummaryAdded a warning when Changes made:
Note: A previous reviewer suggested adding a test to verify the warning is raised. This would be a good addition to prevent regression, but the core functionality is correct as-is. Confidence Score: 5/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant User
participant Workflow as SemanticDeduplicationWorkflow
participant Logger
User->>Workflow: __init__(n_clusters=500, ...)
Workflow->>Workflow: Store configuration parameters
Workflow->>Workflow: _validate_config()
alt n_clusters < MIN_RECOMMENDED_N_CLUSTERS (1000)
Workflow->>Logger: logger.warning("n_clusters=500 is less than 1000...")
Logger-->>User: Display warning message
end
Workflow->>Workflow: Continue validation (distance_metric, etc.)
Workflow-->>User: Workflow initialized
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 file reviewed, 1 comment
| if self.n_clusters < MIN_RECOMMENDED_N_CLUSTERS: | ||
| warnings.warn( | ||
| f"n_clusters={self.n_clusters} is less than {MIN_RECOMMENDED_N_CLUSTERS}. " | ||
| "For large datasets, this may result in out-of-memory errors since " | ||
| f"each cluster must fit in memory. Consider using n_clusters >= {MIN_RECOMMENDED_N_CLUSTERS} for large datasets.", | ||
| UserWarning, | ||
| stacklevel=2, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Consider adding a test to verify this warning is raised when n_clusters < 1000. This ensures the feature works as expected and prevents regression.
# In test_workflow.py
def test_warning_for_small_n_clusters():
with pytest.warns(UserWarning, match="n_clusters=.*is less than 1000"):
SemanticDeduplicationWorkflow(
input_path="/tmp/test",
output_path="/tmp/output",
n_clusters=500
)Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
sarahyurick
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @KunalSachdev2005 !
Greptile found no issues!From now on, if a review finishes and we haven't found any issues, we will not post anything, but you can confirm that we reviewed your changes in the status check section. This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR". |
|
/ok to test aa6cab8 |
|
|
||
| # Warn if n_clusters is too small for large datasets | ||
| if self.n_clusters < MIN_RECOMMENDED_N_CLUSTERS: | ||
| warnings.warn( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use logger.warning instead of warnings.warn...
wdyt @sarahyurick / @ayushdg
Replaced warnings.warn with logger.warning as requested in PR review. Signed-off-by: Kunal Sachdev <[email protected]>
Description
Warn users when n_clusters < 1000 as this may cause OOM errors for large datasets since each cluster must fit in memory.
Fixes: #1224
Usage
# Add snippet demonstrating usageChecklist