@praateekmahajan
Contributor

Description

Pins scikit-learn to <1.8.0, since cuml 25.10 is incompatible with scikit-learn 1.8.0+ and its import fails if a user unintentionally installs such a version.

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Praateek <[email protected]>
@greptile-apps
Contributor

greptile-apps bot commented Jan 13, 2026

Greptile Overview

Greptile Summary

Pins scikit-learn to <1.8.0 in both deduplication_cuda12 extras and test dependencies to prevent import failures with cuml 25.10.

Key Changes:

  • Added scikit-learn<1.8.0 to deduplication_cuda12 optional dependency group
  • Added scikit-learn<1.8.0 to test dependency group
  • Updated uv.lock to reflect the new constraint with revision bump from 2 to 3

Context:
The change is necessary because cuml 25.10 is incompatible with scikit-learn 1.8.0+. The codebase uses cuml's KMeans implementation for semantic deduplication (imported in nemo_curator/stages/deduplication/semantic/kmeans.py), and the tests use scikit-learn's make_blobs and adjusted_rand_score utilities.
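
As an illustration of what the new upper bound means in practice, here is a minimal sketch of a runtime check for the same constraint. It is not part of this PR: the function names `parse_release`, `satisfies_pin`, and `check_installed_sklearn` are hypothetical, and the version parsing is a deliberate simplification that only looks at the leading numeric release segment rather than implementing full PEP 440 comparison.

```python
import re
from importlib import metadata

# Upper bound taken from the constraint added in this PR: scikit-learn<1.8.0
SKLEARN_UPPER_BOUND = (1, 8, 0)


def parse_release(version: str) -> tuple[int, ...]:
    """Return the leading numeric release segment, padded to three parts.

    Simplification (assumption): suffixes such as 'rc1' or '.dev0' are
    ignored, so '1.8.0rc1' is treated as (1, 8, 0).
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    parts = tuple(int(p) for p in match.group(0).split("."))
    if len(parts) < 3:
        parts = parts + (0,) * (3 - len(parts))
    return parts


def satisfies_pin(version: str) -> bool:
    """True if the version falls below the 1.8.0 upper bound."""
    return parse_release(version) < SKLEARN_UPPER_BOUND


def check_installed_sklearn() -> None:
    """Raise early with a readable message if scikit-learn is too new."""
    try:
        installed = metadata.version("scikit-learn")
    except metadata.PackageNotFoundError:
        return  # scikit-learn is not installed, nothing to check
    if not satisfies_pin(installed):
        raise RuntimeError(
            f"scikit-learn {installed} is installed, but cuml 25.10 "
            "requires scikit-learn<1.8.0"
        )
```

A check like this could run before importing cuml to turn an opaque import failure into an actionable message. The dependency pin in pyproject.toml remains the actual fix, since it prevents the incompatible version from being resolved in the first place.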

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The change is a straightforward dependency version constraint that prevents a known incompatibility. It only adds an upper bound to sklearn version (not changing existing functionality), applies consistently to both production and test environments, and follows the existing pattern of version pinning in the project (e.g., torch<=2.9.0, flash-attn<=2.8.3)
  • No files require special attention

Important Files Changed

File Analysis

Filename | Score | Overview
pyproject.toml | 5/5 | Added scikit-learn<1.8.0 constraint to deduplication_cuda12 extras and test dependencies to prevent cuml 25.10 import failures
uv.lock | 5/5 | Updated lockfile to reflect new scikit-learn<1.8.0 constraint, adding dependency to cuda12 extras and updating test group

Sequence Diagram

sequenceDiagram
    participant User
    participant PyProject as pyproject.toml
    participant UVLock as uv.lock
    participant PackageManager as Package Manager (pip/uv)
    participant CuML as cuml-cu12 (25.10.*)
    participant SKLearn as scikit-learn

    User->>PyProject: Install with deduplication_cuda12 extras
    PyProject->>PackageManager: Request cuml-cu12==25.10.*
    PyProject->>PackageManager: Request scikit-learn<1.8.0
    PackageManager->>UVLock: Resolve dependencies
    UVLock->>PackageManager: Return resolved versions
    PackageManager->>CuML: Install cuml-cu12 25.10.*
    PackageManager->>SKLearn: Install scikit-learn <1.8.0
    Note over CuML,SKLearn: Version constraint prevents<br/>incompatibility
    SKLearn-->>CuML: Compatible version installed
    CuML-->>User: Import successful

    User->>PyProject: Run tests with test dependencies
    PyProject->>PackageManager: Request scikit-learn<1.8.0
    PackageManager->>SKLearn: Install scikit-learn <1.8.0
    SKLearn-->>User: Test utilities available

Contributor

@greptile-apps greptile-apps bot left a comment

No files reviewed, no comments

@ayushdg ayushdg added the r1.1.0 Pick this label for auto cherry-picking into r1.1.0 label Jan 13, 2026
Labels

r1.1.0 Pick this label for auto cherry-picking into r1.1.0

2 participants