@suiyoubi
Contributor

Description

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot

copy-pr-bot bot commented Jan 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@suiyoubi
Contributor Author

/ok to test c8f7aa9

@greptile-apps
Contributor

greptile-apps bot commented Jan 13, 2026

Greptile Summary

This PR improves video tutorial documentation with extensive enhancements to the semantic deduplication guide, better organization of video loading documentation, and corrections to broken reference links across multiple modalities.

Key Changes:

  • Substantially expands docs/curate-video/process-data/dedup.md with comprehensive examples showing both the simplified SemanticDeduplicationWorkflow and individual stage approach, detailed parameter tables, and practical guidance on determining the eps threshold
  • Reorganizes docs/curate-video/load-data/index.md with improved formatting and clearer separation of partitioning and reader stages
  • Corrects broken cross-reference links throughout the documentation (audio/text/image/video get-started guides and text deduplication index)

Critical Issue Found:

  • API usage error in dedup.md, lines 72-74: The example calls workflow.run(executor), which will fail at runtime. The method takes two optional parameters, kmeans_executor and pairwise_executor, and kmeans_executor must be a RayActorPoolExecutor. Passing a single XennaExecutor() positionally binds it to kmeans_executor, causing a validation error.

Confidence Score: 2/5

  • This PR has a critical API usage error that will cause runtime failures for users following the documentation example.
  • The main documentation file (dedup.md) contains a critical error on lines 72-74, where workflow.run() is called with the wrong argument. The example passes a single XennaExecutor() positionally, which binds it to the kmeans_executor parameter, but the implementation requires kmeans_executor to be a RayActorPoolExecutor. Users who run the provided code will hit a runtime validation error. The remaining 9 files in the PR are minor reference link corrections with a confidence score of 5/5. The low overall score is driven by this blocking issue in the primary documentation file.
  • docs/curate-video/process-data/dedup.md (critical API usage error on lines 72-74)

Important Files Changed

| Filename | Overview |
|---|---|
| docs/curate-video/process-data/dedup.md | Extensively improved documentation with detailed workflow examples and API reference. However, critical API usage error on line 74: workflow.run(executor) incorrectly passes a single XennaExecutor() which will fail runtime validation. Requires proper named parameter usage with correct executor types. |
| docs/curate-video/load-data/index.md | Good documentation restructuring with improved organization and clarity. Reorganizes content into clearer sections with better formatting and list structure for readers. |
| docs/curate-text/process-data/deduplication/index.md | Corrects broken reference links to semantic deduplication documentation from text-process-data-dedup-semdedup to text-process-data-format-sem-dedup. |
| docs/curate-audio/tutorials/beginner.md | Corrects broken documentation reference links from file paths to proper reference labels (admin-installation and gs-audio). |
| docs/get-started/audio.md | Corrects reference label from reference-infrastructure-container-environments-main to reference-infrastructure-container-environments. |
| docs/get-started/image.md | Corrects reference label from reference-infrastructure-container-environments-main to reference-infrastructure-container-environments. |
| docs/get-started/text.md | Corrects reference label from reference-infrastructure-container-environments-main to reference-infrastructure-container-environments. |
| docs/get-started/video.md | Corrects reference label from reference-infrastructure-container-environments-main to reference-infrastructure-container-environments. |

Sequence Diagram

sequenceDiagram
    participant User
    participant Docs as Documentation
    participant Workflow as SemanticDeduplicationWorkflow
    participant KMeans as KMeansStage
    participant Pairwise as PairwiseStage
    participant Identify as IdentifyDuplicatesStage

    User->>Docs: Reads dedup.md for API usage
    Docs->>User: Shows workflow.run(executor) example
    User->>Workflow: Calls workflow.run(XennaExecutor())
    Note over Workflow: ❌ ERROR: XennaExecutor assigned to<br/>kmeans_executor parameter<br/>Expects RayActorPoolExecutor
    
    Note over User,Workflow: CORRECT USAGE:
    User->>Workflow: Calls workflow.run(kmeans_executor=..., pairwise_executor=...)
    Workflow->>KMeans: Passes RayActorPoolExecutor
    Workflow->>Pairwise: Passes XennaExecutor
    KMeans->>Pairwise: K-means output
    Pairwise->>Identify: Similarity scores
    Identify->>User: Duplicate IDs

Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 2 comments

Comment on lines +73 to +75
# Run with XennaExecutor (GPU-accelerated)
executor = XennaExecutor()
results = workflow.run(executor)

Incorrect API usage: The workflow.run() method signature is run(self, kmeans_executor: BaseExecutor | None = None, pairwise_executor: BaseExecutor | None = None), but this example passes a single executor parameter.

When you call workflow.run(executor), the executor is interpreted as kmeans_executor. However, SemanticDeduplicationWorkflow requires kmeans_executor to be a RayActorPoolExecutor (see lines 343-345 of the workflow.py implementation). Passing XennaExecutor() will cause a runtime error: "kmeans_executor must be an instance of RayActorPoolExecutor."

Fix: Update the example to correctly specify both executors or rely on the defaults:

Suggested change
- # Run with XennaExecutor (GPU-accelerated)
- executor = XennaExecutor()
- results = workflow.run(executor)
+ # Run the workflow (uses default executors: RayActorPoolExecutor for K-means, XennaExecutor for pairwise)
+ results = workflow.run()
+ # Or explicitly specify executors:
+ # from nemo_curator.backends.experimental.ray_actor_pool import RayActorPoolExecutor
+ # results = workflow.run(kmeans_executor=RayActorPoolExecutor(), pairwise_executor=XennaExecutor())

Alternatively, if you want to keep showing an explicit executor call, the documentation should clarify the correct parameter names and requirements.
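
For readers who want to see why the positional call lands on the wrong parameter, here is a minimal sketch that uses only the standard library. The three executor classes and the `run` function are stand-ins that copy the parameter order quoted in this comment; they are not the real nemo_curator classes, and `bound_parameter_names` is a hypothetical helper added for illustration.

```python
import inspect


class BaseExecutor:
    """Stand-in for the real executor base class (hypothetical)."""


class RayActorPoolExecutor(BaseExecutor):
    """Stand-in only; the real class lives in nemo_curator."""


class XennaExecutor(BaseExecutor):
    """Stand-in only; the real class lives in nemo_curator."""


def run(self, kmeans_executor=None, pairwise_executor=None):
    """Mirrors the parameter order of the signature quoted in the review."""


def bound_parameter_names(*args, **kwargs):
    """Map each explicitly passed parameter of run() to its executor class name."""
    bound = inspect.signature(run).bind(None, *args, **kwargs)
    return {
        name: type(value).__name__
        for name, value in bound.arguments.items()
        if name != "self"
    }


# A single positional argument binds to the first parameter after self.
positional = bound_parameter_names(XennaExecutor())
# A keyword argument binds to the parameter that is named.
keyword = bound_parameter_names(pairwise_executor=XennaExecutor())

print(positional)  # {'kmeans_executor': 'XennaExecutor'}
print(keyword)     # {'pairwise_executor': 'XennaExecutor'}
```

The first result shows the XennaExecutor ending up in the kmeans_executor slot, which is the mismatch this comment describes.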


The duplicate identification stages (`IdentifyDuplicatesStage` or `SemanticDeduplicationWorkflow` with `eps` specified) write Parquet files containing duplicate clip IDs to the output directory (typically `output_path/duplicates/`). These files contain a single column `id` with the IDs of clips that should be removed.

**It is your responsibility to exclude these duplicate IDs when exporting or persisting your final dataset.** The removal process depends on how you want to persist and shard your data:

Incomplete section: The "Removing Duplicates" section ends with a colon but provides no follow-up content. The sentence reads: "The removal process depends on how you want to persist and shard your data:" but nothing follows.

This appears to be incomplete documentation that should explain how users can actually filter out the duplicate IDs when exporting their final dataset. Consider adding concrete examples or instructions for common export scenarios (e.g., using pandas/dask to filter parquet files, filtering during WebDataset creation, etc.).
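
One possible shape for such an example is sketched below. It assumes the duplicate IDs have already been read from the `id` column of the Parquet files with whatever reader the user prefers. The clip record layout (dictionaries with `id` and `path` keys) and the helper name `exclude_duplicates` are hypothetical, chosen only to keep the sketch self-contained.

```python
from typing import Hashable, Iterable, Iterator, Mapping


def exclude_duplicates(
    clips: Iterable[Mapping[str, object]],
    duplicate_ids: Iterable[Hashable],
) -> Iterator[Mapping[str, object]]:
    """Yield only the clips whose "id" is not listed as a duplicate."""
    to_remove = set(duplicate_ids)  # set membership keeps each check O(1)
    for clip in clips:
        if clip["id"] not in to_remove:
            yield clip


# Hypothetical clip records; in practice these come from the user's own export step.
clips = [
    {"id": "clip-001", "path": "clips/clip-001.mp4"},
    {"id": "clip-002", "path": "clips/clip-002.mp4"},
    {"id": "clip-003", "path": "clips/clip-003.mp4"},
]
# Hypothetical values standing in for the `id` column of the duplicates output.
duplicate_ids = ["clip-002"]

kept = list(exclude_duplicates(clips, duplicate_ids))
print([clip["id"] for clip in kept])  # ['clip-001', 'clip-003']
```

Because the filter is a generator, it can sit in front of any writer (sharded archives, Parquet, plain files) without holding the full dataset in memory.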

@greptile-apps
Contributor

greptile-apps bot commented Jan 13, 2026

Greptile Overview

Greptile Summary

Summary

This PR improves video curation documentation by expanding the semantic deduplication guide and updating various cross-references.

Key Changes:

  1. Semantic Deduplication (dedup.md) - Major expansion:

    • Added new SemanticDeduplicationWorkflow end-to-end example with comprehensive parameter documentation
    • Expanded "Individual Stages" example with detailed pipeline construction
    • Added two helpful tips: one for determining the eps parameter through exploration, and another for custom ranking with metadata columns
    • Completed previously truncated parameter tables for all stages (KMeansStage, PairwiseStage, IdentifyDuplicatesStage, SemanticDeduplicationWorkflow)
    • Changed title from "Duplicate Removal" to "Duplicate Identification" (more accurate, as these stages only identify duplicates)
    • Added new "Removing Duplicates" section clarifying user responsibility for actual removal
  2. Reference Link Updates:

    • Fixed reference in index.md from custom stage tutorial to execution backends documentation (more relevant to the context)
  3. Documentation Restructuring:

    • Reorganized "How it Works" section in load-data/index.md for improved readability
    • Reordered sentences in frame-extraction.md for better logical flow

Issue Found:

The new SemanticDeduplicationWorkflow example contains a critical bug where workflow.run(executor) passes the executor to the wrong parameter position, which will cause a runtime error. The workflow requires RayActorPoolExecutor for k-means but the example passes XennaExecutor as the first positional argument.

Confidence Score: 3/5

  • This PR is mostly safe but contains one critical documentation bug that will cause runtime errors for users following the example code
  • The PR makes valuable improvements to documentation structure and adds comprehensive new content for semantic deduplication. However, the main "Single Step Workflow" example in dedup.md has a critical bug where the executor is passed incorrectly to workflow.run(), which will cause a TypeError at runtime when users try to follow the example. The rest of the changes (reference updates, restructuring) are solid improvements with no issues. Score of 3 reflects that most changes are good, but the critical bug in the most prominent example needs to be fixed before merge.
  • docs/curate-video/process-data/dedup.md requires attention - the SemanticDeduplicationWorkflow example needs correction before users encounter runtime errors

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| docs/curate-video/index.md | 5/5 | Updated reference link from custom stage tutorial to execution backends documentation - correct and improves navigation |
| docs/curate-video/load-data/index.md | 5/5 | Restructured "How it Works" section for better readability, removed redundant subsections - improves documentation clarity |
| docs/curate-video/process-data/frame-extraction.md | 5/5 | Reordered sentences in "Before You Start" section for better logical flow - minor improvement |
| docs/curate-video/process-data/dedup.md | 2/5 | Major expansion with new SemanticDeduplicationWorkflow examples and parameter tables, but contains critical bug in workflow.run() usage that will cause runtime error |

Sequence Diagram

sequenceDiagram
    participant User
    participant SemanticDeduplicationWorkflow
    participant KMeansStage
    participant PairwiseStage
    participant IdentifyDuplicatesStage
    participant RayActorPoolExecutor
    participant XennaExecutor

    User->>SemanticDeduplicationWorkflow: Initialize with input_path, n_clusters, eps, etc.
    User->>SemanticDeduplicationWorkflow: run(kmeans_executor, pairwise_executor)
    
    Note over SemanticDeduplicationWorkflow: Setup directories<br/>(kmeans_results, pairwise_results, duplicates)
    
    SemanticDeduplicationWorkflow->>KMeansStage: Create stage with n_clusters, embedding_dim, etc.
    SemanticDeduplicationWorkflow->>RayActorPoolExecutor: execute(KMeansStage)
    RayActorPoolExecutor->>KMeansStage: Process embeddings
    KMeansStage-->>SemanticDeduplicationWorkflow: Clustered embeddings with centroid distances
    
    SemanticDeduplicationWorkflow->>PairwiseStage: Create stage with ranking_strategy, batch_size
    SemanticDeduplicationWorkflow->>XennaExecutor: execute(PairwiseStage)
    XennaExecutor->>PairwiseStage: Compute within-cluster similarity
    PairwiseStage-->>SemanticDeduplicationWorkflow: Pairwise similarity scores
    
    alt eps is specified
        SemanticDeduplicationWorkflow->>IdentifyDuplicatesStage: Create stage with eps threshold
        SemanticDeduplicationWorkflow->>XennaExecutor: execute(IdentifyDuplicatesStage)
        XennaExecutor->>IdentifyDuplicatesStage: Filter pairs where cosine_sim >= 1.0 - eps
        IdentifyDuplicatesStage-->>SemanticDeduplicationWorkflow: Parquet files with duplicate IDs
    end
    
    SemanticDeduplicationWorkflow-->>User: Results dict with timing, duplicate counts
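
The threshold step in the diagram (`cosine_sim >= 1.0 - eps`) can be illustrated with a small standalone sketch. This is not the stage's implementation: the function names are hypothetical, the embeddings are toy two-dimensional vectors, and ranking within a cluster (which item of a pair is kept) is not modeled.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_semantic_duplicate(a: Sequence[float], b: Sequence[float], eps: float) -> bool:
    """Apply the rule from the diagram: duplicate when cosine_sim >= 1.0 - eps."""
    return cosine_similarity(a, b) >= 1.0 - eps


anchor = [1.0, 0.0]
near = [1.0, 0.1]   # cosine similarity with anchor is about 0.995
far = [0.0, 1.0]    # orthogonal to anchor, cosine similarity 0.0

print(is_semantic_duplicate(anchor, near, eps=0.01))   # True: 0.995 >= 0.99
print(is_semantic_duplicate(anchor, near, eps=0.001))  # False: 0.995 < 0.999
print(is_semantic_duplicate(anchor, far, eps=0.1))     # False: 0.0 < 0.9
```

A larger eps lowers the similarity threshold and so flags more pairs as duplicates, which is why the threshold is worth exploring before a value is fixed.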

Contributor

@greptile-apps greptile-apps bot left a comment

1 file reviewed, 1 comment

Comment on lines +73 to +75
# Run with XennaExecutor (GPU-accelerated)
executor = XennaExecutor()
results = workflow.run(executor)

The example shows passing a single executor to workflow.run(), but the actual method signature is run(self, kmeans_executor: BaseExecutor | None = None, pairwise_executor: BaseExecutor | None = None).

When you pass executor as the first positional argument, it will be assigned to kmeans_executor, not pairwise_executor. However, the workflow explicitly requires kmeans_executor to be a RayActorPoolExecutor (see lines 343-345 of nemo_curator/stages/deduplication/semantic/workflow.py). Passing XennaExecutor() as kmeans_executor will cause a runtime error.

The correct usage should be:

Suggested change
- # Run with XennaExecutor (GPU-accelerated)
- executor = XennaExecutor()
- results = workflow.run(executor)
+ # Run with XennaExecutor (GPU-accelerated)
+ from nemo_curator.backends.experimental.ray_actor_pool import RayActorPoolExecutor
+ kmeans_executor = RayActorPoolExecutor()
+ pairwise_executor = XennaExecutor()
+ results = workflow.run(kmeans_executor, pairwise_executor)

Or, to use defaults (which automatically set kmeans to RayActorPoolExecutor and pairwise to XennaExecutor):

Suggested change
- # Run with XennaExecutor (GPU-accelerated)
- executor = XennaExecutor()
- results = workflow.run(executor)
+ # Run with default executors (RayActorPoolExecutor for kmeans, XennaExecutor for pairwise)
+ results = workflow.run()
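
The default behaviour described above can be pictured with a stand-in, shown here as a rough sketch rather than the code in workflow.py. The classes are placeholders, `resolve_executors` is a hypothetical helper, and the exception class is an assumption: the review comments quote the error message, and one of them names TypeError, but the class actually raised is not shown in this thread.

```python
class BaseExecutor:
    """Stand-in for the real executor base class (hypothetical)."""


class RayActorPoolExecutor(BaseExecutor):
    """Stand-in only; the real class lives in nemo_curator."""


class XennaExecutor(BaseExecutor):
    """Stand-in only; the real class lives in nemo_curator."""


def resolve_executors(kmeans_executor=None, pairwise_executor=None):
    """Fill in defaults for missing executors, then validate the K-means one."""
    if kmeans_executor is None:
        kmeans_executor = RayActorPoolExecutor()
    if pairwise_executor is None:
        pairwise_executor = XennaExecutor()
    if not isinstance(kmeans_executor, RayActorPoolExecutor):
        # Message text is quoted in the review; TypeError is an assumption.
        raise TypeError("kmeans_executor must be an instance of RayActorPoolExecutor.")
    return kmeans_executor, pairwise_executor


# No arguments: both defaults are filled in and validation passes.
kmeans, pairwise = resolve_executors()
print(type(kmeans).__name__, type(pairwise).__name__)

# The pattern from the original docs example: one positional XennaExecutor.
try:
    resolve_executors(XennaExecutor())
except TypeError as exc:
    print(f"rejected: {exc}")
```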

Signed-off-by: Ao Tang <[email protected]>
@suiyoubi
Contributor Author

/ok to test 05c5b8b

Contributor

@greptile-apps greptile-apps bot left a comment

Additional Comments (2)

  1. docs/curate-video/process-data/dedup.md, line 306-307 (link)

    style: This section ends abruptly. The sentence "The removal process depends on how you want to persist and shard your data:" suggests there should be follow-up content (examples showing how to filter duplicate IDs when exporting, using pandas/dask, WebDataset, etc.). Consider adding concrete examples of how to apply the duplicate filtering to common export scenarios.

  2. docs/curate-video/load-data/index.md, line 41-52 (link)

    style: The removed "Local Paths" and "Remote Paths" sections provided helpful guidance on configuring S3 credentials. Consider adding back a tip about credential configuration for cloud storage access, as this is critical for users working with S3, GCS, or other remote storage.

4 files reviewed, 3 comments


Comment on lines +72 to +74
# Run with XennaExecutor (GPU-accelerated)
executor = XennaExecutor()
results = workflow.run(executor)

syntax: The example passes executor as a single positional argument, but workflow.run() takes two parameters, kmeans_executor and pairwise_executor. When passed positionally, executor is bound to kmeans_executor, which must be a RayActorPoolExecutor (not a XennaExecutor). This will cause a runtime error.

Suggested change
- # Run with XennaExecutor (GPU-accelerated)
- executor = XennaExecutor()
- results = workflow.run(executor)
+ # Run with XennaExecutor for pairwise (GPU-accelerated)
+ executor = XennaExecutor()
+ results = workflow.run(pairwise_executor=executor)

Signed-off-by: Lawrence Lane <[email protected]>
Contributor

@greptile-apps greptile-apps bot left a comment

10 files reviewed, 2 comments

)
# Run with XennaExecutor (GPU-accelerated)
executor = XennaExecutor()
results = workflow.run(executor)

syntax: Incorrect API usage - the workflow.run() method signature is run(self, kmeans_executor: BaseExecutor | None = None, pairwise_executor: BaseExecutor | None = None). When passing executor as a single positional argument, it is assigned to kmeans_executor, but SemanticDeduplicationWorkflow requires kmeans_executor to be a RayActorPoolExecutor, not XennaExecutor(). This will cause a runtime error.

The correct approach is to pass both executors with named parameters:

Suggested change
- results = workflow.run(executor)
+ results = workflow.run(kmeans_executor=RayActorPoolExecutor(), pairwise_executor=executor)

Or rely on the defaults (kmeans defaults to RayActorPoolExecutor, pairwise defaults to XennaExecutor):

Suggested change
- results = workflow.run(executor)
+ results = workflow.run(pairwise_executor=executor)


The duplicate identification stages (`IdentifyDuplicatesStage` or `SemanticDeduplicationWorkflow` with `eps` specified) write Parquet files containing duplicate clip IDs to the output directory (typically `output_path/duplicates/`). These files contain a single column `id` with the IDs of clips that should be removed.

**It is your responsibility to exclude these duplicate IDs when exporting or persisting your final dataset.** The removal process depends on how you want to persist and shard your data:

logic: The "Removing Duplicates" section ends incomplete. The sentence states: "The removal process depends on how you want to persist and shard your data:" but provides no follow-up content. This leaves users without instructions on how to actually filter duplicate IDs from their final dataset. Consider adding concrete examples or instructions for common export scenarios (e.g., using pandas/dask to filter parquet files, filtering during WebDataset creation, etc.).

Contributor

@greptile-apps greptile-apps bot left a comment

10 files reviewed, 1 comment

Comment on lines +72 to +74
# Run with XennaExecutor (GPU-accelerated)
executor = XennaExecutor()
results = workflow.run(executor)

syntax: The workflow.run() method signature is run(self, kmeans_executor: BaseExecutor | None = None, pairwise_executor: BaseExecutor | None = None). When passing executor as a single positional argument, it is assigned to kmeans_executor, but the workflow requires kmeans_executor to be a RayActorPoolExecutor, not XennaExecutor(). This will cause a runtime error: "kmeans_executor must be an instance of RayActorPoolExecutor."

Suggested change
- # Run with XennaExecutor (GPU-accelerated)
- executor = XennaExecutor()
- results = workflow.run(executor)
+ # Run with RayActorPoolExecutor for kmeans and XennaExecutor for pairwise (GPU-accelerated)
+ from nemo_curator.backends.experimental.ray_actor_pool import RayActorPoolExecutor
+ kmeans_executor = RayActorPoolExecutor()
+ pairwise_executor = XennaExecutor()
+ results = workflow.run(kmeans_executor=kmeans_executor, pairwise_executor=pairwise_executor)

@lbliii lbliii mentioned this pull request Jan 13, 2026
3 tasks

4 participants