Video tutorial improve #1367

suiyoubi · 2026-01-13T14:18:08Z

Description

Usage

# Add snippet demonstrating usage

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

Signed-off-by: Praateek <[email protected]>

Signed-off-by: Ao Tang <[email protected]>

copy-pr-bot · 2026-01-13T14:18:11Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

suiyoubi · 2026-01-13T14:18:30Z

/ok to test c8f7aa9

greptile-apps · 2026-01-13T14:22:53Z

Greptile Summary

This PR improves video tutorial documentation with extensive enhancements to the semantic deduplication guide, better organization of video loading documentation, and corrections to broken reference links across multiple modalities.

Key Changes:

Substantially expands docs/curate-video/process-data/dedup.md with comprehensive examples showing both the simplified SemanticDeduplicationWorkflow and individual stage approach, detailed parameter tables, and practical guidance on determining the eps threshold
Reorganizes docs/curate-video/load-data/index.md with improved formatting and clearer separation of partitioning and reader stages
Corrects broken cross-reference links throughout the documentation (audio/text/image/video get-started guides and text deduplication index)

Critical Issue Found:

API Usage Error in dedup.md lines 72-74: The example demonstrates workflow.run(executor), which will fail at runtime. The method signature takes two optional parameters, kmeans_executor and pairwise_executor, and kmeans_executor must be a RayActorPoolExecutor. Passing a single XennaExecutor() positionally assigns it to kmeans_executor, causing a validation error.

Confidence Score: 2/5

This PR has a critical API usage error that will cause runtime failures for users following the documentation example.
The main documentation file (dedup.md) contains a critical error on lines 72-74, where the workflow.run() method is called with an incorrect API signature. The example passes a single XennaExecutor() positionally, which will be assigned to the kmeans_executor parameter, but the actual implementation requires kmeans_executor to be a RayActorPoolExecutor. This will trigger a runtime validation error when users execute the provided code. The remaining 9 files in the PR are minor reference link corrections with confidence score 5/5. The low overall score is driven by this blocking issue in the primary documentation file.
docs/curate-video/process-data/dedup.md (critical API usage error on lines 72-74)

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/curate-video/process-data/dedup.md | Extensively improved documentation with detailed workflow examples and API reference. However, critical API usage error on line 74: `workflow.run(executor)` incorrectly passes a single `XennaExecutor()` which will fail runtime validation. Requires proper named parameter usage with correct executor types. |
| docs/curate-video/load-data/index.md | Good documentation restructuring with improved organization and clarity. Reorganizes content into clearer sections with better formatting and list structure for readers. |
| docs/curate-text/process-data/deduplication/index.md | Corrects broken reference links to semantic deduplication documentation from `text-process-data-dedup-semdedup` to `text-process-data-format-sem-dedup`. |
| docs/curate-audio/tutorials/beginner.md | Corrects broken documentation reference links from file paths to proper reference labels (`admin-installation` and `gs-audio`). |
| docs/get-started/audio.md | Corrects reference label from `reference-infrastructure-container-environments-main` to `reference-infrastructure-container-environments`. |
| docs/get-started/image.md | Corrects reference label from `reference-infrastructure-container-environments-main` to `reference-infrastructure-container-environments`. |
| docs/get-started/text.md | Corrects reference label from `reference-infrastructure-container-environments-main` to `reference-infrastructure-container-environments`. |
| docs/get-started/video.md | Corrects reference label from `reference-infrastructure-container-environments-main` to `reference-infrastructure-container-environments`. |

Sequence Diagram

sequenceDiagram
    participant User
    participant Docs as Documentation
    participant Workflow as SemanticDeduplicationWorkflow
    participant KMeans as KMeansStage
    participant Pairwise as PairwiseStage
    participant Identify as IdentifyDuplicatesStage

    User->>Docs: Reads dedup.md for API usage
    Docs->>User: Shows workflow.run(executor) example
    User->>Workflow: Calls workflow.run(XennaExecutor())
    Note over Workflow: ❌ ERROR: XennaExecutor assigned to<br/>kmeans_executor parameter<br/>Expects RayActorPoolExecutor
    
    Note over User,Workflow: CORRECT USAGE:
    User->>Workflow: Calls workflow.run(kmeans_executor=..., pairwise_executor=...)
    Workflow->>KMeans: Passes RayActorPoolExecutor
    Workflow->>Pairwise: Passes XennaExecutor
    KMeans->>Pairwise: K-means output
    Pairwise->>Identify: Similarity scores
    Identify->>User: Duplicate IDs

greptile-apps

2 files reviewed, 2 comments

greptile-apps · 2026-01-13T14:22:57Z

docs/curate-video/process-data/dedup.md

+# Run with XennaExecutor (GPU-accelerated)
+executor = XennaExecutor()
+results = workflow.run(executor)


Incorrect API usage: The workflow.run() method signature is run(self, kmeans_executor: BaseExecutor | None = None, pairwise_executor: BaseExecutor | None = None), but this example passes a single executor parameter.

When you call workflow.run(executor), the executor is interpreted as kmeans_executor. However, the SemanticDeduplicationWorkflow requires kmeans_executor to be a RayActorPoolExecutor (see lines 343-345 of the workflow.py implementation). Passing XennaExecutor() will cause a runtime error: "kmeans_executor must be an instance of RayActorPoolExecutor."

Fix: Update the example to correctly specify both executors or rely on the defaults:

Suggested change

-# Run with XennaExecutor (GPU-accelerated)
-executor = XennaExecutor()
-results = workflow.run(executor)
+# Run the workflow (uses default executors: RayActorPoolExecutor for K-means, XennaExecutor for pairwise)
+results = workflow.run()
+# Or explicitly specify executors:
+# from nemo_curator.backends.experimental.ray_actor_pool import RayActorPoolExecutor
+# results = workflow.run(kmeans_executor=RayActorPoolExecutor(), pairwise_executor=XennaExecutor())

Alternatively, if you want to keep showing an explicit executor call, the documentation should clarify the correct parameter names and requirements.
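
To make the failure mode concrete, here is a minimal, self-contained sketch of how a single positional argument binds to the first parameter. It does not import the real library: `WorkflowSketch` and the three executor classes are hypothetical stand-ins that mimic only the `run()` signature and the type check described above, and the use of `ValueError` is an assumption (the comment quotes the message, not the exception type).

```python
from __future__ import annotations


class BaseExecutor:
    """Hypothetical stand-in for the library's executor base class."""


class RayActorPoolExecutor(BaseExecutor):
    """Hypothetical stand-in for the executor that K-means requires."""


class XennaExecutor(BaseExecutor):
    """Hypothetical stand-in for the default pairwise executor."""


class WorkflowSketch:
    """Mimics only the run() signature and the check described in the review."""

    def run(
        self,
        kmeans_executor: BaseExecutor | None = None,
        pairwise_executor: BaseExecutor | None = None,
    ) -> dict[str, str]:
        # Fall back to the defaults the review describes.
        kmeans_executor = kmeans_executor or RayActorPoolExecutor()
        pairwise_executor = pairwise_executor or XennaExecutor()
        if not isinstance(kmeans_executor, RayActorPoolExecutor):
            # Exception type is an assumption; the message is quoted from the review.
            raise ValueError("kmeans_executor must be an instance of RayActorPoolExecutor.")
        return {
            "kmeans": type(kmeans_executor).__name__,
            "pairwise": type(pairwise_executor).__name__,
        }


workflow = WorkflowSketch()

try:
    # A lone positional argument binds to kmeans_executor, not pairwise_executor.
    workflow.run(XennaExecutor())
except ValueError as err:
    print(f"rejected: {err}")

# Both of these leave K-means on its default executor.
print(workflow.run())
print(workflow.run(pairwise_executor=XennaExecutor()))
```

Passing the executor by keyword avoids depending on parameter order, which is why the keyword form is the safer thing to show in the docs.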

greptile-apps · 2026-01-13T14:22:58Z

docs/curate-video/process-data/dedup.md

+
+The duplicate identification stages (`IdentifyDuplicatesStage` or `SemanticDeduplicationWorkflow` with `eps` specified) write Parquet files containing duplicate clip IDs to the output directory (typically `output_path/duplicates/`). These files contain a single column `id` with the IDs of clips that should be removed.
+
+**It is your responsibility to exclude these duplicate IDs when exporting or persisting your final dataset.** The removal process depends on how you want to persist and shard your data:


Incomplete section: The "Removing Duplicates" section ends with a colon but provides no follow-up content. The sentence reads: "The removal process depends on how you want to persist and shard your data:" but nothing follows.

This appears to be incomplete documentation that should explain how users can actually filter out the duplicate IDs when exporting their final dataset. Consider adding concrete examples or instructions for common export scenarios (e.g., using pandas/dask to filter parquet files, filtering during WebDataset creation, etc.).
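
One possible shape for that follow-up content is sketched below. It assumes only what the quoted docs state, namely that the duplicates directory holds Parquet files with a single `id` column. The function names, the clip record layout, and the use of pyarrow are illustrative assumptions, not part of the documented API.

```python
from collections.abc import Iterable, Iterator


def load_duplicate_ids(duplicates_dir: str) -> set:
    """Read the `id` column from the Parquet files under duplicates_dir.

    Assumes pyarrow is installed; it is imported lazily so the filter
    below can be used (and tested) without it.
    """
    import pyarrow.parquet as pq

    table = pq.read_table(duplicates_dir, columns=["id"])
    return set(table.column("id").to_pylist())


def keep_unique_clips(clips: Iterable[dict], duplicate_ids: set) -> Iterator[dict]:
    """Yield only the clips whose `id` was not flagged as a duplicate."""
    for clip in clips:
        if clip["id"] not in duplicate_ids:
            yield clip


# Hypothetical clip records; in practice these come from your own export step.
clips = [
    {"id": "clip-a", "path": "a.mp4"},
    {"id": "clip-b", "path": "b.mp4"},
    {"id": "clip-c", "path": "c.mp4"},
]

# In a real run this set would come from load_duplicate_ids("output_path/duplicates/").
flagged = {"clip-b"}

kept = list(keep_unique_clips(clips, flagged))
print([clip["id"] for clip in kept])
```

Holding the IDs in a set keeps each membership check constant-time. For duplicate lists too large for memory, a join in pandas or dask would be the more natural choice.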

greptile-apps · 2026-01-13T14:26:16Z

Greptile Overview

Greptile Summary

Summary

This PR improves video curation documentation by expanding the semantic deduplication guide and updating various cross-references.

Key Changes:

Semantic Deduplication (dedup.md) - Major expansion:

- Added new SemanticDeduplicationWorkflow end-to-end example with comprehensive parameter documentation
- Expanded "Individual Stages" example with detailed pipeline construction
- Added two helpful tips: one for determining the eps parameter through exploration, and another for custom ranking with metadata columns
- Completed previously truncated parameter tables for all stages (KMeansStage, PairwiseStage, IdentifyDuplicatesStage, SemanticDeduplicationWorkflow)
- Changed title from "Duplicate Removal" to "Duplicate Identification" (more accurate, as these stages only identify duplicates)
- Added new "Removing Duplicates" section clarifying user responsibility for actual removal

Reference Link Updates:

- Fixed reference in index.md from custom stage tutorial to execution backends documentation (more relevant to the context)

Documentation Restructuring:

- Reorganized "How it Works" section in load-data/index.md for improved readability
- Reordered sentences in frame-extraction.md for better logical flow

Issue Found:

The new SemanticDeduplicationWorkflow example contains a critical bug where workflow.run(executor) passes the executor to the wrong parameter position, which will cause a runtime error. The workflow requires RayActorPoolExecutor for k-means but the example passes XennaExecutor as the first positional argument.

Confidence Score: 3/5

This PR is mostly safe but contains one critical documentation bug that will cause runtime errors for users following the example code
The PR makes valuable improvements to documentation structure and adds comprehensive new content for semantic deduplication. However, the main "Single Step Workflow" example in dedup.md has a critical bug where the executor is passed incorrectly to workflow.run(), which will cause a runtime error when users try to follow the example. The rest of the changes (reference updates, restructuring) are solid improvements with no issues. The score of 3 reflects that most changes are good, but the critical bug in the most prominent example needs to be fixed before merge.
docs/curate-video/process-data/dedup.md requires attention - the SemanticDeduplicationWorkflow example needs correction before users encounter runtime errors

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| docs/curate-video/index.md | 5/5 | Updated reference link from custom stage tutorial to execution backends documentation - correct and improves navigation |
| docs/curate-video/load-data/index.md | 5/5 | Restructured "How it Works" section for better readability, removed redundant subsections - improves documentation clarity |
| docs/curate-video/process-data/frame-extraction.md | 5/5 | Reordered sentences in "Before You Start" section for better logical flow - minor improvement |
| docs/curate-video/process-data/dedup.md | 2/5 | Major expansion with new SemanticDeduplicationWorkflow examples and parameter tables, but contains critical bug in workflow.run() usage that will cause runtime error |

Sequence Diagram

sequenceDiagram
    participant User
    participant SemanticDeduplicationWorkflow
    participant KMeansStage
    participant PairwiseStage
    participant IdentifyDuplicatesStage
    participant RayActorPoolExecutor
    participant XennaExecutor

    User->>SemanticDeduplicationWorkflow: Initialize with input_path, n_clusters, eps, etc.
    User->>SemanticDeduplicationWorkflow: run(kmeans_executor, pairwise_executor)
    
    Note over SemanticDeduplicationWorkflow: Setup directories<br/>(kmeans_results, pairwise_results, duplicates)
    
    SemanticDeduplicationWorkflow->>KMeansStage: Create stage with n_clusters, embedding_dim, etc.
    SemanticDeduplicationWorkflow->>RayActorPoolExecutor: execute(KMeansStage)
    RayActorPoolExecutor->>KMeansStage: Process embeddings
    KMeansStage-->>SemanticDeduplicationWorkflow: Clustered embeddings with centroid distances
    
    SemanticDeduplicationWorkflow->>PairwiseStage: Create stage with ranking_strategy, batch_size
    SemanticDeduplicationWorkflow->>XennaExecutor: execute(PairwiseStage)
    XennaExecutor->>PairwiseStage: Compute within-cluster similarity
    PairwiseStage-->>SemanticDeduplicationWorkflow: Pairwise similarity scores
    
    alt eps is specified
        SemanticDeduplicationWorkflow->>IdentifyDuplicatesStage: Create stage with eps threshold
        SemanticDeduplicationWorkflow->>XennaExecutor: execute(IdentifyDuplicatesStage)
        XennaExecutor->>IdentifyDuplicatesStage: Filter pairs where cosine_sim >= 1.0 - eps
        IdentifyDuplicatesStage-->>SemanticDeduplicationWorkflow: Parquet files with duplicate IDs
    end
    
    SemanticDeduplicationWorkflow-->>User: Results dict with timing, duplicate counts
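
The threshold in the diagram (`cosine_sim >= 1.0 - eps`) can be explored with a small sweep. The sketch below is illustrative only: it assumes each clip has already been reduced to one similarity score, and `count_flagged` and the sample scores are hypothetical, not library output.

```python
def count_flagged(cosine_sims: list[float], eps: float) -> int:
    """Count scores that meet the duplicate threshold cosine_sim >= 1.0 - eps."""
    threshold = 1.0 - eps
    return sum(1 for sim in cosine_sims if sim >= threshold)


# Hypothetical per-clip similarity scores.
sims = [0.995, 0.95, 0.85, 0.40]

# A larger eps lowers the threshold, so more clips are flagged as duplicates.
for eps in (0.02, 0.1, 0.3):
    print(f"eps={eps}: {count_flagged(sims, eps)} of {len(sims)} flagged")
```

Sweeping `eps` this way on a sample of real scores shows how many clips each candidate value would remove before you commit to a threshold.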

greptile-apps

1 file reviewed, 1 comment

greptile-apps · 2026-01-13T14:26:19Z

docs/curate-video/process-data/dedup.md

+# Run with XennaExecutor (GPU-accelerated)
+executor = XennaExecutor()
+results = workflow.run(executor)


The example shows passing a single executor to workflow.run(), but the actual method signature is run(self, kmeans_executor: BaseExecutor | None = None, pairwise_executor: BaseExecutor | None = None).

When you pass executor as the first positional argument, it will be assigned to kmeans_executor, not pairwise_executor. However, the workflow explicitly requires kmeans_executor to be a RayActorPoolExecutor (see lines 343-345 of nemo_curator/stages/deduplication/semantic/workflow.py). Passing XennaExecutor() as kmeans_executor will cause a runtime error.

The correct usage should be:

Suggested change

-# Run with XennaExecutor (GPU-accelerated)
-executor = XennaExecutor()
-results = workflow.run(executor)
+# Run with XennaExecutor (GPU-accelerated)
+from nemo_curator.backends.experimental.ray_actor_pool import RayActorPoolExecutor
+kmeans_executor = RayActorPoolExecutor()
+pairwise_executor = XennaExecutor()
+results = workflow.run(kmeans_executor, pairwise_executor)

Or, to use defaults (which automatically set kmeans to RayActorPoolExecutor and pairwise to XennaExecutor):

Suggested change

-# Run with XennaExecutor (GPU-accelerated)
-executor = XennaExecutor()
-results = workflow.run(executor)
+# Run with default executors (RayActorPoolExecutor for kmeans, XennaExecutor for pairwise)
+results = workflow.run()

Signed-off-by: Ao Tang <[email protected]>

suiyoubi · 2026-01-13T20:04:51Z

/ok to test 05c5b8b

greptile-apps

Additional Comments (2)

docs/curate-video/process-data/dedup.md, lines 306-307

style: This section ends abruptly. The sentence "The removal process depends on how you want to persist and shard your data:" suggests there should be follow-up content (examples showing how to filter duplicate IDs when exporting, using pandas/dask, WebDataset, etc.). Consider adding concrete examples of how to apply the duplicate filtering to common export scenarios.

docs/curate-video/load-data/index.md, lines 41-52

style: The removed "Local Paths" and "Remote Paths" sections provided helpful guidance on configuring S3 credentials. Consider adding back a tip about credential configuration for cloud storage access, as this is critical for users working with S3, GCS, or other remote storage.

4 files reviewed, 3 comments

greptile-apps · 2026-01-13T20:08:31Z

docs/curate-video/process-data/dedup.md

+# Run with XennaExecutor (GPU-accelerated)
+executor = XennaExecutor()
+results = workflow.run(executor)


syntax: The example passes executor as a single positional argument, but workflow.run() expects named parameters: kmeans_executor and pairwise_executor. When passed positionally, executor is assigned to kmeans_executor, which must be a RayActorPoolExecutor (not XennaExecutor). This will cause a runtime error.

Suggested change

-# Run with XennaExecutor (GPU-accelerated)
-executor = XennaExecutor()
-results = workflow.run(executor)
+# Run with XennaExecutor for pairwise (GPU-accelerated)
+executor = XennaExecutor()
+results = workflow.run(pairwise_executor=executor)

Signed-off-by: Lawrence Lane <[email protected]>

greptile-apps

10 files reviewed, 2 comments

greptile-apps · 2026-01-13T20:31:54Z

docs/curate-video/process-data/dedup.md

-)
+# Run with XennaExecutor (GPU-accelerated)
+executor = XennaExecutor()
+results = workflow.run(executor)


syntax: Incorrect API usage - the workflow.run() method signature is run(self, kmeans_executor: BaseExecutor | None = None, pairwise_executor: BaseExecutor | None = None). When passing executor as a single positional argument, it is assigned to kmeans_executor, but SemanticDeduplicationWorkflow requires kmeans_executor to be a RayActorPoolExecutor, not XennaExecutor(). This will cause a runtime error.

The correct approach is to pass both executors with named parameters:

Suggested change

-results = workflow.run(executor)
+results = workflow.run(kmeans_executor=RayActorPoolExecutor(), pairwise_executor=executor)

Or rely on the defaults (kmeans defaults to RayActorPoolExecutor, pairwise defaults to XennaExecutor):

Suggested change

-results = workflow.run(executor)
+results = workflow.run(pairwise_executor=executor)

greptile-apps · 2026-01-13T20:31:54Z

docs/curate-video/process-data/dedup.md

+
+The duplicate identification stages (`IdentifyDuplicatesStage` or `SemanticDeduplicationWorkflow` with `eps` specified) write Parquet files containing duplicate clip IDs to the output directory (typically `output_path/duplicates/`). These files contain a single column `id` with the IDs of clips that should be removed.
+
+**It is your responsibility to exclude these duplicate IDs when exporting or persisting your final dataset.** The removal process depends on how you want to persist and shard your data:


logic: The "Removing Duplicates" section ends incomplete. The sentence states: "The removal process depends on how you want to persist and shard your data:" but provides no follow-up content. This leaves users without instructions on how to actually filter duplicate IDs from their final dataset. Consider adding concrete examples or instructions for common export scenarios (e.g., using pandas/dask to filter parquet files, filtering during WebDataset creation, etc.).

greptile-apps

10 files reviewed, 1 comment

greptile-apps · 2026-01-13T20:35:11Z

docs/curate-video/process-data/dedup.md

+# Run with XennaExecutor (GPU-accelerated)
+executor = XennaExecutor()
+results = workflow.run(executor)


syntax: The workflow.run() method signature is run(self, kmeans_executor: BaseExecutor | None = None, pairwise_executor: BaseExecutor | None = None). When passing executor as a single positional argument, it is assigned to kmeans_executor, but the workflow requires kmeans_executor to be a RayActorPoolExecutor, not XennaExecutor(). This will cause a runtime error: "kmeans_executor must be an instance of RayActorPoolExecutor."

Suggested change

-# Run with XennaExecutor (GPU-accelerated)
-executor = XennaExecutor()
-results = workflow.run(executor)
+# Run with RayActorPoolExecutor for kmeans and XennaExecutor for pairwise (GPU-accelerated)
+from nemo_curator.backends.experimental.ray_actor_pool import RayActorPoolExecutor
+kmeans_executor = RayActorPoolExecutor()
+pairwise_executor = XennaExecutor()
+results = workflow.run(kmeans_executor=kmeans_executor, pairwise_executor=pairwise_executor)

praateekmahajan and others added 3 commits November 19, 2025 12:57

..

ef7c3c1

Signed-off-by: Praateek <[email protected]>

Merge branch 'main' into praateek/video-docs-update

7c687e8

fix closing directives

dbf6442

Signed-off-by: Ao Tang <[email protected]>

Merge branch 'main' into pr-1248

c8f7aa9

greptile-apps bot reviewed Jan 13, 2026

fixing directives

05c5b8b

Signed-off-by: Ao Tang <[email protected]>

copy-pr-bot bot temporarily deployed to test January 13, 2026 20:05 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci January 13, 2026 20:05 Inactive

greptile-apps bot reviewed Jan 13, 2026

build fixes

05b5e6c

Signed-off-by: Lawrence Lane <[email protected]>

greptile-apps bot reviewed Jan 13, 2026

Merge branch 'main' into pr-1248

0c5947d

greptile-apps bot reviewed Jan 13, 2026

lbliii mentioned this pull request Jan 13, 2026

Video Docs Update - Pass 1 #1248

Draft

3 tasks

4 participants
