Video tutorial improve #1367
Conversation
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Ao Tang <[email protected]>
/ok to test c8f7aa9
Greptile Summary

This PR improves video tutorial documentation with extensive enhancements to the semantic deduplication guide, better organization of video loading documentation, and corrections to broken reference links across multiple modalities.

Critical Issue Found: the updated dedup.md example calls `workflow.run(executor)` with an `XennaExecutor`, but that positional argument is assigned to `kmeans_executor`, which must be a `RayActorPoolExecutor` (see the review comments below).

Confidence Score: 2/5

Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Docs as Documentation
    participant Workflow as SemanticDeduplicationWorkflow
    participant KMeans as KMeansStage
    participant Pairwise as PairwiseStage
    participant Identify as IdentifyDuplicatesStage
    User->>Docs: Reads dedup.md for API usage
    Docs->>User: Shows workflow.run(executor) example
    User->>Workflow: Calls workflow.run(XennaExecutor())
    Note over Workflow: ❌ ERROR: XennaExecutor assigned to<br/>kmeans_executor parameter<br/>Expects RayActorPoolExecutor
    Note over User,Workflow: CORRECT USAGE:
    User->>Workflow: Calls workflow.run(kmeans_executor=..., pairwise_executor=...)
    Workflow->>KMeans: Passes RayActorPoolExecutor
    Workflow->>Pairwise: Passes XennaExecutor
    KMeans->>Pairwise: K-means output
    Pairwise->>Identify: Similarity scores
    Identify->>User: Duplicate IDs
```
2 files reviewed, 2 comments
```python
# Run with XennaExecutor (GPU-accelerated)
executor = XennaExecutor()
results = workflow.run(executor)
```
Incorrect API usage: The workflow.run() method signature is run(self, kmeans_executor: BaseExecutor | None = None, pairwise_executor: BaseExecutor | None = None), but this example passes a single executor parameter.
When you call workflow.run(executor), the executor is interpreted as kmeans_executor. However, the SemanticDeduplicationWorkflow requires kmeans_executor to be a RayActorPoolExecutor (see line 343-345 of the workflow.py implementation). Passing XennaExecutor() will cause a runtime error: "kmeans_executor must be an instance of RayActorPoolExecutor."
Fix: Update the example to correctly specify both executors or rely on the defaults:
```diff
- # Run with XennaExecutor (GPU-accelerated)
- executor = XennaExecutor()
- results = workflow.run(executor)
+ # Run the workflow (uses default executors: RayActorPoolExecutor for K-means, XennaExecutor for pairwise)
+ results = workflow.run()
+ # Or explicitly specify executors:
+ # from nemo_curator.backends.experimental.ray_actor_pool import RayActorPoolExecutor
+ # results = workflow.run(kmeans_executor=RayActorPoolExecutor(), pairwise_executor=XennaExecutor())
```
Alternatively, if you want to keep showing an explicit executor call, the documentation should clarify the correct parameter names and requirements.
> The duplicate identification stages (`IdentifyDuplicatesStage` or `SemanticDeduplicationWorkflow` with `eps` specified) write Parquet files containing duplicate clip IDs to the output directory (typically `output_path/duplicates/`). These files contain a single column `id` with the IDs of clips that should be removed.
>
> **It is your responsibility to exclude these duplicate IDs when exporting or persisting your final dataset.** The removal process depends on how you want to persist and shard your data:
Incomplete section: The "Removing Duplicates" section ends with a colon but provides no follow-up content. The sentence reads: "The removal process depends on how you want to persist and shard your data:" but nothing follows.
This appears to be incomplete documentation that should explain how users can actually filter out the duplicate IDs when exporting their final dataset. Consider adding concrete examples or instructions for common export scenarios (e.g., using pandas/dask to filter parquet files, filtering during WebDataset creation, etc.).
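To make that suggestion concrete, here is a minimal pandas sketch of such filtering, not part of the current docs: the `duplicates/` directory and the single `id` column follow the quoted documentation, while the metadata and output file names are placeholder assumptions.

```python
# Hypothetical sketch: remove duplicate clip IDs from a metadata table with pandas.
# The duplicates/ layout and "id" column follow the docs quoted above; the metadata
# and output paths are placeholder assumptions for illustration.
from pathlib import Path

import pandas as pd

output_path = Path("output_path")  # wherever the dedup workflow wrote its results

# Gather duplicate IDs written by the duplicate-identification stage.
duplicate_ids = pd.concat(
    pd.read_parquet(p, columns=["id"]) for p in (output_path / "duplicates").glob("*.parquet")
)["id"]

# Load the clip metadata and keep only clips whose ID is not flagged as a duplicate.
clips = pd.read_parquet("clips_metadata.parquet")  # assumed metadata file
deduped = clips[~clips["id"].isin(set(duplicate_ids))]

deduped.to_parquet("clips_metadata_deduped.parquet", index=False)
```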
Greptile Summary

This PR improves video curation documentation by expanding the semantic deduplication guide and updating various cross-references.
Issue Found: the new dedup.md example passes a single `XennaExecutor` to `workflow.run()`, which is assigned to `kmeans_executor` and will fail because that parameter must be a `RayActorPoolExecutor` (see the comment below).

Confidence Score: 3/5
Important Files Changed

File Analysis
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant SemanticDeduplicationWorkflow
    participant KMeansStage
    participant PairwiseStage
    participant IdentifyDuplicatesStage
    participant RayActorPoolExecutor
    participant XennaExecutor
    User->>SemanticDeduplicationWorkflow: Initialize with input_path, n_clusters, eps, etc.
    User->>SemanticDeduplicationWorkflow: run(kmeans_executor, pairwise_executor)
    Note over SemanticDeduplicationWorkflow: Setup directories<br/>(kmeans_results, pairwise_results, duplicates)
    SemanticDeduplicationWorkflow->>KMeansStage: Create stage with n_clusters, embedding_dim, etc.
    SemanticDeduplicationWorkflow->>RayActorPoolExecutor: execute(KMeansStage)
    RayActorPoolExecutor->>KMeansStage: Process embeddings
    KMeansStage-->>SemanticDeduplicationWorkflow: Clustered embeddings with centroid distances
    SemanticDeduplicationWorkflow->>PairwiseStage: Create stage with ranking_strategy, batch_size
    SemanticDeduplicationWorkflow->>XennaExecutor: execute(PairwiseStage)
    XennaExecutor->>PairwiseStage: Compute within-cluster similarity
    PairwiseStage-->>SemanticDeduplicationWorkflow: Pairwise similarity scores
    alt eps is specified
        SemanticDeduplicationWorkflow->>IdentifyDuplicatesStage: Create stage with eps threshold
        SemanticDeduplicationWorkflow->>XennaExecutor: execute(IdentifyDuplicatesStage)
        XennaExecutor->>IdentifyDuplicatesStage: Filter pairs where cosine_sim >= 1.0 - eps
        IdentifyDuplicatesStage-->>SemanticDeduplicationWorkflow: Parquet files with duplicate IDs
    end
    SemanticDeduplicationWorkflow-->>User: Results dict with timing, duplicate counts
```
1 file reviewed, 1 comment
```python
# Run with XennaExecutor (GPU-accelerated)
executor = XennaExecutor()
results = workflow.run(executor)
```
The example shows passing a single executor to workflow.run(), but the actual method signature is run(self, kmeans_executor: BaseExecutor | None = None, pairwise_executor: BaseExecutor | None = None).
When you pass executor as the first positional argument, it will be assigned to kmeans_executor, not pairwise_executor. However, the workflow explicitly requires kmeans_executor to be a RayActorPoolExecutor (see lines 343-345 of nemo_curator/stages/deduplication/semantic/workflow.py). Passing XennaExecutor() as kmeans_executor will cause a runtime error.
The correct usage should be:
```diff
- # Run with XennaExecutor (GPU-accelerated)
- executor = XennaExecutor()
- results = workflow.run(executor)
+ # Run with XennaExecutor (GPU-accelerated)
+ from nemo_curator.backends.experimental.ray_actor_pool import RayActorPoolExecutor
+ kmeans_executor = RayActorPoolExecutor()
+ pairwise_executor = XennaExecutor()
+ results = workflow.run(kmeans_executor, pairwise_executor)
```
Or, to use defaults (which automatically set kmeans to RayActorPoolExecutor and pairwise to XennaExecutor):
```diff
- # Run with XennaExecutor (GPU-accelerated)
- executor = XennaExecutor()
- results = workflow.run(executor)
+ # Run with default executors (RayActorPoolExecutor for kmeans, XennaExecutor for pairwise)
+ results = workflow.run()
```
Signed-off-by: Ao Tang <[email protected]>
/ok to test 05c5b8b
Additional Comments (2)
- docs/curate-video/process-data/dedup.md, lines 306-307 (link), style: This section ends abruptly. The sentence "The removal process depends on how you want to persist and shard your data:" suggests there should be follow-up content (examples showing how to filter duplicate IDs when exporting, using pandas/dask, WebDataset, etc.). Consider adding concrete examples of how to apply the duplicate filtering to common export scenarios.
- docs/curate-video/load-data/index.md, lines 41-52 (link), style: The removed "Local Paths" and "Remote Paths" sections provided helpful guidance on configuring S3 credentials. Consider adding back a tip about credential configuration for cloud storage access, as this is critical for users working with S3, GCS, or other remote storage.
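If such a tip were restored, it might look like the sketch below. It relies only on the standard AWS SDK environment variables; the variable values are placeholders, and nothing here is specific to the NeMo Curator API.

```python
# Hypothetical example: make S3 credentials available before running the pipeline.
# These are the standard AWS SDK environment variables; adjust for your provider.
import os

os.environ["AWS_ACCESS_KEY_ID"] = "<your-access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<your-secret-access-key>"
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

# Alternatively, run `aws configure` once so credentials live in ~/.aws/credentials
# and any S3-capable client picks them up automatically.
```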
4 files reviewed, 3 comments
```python
# Run with XennaExecutor (GPU-accelerated)
executor = XennaExecutor()
results = workflow.run(executor)
```
syntax: The example passes executor as a single positional argument, but workflow.run() expects named parameters: kmeans_executor and pairwise_executor. When passed positionally, executor is assigned to kmeans_executor, which must be a RayActorPoolExecutor (not XennaExecutor). This will cause a runtime error.
```diff
- # Run with XennaExecutor (GPU-accelerated)
- executor = XennaExecutor()
- results = workflow.run(executor)
+ # Run with XennaExecutor for pairwise (GPU-accelerated)
+ executor = XennaExecutor()
+ results = workflow.run(pairwise_executor=executor)
```
Signed-off-by: Lawrence Lane <[email protected]>
10 files reviewed, 2 comments
```python
)
# Run with XennaExecutor (GPU-accelerated)
executor = XennaExecutor()
results = workflow.run(executor)
```
syntax: Incorrect API usage - the workflow.run() method signature is run(self, kmeans_executor: BaseExecutor | None = None, pairwise_executor: BaseExecutor | None = None). When passing executor as a single positional argument, it is assigned to kmeans_executor, but SemanticDeduplicationWorkflow requires kmeans_executor to be a RayActorPoolExecutor, not XennaExecutor(). This will cause a runtime error.
The correct approach is to pass both executors with named parameters:
```diff
- results = workflow.run(executor)
+ results = workflow.run(kmeans_executor=RayActorPoolExecutor(), pairwise_executor=executor)
```
Or rely on the defaults (kmeans defaults to RayActorPoolExecutor, pairwise defaults to XennaExecutor):
```diff
- results = workflow.run(executor)
+ results = workflow.run(pairwise_executor=executor)
```
> The duplicate identification stages (`IdentifyDuplicatesStage` or `SemanticDeduplicationWorkflow` with `eps` specified) write Parquet files containing duplicate clip IDs to the output directory (typically `output_path/duplicates/`). These files contain a single column `id` with the IDs of clips that should be removed.
>
> **It is your responsibility to exclude these duplicate IDs when exporting or persisting your final dataset.** The removal process depends on how you want to persist and shard your data:
logic: The "Removing Duplicates" section ends incomplete. The sentence states: "The removal process depends on how you want to persist and shard your data:" but provides no follow-up content. This leaves users without instructions on how to actually filter duplicate IDs from their final dataset. Consider adding concrete examples or instructions for common export scenarios (e.g., using pandas/dask to filter parquet files, filtering during WebDataset creation, etc.).
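For the dask variant mentioned in this comment, a hedged sketch could look like the following; only the `duplicates/` directory and the `id` column come from the docs quoted above, and all other paths are placeholder assumptions.

```python
# Hypothetical sketch: filter duplicate clip IDs with dask before writing final shards.
# Only the duplicates/ directory and the "id" column are taken from the docs quoted
# above; the metadata and output paths are placeholders.
import dask.dataframe as dd

# Duplicate IDs produced by the workflow (single "id" column per Parquet file).
duplicate_ids = dd.read_parquet("output_path/duplicates/", columns=["id"])["id"].compute().tolist()

# Clip metadata to be exported, minus anything flagged as a duplicate.
clips = dd.read_parquet("output_path/clips_metadata/")
deduped = clips[~clips["id"].isin(duplicate_ids)]

deduped.to_parquet("output_path/clips_metadata_deduped/", write_index=False)
```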
10 files reviewed, 1 comment
```python
# Run with XennaExecutor (GPU-accelerated)
executor = XennaExecutor()
results = workflow.run(executor)
```
syntax: The workflow.run() method signature is run(self, kmeans_executor: BaseExecutor | None = None, pairwise_executor: BaseExecutor | None = None). When passing executor as a single positional argument, it is assigned to kmeans_executor, but the workflow requires kmeans_executor to be a RayActorPoolExecutor, not XennaExecutor(). This will cause a runtime error: "kmeans_executor must be an instance of RayActorPoolExecutor."
```diff
- # Run with XennaExecutor (GPU-accelerated)
- executor = XennaExecutor()
- results = workflow.run(executor)
+ # Run with RayActorPoolExecutor for kmeans and XennaExecutor for pairwise (GPU-accelerated)
+ from nemo_curator.backends.experimental.ray_actor_pool import RayActorPoolExecutor
+ kmeans_executor = RayActorPoolExecutor()
+ pairwise_executor = XennaExecutor()
+ results = workflow.run(kmeans_executor=kmeans_executor, pairwise_executor=pairwise_executor)
```
Description

Usage

```python
# Add snippet demonstrating usage
```

Checklist