Updating SDE comparison mode to accept two manifests by Jorjeous · Pull Request #15500 · NVIDIA-NeMo/NeMo

Jorjeous · 2026-03-16T11:10:48Z

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Collection: [Note which collection this PR will affect]

Changelog

Add specific line by line info of high level changes in this PR.

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Related to # (issue)


tools/speech_data_explorer/data_explorer.py

 import operator
 import os
 import pickle
+import tarfile


To fix an unused import, you remove the import statement (or, if the symbol is meant to be used, you instead add corresponding code that uses it). Here, the simplest and safest way, without changing functionality, is to delete the import tarfile line.

Concretely, in tools/speech_data_explorer/data_explorer.py, remove line 27 containing import tarfile. No additional methods, imports, or definitions are needed. All other imports and code remain unchanged.

tools/speech_data_explorer/data_explorer.py

+    for line in lines[1:]:  # Skip header line
+        parts = line.split()
+        if len(parts) >= 4:
+            file_type = parts[0]


In general, to fix an unused local variable you either remove the variable entirely or rename it to a clearly “unused” name (such as _ or unused_file_type) if it is kept for documentation or unpacking purposes. This preserves code clarity and avoids misleading readers into thinking the value is relevant to the logic.

Here, the right-hand side (parts[0]) has no side effects, so we can safely remove just the file_type assignment without affecting behavior. The best minimal fix is to delete the file_type = parts[0] line, leaving offset, size, and filename unchanged. No additional imports or definitions are needed, and no other parts of parse_dali_index depend on file_type. All edits occur in tools/speech_data_explorer/data_explorer.py within the shown parse_dali_index function.
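A minimal sketch of the loop after that fix is shown below. The function name `parse_index_lines`, the column order (`<type> <offset> <size> <filename>`), and the sample rows are assumptions made for illustration; the PR only shows the first four lines of the loop inside `parse_dali_index`.

```python
from typing import List, Tuple


def parse_index_lines(lines: List[str]) -> List[Tuple[int, int, str]]:
    """Return (offset, size, filename) for each data row of an index file.

    Assumed layout (not confirmed by the PR): one header line, then rows of
    "<type> <offset> <size> <filename>" separated by whitespace.
    """
    entries = []
    for line in lines[1:]:  # Skip header line
        parts = line.split()
        if len(parts) >= 4:
            # parts[0] (the file type) is deliberately not bound to a name,
            # which is what removes the unused-variable notice.
            offset = int(parts[1])
            size = int(parts[2])
            filename = parts[3]
            entries.append((offset, size, filename))
    return entries


# Hypothetical sample input; the last row is too short and is skipped.
sample = [
    "v1.2 2",
    "wav 1536 32000 utt_0001.wav",
    "wav 34816 16000 utt_0002.wav",
    "malformed line",
]
entries = parse_index_lines(sample)
```

Rows with fewer than four fields are skipped silently, mirroring the `len(parts) >= 4` guard in the diff.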

tools/speech_data_explorer/data_explorer.py

+
+# Handle dual-manifest mode: merge the two manifests into one temp manifest
+# and rewrite names_compared to use the auto-generated pred_text_{name} field names.
+_merged_tmp = None  # keep reference so temp file is not deleted


To fix this while preserving behavior, keep the assignment (so the temp file object stays referenced) but rename the variable to follow the project’s “intentionally unused” naming pattern. That makes it clear that the variable is only there for its side effect (controlling object lifetime) and satisfies CodeQL.

Concretely, in tools/speech_data_explorer/data_explorer.py around line 1311, change _merged_tmp to something like _unused_merged_tmp_ref both in its declaration and where the return from merge_manifests is unpacked. No imports or additional definitions are needed. Functionality is unchanged: the temporary file’s reference is still kept alive; only the variable name changes.
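Why the reference matters can be sketched as follows, assuming the temp file is created with `tempfile.NamedTemporaryFile` and its default `delete=True` (the PR does not show how `merge_manifests` creates it). The helper `write_temp_manifest` is hypothetical.

```python
import os
import tempfile


def write_temp_manifest(text):
    """Hypothetical helper: write text to a temp file, return (path, handle)."""
    # With delete=True the file is removed as soon as the handle is closed,
    # which also happens when the handle object is garbage-collected.
    tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=True)
    tmp.write(text)
    tmp.flush()
    return tmp.name, tmp


# Holding the second return value keeps the file on disk.
path, _unused_tmp_ref = write_temp_manifest('{"text": "hello"}\n')
exists_while_referenced = os.path.exists(path)

# Closing the handle (or dropping the last reference to it) deletes the file.
_unused_tmp_ref.close()
exists_after_close = os.path.exists(path)
```

The variable is never read, yet discarding it would let the file disappear before the merged manifest is loaded, so renaming it rather than deleting it is the behavior-preserving choice.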

karpnv and others added 18 commits January 27, 2026 18:13

read manifest from s3

fe21e7e

Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>

Apply isort and black reformatting

9069210

Signed-off-by: karpnv <karpnv@users.noreply.github.com>

s3cfg parameter

89a595f

Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>

Merge branch 'main' of github.com:NVIDIA/NeMo into karpnv/sde_s3

d67ec95

file range

da895cb

Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>

Apply isort and black reformatting

b79a0da

Signed-off-by: karpnv <karpnv@users.noreply.github.com>

Avoid downloading of full tar, instead extracting specific audio file…

fce458b

…. Updaetd logging system

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

Apply isort and black reformatting

3e5f4ec

Signed-off-by: Jorjeous <Jorjeous@users.noreply.github.com>

Merge branch 'main' of github.com:NVIDIA/NeMo into karpnv/sde_s3

391b045

shard_index + 1

64e662f

Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>

Merge branch 'main' of github.com:NVIDIA/NeMo into karpnv/sde_s3

0e043e3

Merge branch 'main' into karpnv/sde_s3

850fd4c

Undo latest changes, as it was dataset specific

69500f6

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

Merge branch 'main' of github.com:NVIDIA/NeMo into karpnv/sde_s3

ba40e0a

update table to not fail on "non-string format", update bucketing and…

2bae4dc

… sharding with separate numeration.

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

Merge branch 'main' into karpnv/sde_s3

3803d5d

add ability to read two manifests in comparison mode

9ae783f

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

Apply isort and black reformatting

39fb4e8

Signed-off-by: Jorjeous <Jorjeous@users.noreply.github.com>

github-advanced-security bot found potential problems Mar 16, 2026


Jorjeous requested review from andrusenkoau and karpnv March 16, 2026 13:12

@@ -1308,10 +1308,10 @@
             # Handle dual-manifest mode: merge the two manifests into one temp manifest
             # and rewrite names_compared to use the auto-generated pred_text_{name} field names.
-            _merged_tmp = None  # keep reference so temp file is not deleted
+            _unused_merged_tmp_ref = None  # keep reference so temp file is not deleted
             if dual_manifest_mode:
                 model_name_1, model_name_2 = args.names_compared
-                merged_manifest_path, _merged_tmp = merge_manifests(args.manifest[0], args.manifest[1], model_name_1, model_name_2)
+                merged_manifest_path, _unused_merged_tmp_ref = merge_manifests(args.manifest[0], args.manifest[1], model_name_1, model_name_2)
                 data_filename = merged_manifest_path
                 args.names_compared = [f'pred_text_{model_name_1}', f'pred_text_{model_name_2}']
                 logging.info(f"Dual-manifest mode: using merged manifest at {merged_manifest_path}")

Jorjeous wants to merge 18 commits into main from SDE_NC_Afeat


2 participants
