Skip to content

Updating SDE comparison mode to accept two manifests#15500

Open
Jorjeous wants to merge 18 commits intomainfrom
SDE_NC_Afeat
Open

Updating SDE comparison mode to accept two manifests#15500
Jorjeous wants to merge 18 commits intomainfrom
SDE_NC_Afeat

Conversation

@Jorjeous
Copy link
Member

Important

The Update branch button must only be pressed in very rare occassions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

karpnv and others added 18 commits January 27, 2026 18:13
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
Signed-off-by: karpnv <karpnv@users.noreply.github.com>
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
Signed-off-by: karpnv <karpnv@users.noreply.github.com>
…. Updaetd logging system

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: Jorjeous <Jorjeous@users.noreply.github.com>
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
… sharding with separate numeration.

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: Jorjeous <Jorjeous@users.noreply.github.com>
import operator
import os
import pickle
import tarfile

Check notice

Code scanning / CodeQL

Unused import Note

Import of 'tarfile' is not used.

Copilot Autofix

AI 5 days ago

To fix an unused import, you remove the import statement (or, if the symbol is meant to be used, you instead add corresponding code that uses it). Here, the simplest and safest way, without changing functionality, is to delete the import tarfile line.

Concretely, in tools/speech_data_explorer/data_explorer.py, remove line 27 containing import tarfile. No additional methods, imports, or definitions are needed. All other imports and code remain unchanged.

Suggested changeset 1
tools/speech_data_explorer/data_explorer.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/tools/speech_data_explorer/data_explorer.py b/tools/speech_data_explorer/data_explorer.py
--- a/tools/speech_data_explorer/data_explorer.py
+++ b/tools/speech_data_explorer/data_explorer.py
@@ -24,7 +24,6 @@
 import operator
 import os
 import pickle
-import tarfile
 import tempfile
 from collections import defaultdict
 from os.path import expanduser
EOF
@@ -24,7 +24,6 @@
import operator
import os
import pickle
import tarfile
import tempfile
from collections import defaultdict
from os.path import expanduser
Copilot is powered by AI and may make mistakes. Always verify output.
for line in lines[1:]: # Skip header line
parts = line.split()
if len(parts) >= 4:
file_type = parts[0]

Check notice

Code scanning / CodeQL

Unused local variable Note

Variable file_type is not used.

Copilot Autofix

AI 5 days ago

In general, to fix an unused local variable you either remove the variable entirely or rename it to a clearly “unused” name (such as _ or unused_file_type) if it is kept for documentation or unpacking purposes. This preserves code clarity and avoids misleading readers into thinking the value is relevant to the logic.

Here, the right-hand side (parts[0]) has no side effects, so we can safely remove just the file_type assignment without affecting behavior. The best minimal fix is to delete the file_type = parts[0] line, leaving offset, size, and filename unchanged. No additional imports or definitions are needed, and no other parts of parse_dali_index depend on file_type. All edits occur in tools/speech_data_explorer/data_explorer.py within the shown parse_dali_index function.

Suggested changeset 1
tools/speech_data_explorer/data_explorer.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/tools/speech_data_explorer/data_explorer.py b/tools/speech_data_explorer/data_explorer.py
--- a/tools/speech_data_explorer/data_explorer.py
+++ b/tools/speech_data_explorer/data_explorer.py
@@ -277,7 +277,6 @@
     for line in lines[1:]:  # Skip header line
         parts = line.split()
         if len(parts) >= 4:
-            file_type = parts[0]
             offset = int(parts[1])
             size = int(parts[2])
             filename = parts[3]
EOF
@@ -277,7 +277,6 @@
for line in lines[1:]: # Skip header line
parts = line.split()
if len(parts) >= 4:
file_type = parts[0]
offset = int(parts[1])
size = int(parts[2])
filename = parts[3]
Copilot is powered by AI and may make mistakes. Always verify output.

# Handle dual-manifest mode: merge the two manifests into one temp manifest
# and rewrite names_compared to use the auto-generated pred_text_{name} field names.
_merged_tmp = None # keep reference so temp file is not deleted

Check notice

Code scanning / CodeQL

Unused global variable Note

The global variable '_merged_tmp' is not used.

Copilot Autofix

AI 5 days ago

To fix this while preserving behavior, keep the assignment (so the temp file object stays referenced) but rename the variable to follow the project’s “intentionally unused” naming pattern. That makes it clear that the variable is only there for its side effect (controlling object lifetime) and satisfies CodeQL.

Concretely, in tools/speech_data_explorer/data_explorer.py around line 1311, change _merged_tmp to something like _unused_merged_tmp_ref both in its declaration and where the return from merge_manifests is unpacked. No imports or additional definitions are needed. Functionality is unchanged: the temporary file’s reference is still kept alive; only the variable name changes.

Suggested changeset 1
tools/speech_data_explorer/data_explorer.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/tools/speech_data_explorer/data_explorer.py b/tools/speech_data_explorer/data_explorer.py
--- a/tools/speech_data_explorer/data_explorer.py
+++ b/tools/speech_data_explorer/data_explorer.py
@@ -1308,10 +1308,10 @@
 
 # Handle dual-manifest mode: merge the two manifests into one temp manifest
 # and rewrite names_compared to use the auto-generated pred_text_{name} field names.
-_merged_tmp = None  # keep reference so temp file is not deleted
+_unused_merged_tmp_ref = None  # keep reference so temp file is not deleted
 if dual_manifest_mode:
     model_name_1, model_name_2 = args.names_compared
-    merged_manifest_path, _merged_tmp = merge_manifests(args.manifest[0], args.manifest[1], model_name_1, model_name_2)
+    merged_manifest_path, _unused_merged_tmp_ref = merge_manifests(args.manifest[0], args.manifest[1], model_name_1, model_name_2)
     data_filename = merged_manifest_path
     args.names_compared = [f'pred_text_{model_name_1}', f'pred_text_{model_name_2}']
     logging.info(f"Dual-manifest mode: using merged manifest at {merged_manifest_path}")
EOF
@@ -1308,10 +1308,10 @@

# Handle dual-manifest mode: merge the two manifests into one temp manifest
# and rewrite names_compared to use the auto-generated pred_text_{name} field names.
_merged_tmp = None # keep reference so temp file is not deleted
_unused_merged_tmp_ref = None # keep reference so temp file is not deleted
if dual_manifest_mode:
model_name_1, model_name_2 = args.names_compared
merged_manifest_path, _merged_tmp = merge_manifests(args.manifest[0], args.manifest[1], model_name_1, model_name_2)
merged_manifest_path, _unused_merged_tmp_ref = merge_manifests(args.manifest[0], args.manifest[1], model_name_1, model_name_2)
data_filename = merged_manifest_path
args.names_compared = [f'pred_text_{model_name_1}', f'pred_text_{model_name_2}']
logging.info(f"Dual-manifest mode: using merged manifest at {merged_manifest_path}")
Copilot is powered by AI and may make mistakes. Always verify output.
@Jorjeous Jorjeous requested review from andrusenkoau and karpnv March 16, 2026 13:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants