Benchmark: Model benchmark - deterministic training support#731
Aishwarya-Tonpe wants to merge 3 commits into main
Conversation
@microsoft-github-policy-service agree company="Microsoft"
Codecov Report
❌ Patch coverage is
Additional details and impacted files:
@@ Coverage Diff @@
##             main     #731      +/-   ##
==========================================
- Coverage   85.70%   85.69%   -0.02%
==========================================
  Files         102      103       +1
  Lines        7703     7890     +187
==========================================
+ Hits         6602     6761     +159
- Misses       1101     1129      +28
Thanks for addressing all the comments. Since this is a big PR, could we do an apples-to-apples comparison before merging it? For example,
Tested and compared all the 3 items listed above. Looks good.
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.
Force-pushed f2c7554 to f831f73, then f831f73 to 840c62f.
Copilot reviewed 18 out of 18 changed files in this pull request and generated 6 comments.
Resolved (outdated) review comment on tests/benchmarks/model_benchmarks/test_pytorch_determinism_all.py.
Force-pushed 840c62f to 181b9ad, then 181b9ad to 20c1fac.
Copilot reviewed 18 out of 18 changed files in this pull request and generated 7 comments.
Force-pushed 20c1fac to 2803619, then 2803619 to 34689f9.
Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.
Force-pushed 34689f9 to c163ddb, then c163ddb to b5ad62a.
Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.
Force-pushed b5ad62a to a6ce77c, then a6ce77c to 2b52174.
Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.
abuccts left a comment:
Please revert unrelated changes in this PR, e.g., third_party/gpu-burn.
Force-pushed 2b52174 to cb1f50b.
Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.
    torch.use_deterministic_algorithms(True, warn_only=False)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Disable TF32 to remove potential numerical variability
    try:
Enabling determinism here mutates global PyTorch state (use_deterministic_algorithms, cuDNN deterministic/benchmark, TF32/SDP backend flags) but the previous values are never restored. Because SuperBench launches benchmarks sequentially in the same Python process, this can unintentionally affect later benchmarks/tests that did not request determinism (performance changes or unexpected deterministic-op errors). Consider saving the prior backend settings on enable and restoring them in _postprocess() so determinism is scoped to the benchmark run.
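One way to scope the mutation, as the comment suggests, is to snapshot the flags before overriding them and restore them afterwards. A minimal sketch of that save/restore pattern follows; `scoped_flags` is a hypothetical helper, and `SimpleNamespace` stands in for `torch.backends.cudnn` here so the sketch has no torch dependency:

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def scoped_flags(obj, **overrides):
    """Temporarily set attributes on `obj`, restoring prior values on exit.

    In SuperBench this pattern could wrap torch.backends.cudnn
    (`deterministic`, `benchmark`) on enable and run the restore in
    _postprocess(), so determinism stays scoped to one benchmark run.
    """
    saved = {name: getattr(obj, name) for name in overrides}
    try:
        for name, value in overrides.items():
            setattr(obj, name, value)
        yield obj
    finally:
        for name, value in saved.items():
            setattr(obj, name, value)


# Stand-in for torch.backends.cudnn (hypothetical, to keep the sketch runnable).
cudnn = SimpleNamespace(deterministic=False, benchmark=True)
with scoped_flags(cudnn, deterministic=True, benchmark=False):
    assert cudnn.deterministic and not cudnn.benchmark
# Prior state is restored once the scoped block exits.
assert not cudnn.deterministic and cudnn.benchmark
```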
    sample = self._raw_data_df[metric].iloc[0]
    if isinstance(sample, float):
        # Keep full precision for deterministic metrics to avoid false positives in diagnosis
        if 'deterministic' in metric:
            return float(val)
Type-checking against built-in float/int here won't match common pandas scalar types (e.g., numpy.float64/numpy.int64), so the rounding/formatting branch may never run in practice (and the special-case for deterministic metrics may be bypassed unintentionally). Consider checking against numbers.Real / numbers.Integral (or pandas.api.types / numpy.floating & numpy.integer) instead of float/int so formatting behaves consistently for DataFrame-backed values.
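To illustrate the point, here is a small sketch of the suggested `numbers`-based check; `format_metric` is a hypothetical helper, not the PR's code, and `Fraction` plays the role of a non-`float` real scalar the same way `numpy.int64` is a non-`int` integral scalar:

```python
import numbers
from fractions import Fraction


def format_metric(value):
    """Type-robust formatting sketch: numbers.Integral / numbers.Real
    match numpy scalars (numpy.int64, numpy.float64) as well as built-ins,
    whereas a bare isinstance(value, int) check misses numpy.int64.
    """
    if isinstance(value, numbers.Integral):
        return int(value)
    if isinstance(value, numbers.Real):
        return round(float(value), 6)
    return value


# Fraction is a numbers.Real but not a float, so isinstance(..., float)
# would skip the rounding branch entirely; the ABC check catches it.
assert format_metric(Fraction(1, 3)) == round(1 / 3, 6)
assert format_metric(7) == 7
```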
    # Add _rank0 suffix to deterministic metrics for compatibility with rules
    if metric.startswith('deterministic_'):
Deterministic metrics are being renamed by appending "_rank0" unconditionally. In non-distributed runs PytorchBase emits metrics without any rank suffix, and in distributed runs it may already include a rank suffix, so this can produce inconsistent keys (or even double-suffixed names like "..._rank0_rank0") and make baseline/diagnosis rules harder to write. Consider preserving metric names as-is, or only adding a rank suffix when the benchmark result actually contains per-rank metrics.
Suggested change:
    # Add _rank0 suffix to deterministic metrics that don't already have a rank suffix
    if metric.startswith('deterministic_') and '_rank' not in metric:
Adds an opt-in deterministic training mode to SuperBench's PyTorch model benchmarks. When enabled via --enable-determinism, PyTorch deterministic algorithms are enforced and per-step numerical fingerprints (loss, activation means) are recorded as metrics. These can be compared across runs using the existing sb result diagnosis pipeline to verify bit-exact reproducibility, which is useful for hardware validation and platform comparison.
Flags added:
--enable-determinism
--check-frequency: number of steps after which the metrics are recorded
--deterministic-seed
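To make the fingerprint idea concrete, here is a hedged sketch of recording a loss fingerprint every `check_frequency` steps; `record_fingerprints` and the metric naming are illustrative assumptions, not the PR's actual implementation (in the real benchmark the values would come from training):

```python
def record_fingerprints(losses, check_frequency=2):
    """Record a full-precision loss fingerprint every `check_frequency` steps.

    Keeping full precision means two bit-exact runs emit identical metric
    values, so the diagnosis pipeline can flag any divergence.
    """
    metrics = {}
    for step, loss in enumerate(losses, start=1):
        if step % check_frequency == 0:
            metrics[f'deterministic_loss_step{step}'] = float(loss)
    return metrics


assert record_fingerprints([0.9, 0.8, 0.7, 0.6], check_frequency=2) == {
    'deterministic_loss_step2': 0.8,
    'deterministic_loss_step4': 0.6,
}
```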
Changes:
Updated pytorch_base.py to handle deterministic settings and logging.
Added a new example script: pytorch_deterministic_example.py
Added a test file, test_pytorch_determinism_all.py, to verify everything works as expected.
Usage:
Step 1 (Run 1): Run with --enable-determinism; the necessary metrics are recorded in the results-summary.jsonl file.
Step 2: Generate the baseline file from the Run 1 results using sb result generate-baseline.
Step 3 (Run 2): Run with --enable-determinism on a different machine (or the same machine); the metrics are again recorded in results-summary.jsonl.
Step 4: Run diagnosis on the results from the two runs using the sb result diagnosis command.
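The four steps above can be sketched as a command sequence; this is a CLI fragment only, and everything besides the subcommand names (`sb run`, `sb result generate-baseline`, `sb result diagnosis`) is a placeholder rather than verified option syntax:

```shell
# Run 1: benchmark config enables determinism (--enable-determinism)
sb run ...

# Build a baseline from Run 1's results-summary.jsonl
sb result generate-baseline ...

# Run 2: repeat on the same or a different machine
sb run ...

# Compare Run 2's results-summary.jsonl against the baseline
sb result diagnosis ...
```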
Note -