Benchmark: Model benchmark - deterministic training support#731
Aishwarya-Tonpe wants to merge 3 commits into main
Conversation
@microsoft-github-policy-service agree company="Microsoft"
Codecov Report
❌ Patch coverage is
Additional details and impacted files:
@@ Coverage Diff @@
##             main     #731      +/-   ##
==========================================
- Coverage   85.70%   85.69%   -0.02%
==========================================
  Files         102      103       +1
  Lines        7703     7890     +187
==========================================
+ Hits         6602     6761     +159
- Misses       1101     1129      +28
Thanks for addressing all the comments. Since this is a big PR, could we do an apples-to-apples comparison before merging it? For example,
Tested and compared all the 3 items listed above. Looks good.
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.
Force-pushed f2c7554 to f831f73, then f831f73 to 840c62f.
Copilot reviewed 18 out of 18 changed files in this pull request and generated 6 comments.
Resolved (outdated) review comment on tests/benchmarks/model_benchmarks/test_pytorch_determinism_all.py.
Force-pushed 840c62f to 181b9ad, then 181b9ad to 20c1fac.
Copilot reviewed 18 out of 18 changed files in this pull request and generated 7 comments.
Force-pushed 20c1fac to 2803619, then 2803619 to 34689f9.
Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.
Force-pushed 34689f9 to c163ddb, then c163ddb to b5ad62a.
Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.
Force-pushed b5ad62a to a6ce77c, then a6ce77c to 2b52174.
Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.
abuccts left a comment:
Please revert unrelated changes in this PR, e.g., third_party/gpu-burn.
Force-pushed 2b52174 to cb1f50b.
Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.
    torch.use_deterministic_algorithms(True, warn_only=False)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Disable TF32 to remove potential numerical variability
    try:
Enabling determinism here mutates global PyTorch state (use_deterministic_algorithms, cuDNN deterministic/benchmark, TF32/SDP backend flags) but the previous values are never restored. Because SuperBench launches benchmarks sequentially in the same Python process, this can unintentionally affect later benchmarks/tests that did not request determinism (performance changes or unexpected deterministic-op errors). Consider saving the prior backend settings on enable and restoring them in _postprocess() so determinism is scoped to the benchmark run.
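One way to scope the mutation, as the comment suggests, is to snapshot the flags before overriding them and restore them afterwards. A minimal sketch of that save/restore pattern follows; `scoped_flags` is a hypothetical helper, and `SimpleNamespace` stands in for `torch.backends.cudnn` here so the sketch has no torch dependency:

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def scoped_flags(obj, **overrides):
    """Temporarily set attributes on `obj`, restoring prior values on exit.

    In SuperBench this pattern could wrap torch.backends.cudnn
    (`deterministic`, `benchmark`) on enable and run the restore in
    _postprocess(), so determinism stays scoped to one benchmark run.
    """
    saved = {name: getattr(obj, name) for name in overrides}
    try:
        for name, value in overrides.items():
            setattr(obj, name, value)
        yield obj
    finally:
        for name, value in saved.items():
            setattr(obj, name, value)


# Stand-in for torch.backends.cudnn (hypothetical, to keep the sketch runnable).
cudnn = SimpleNamespace(deterministic=False, benchmark=True)
with scoped_flags(cudnn, deterministic=True, benchmark=False):
    assert cudnn.deterministic and not cudnn.benchmark
# Prior state is restored once the scoped block exits.
assert not cudnn.deterministic and cudnn.benchmark
```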
    sample = self._raw_data_df[metric].iloc[0]
    if isinstance(sample, float):
        # Keep full precision for deterministic metrics to avoid false positives in diagnosis
        if 'deterministic' in metric:
            return float(val)
Type-checking against built-in float/int here won't match common pandas scalar types (e.g., numpy.float64/numpy.int64), so the rounding/formatting branch may never run in practice (and the special-case for deterministic metrics may be bypassed unintentionally). Consider checking against numbers.Real / numbers.Integral (or pandas.api.types / numpy.floating & numpy.integer) instead of float/int so formatting behaves consistently for DataFrame-backed values.
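To illustrate the point, here is a small sketch of the suggested `numbers`-based check; `format_metric` is a hypothetical helper, not the PR's code, and `Fraction` plays the role of a non-`float` real scalar the same way `numpy.int64` is a non-`int` integral scalar:

```python
import numbers
from fractions import Fraction


def format_metric(value):
    """Type-robust formatting sketch: numbers.Integral / numbers.Real
    match numpy scalars (numpy.int64, numpy.float64) as well as built-ins,
    whereas a bare isinstance(value, int) check misses numpy.int64.
    """
    if isinstance(value, numbers.Integral):
        return int(value)
    if isinstance(value, numbers.Real):
        return round(float(value), 6)
    return value


# Fraction is a numbers.Real but not a float, so isinstance(..., float)
# would skip the rounding branch entirely; the ABC check catches it.
assert format_metric(Fraction(1, 3)) == round(1 / 3, 6)
assert format_metric(7) == 7
```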
    # Add _rank0 suffix to deterministic metrics for compatibility with rules
    if metric.startswith('deterministic_'):
Deterministic metrics are being renamed by appending "_rank0" unconditionally. In non-distributed runs PytorchBase emits metrics without any rank suffix, and in distributed runs it may already include a rank suffix, so this can produce inconsistent keys (or even double-suffixed names like "..._rank0_rank0") and make baseline/diagnosis rules harder to write. Consider preserving metric names as-is, or only adding a rank suffix when the benchmark result actually contains per-rank metrics.
Suggested change:
    # Add _rank0 suffix to deterministic metrics that don't already have a rank suffix
    if metric.startswith('deterministic_') and '_rank' not in metric:
Adds an opt-in deterministic training mode to SuperBench's PyTorch model benchmarks. When enabled via --enable-determinism, PyTorch deterministic algorithms are enforced and per-step numerical fingerprints (loss, activation means) are recorded as metrics. These can be compared across runs using the existing sb result diagnosis pipeline to verify bit-exact reproducibility, which is useful for hardware validation and platform comparison.
Flags added:
--enable-determinism
--check-frequency: number of steps after which the metrics are recorded
--deterministic-seed
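To make the fingerprint idea concrete, here is a hedged sketch of recording a loss fingerprint every `check_frequency` steps; `record_fingerprints` and the metric naming are illustrative assumptions, not the PR's actual implementation (in the real benchmark the values would come from training):

```python
def record_fingerprints(losses, check_frequency=2):
    """Record a full-precision loss fingerprint every `check_frequency` steps.

    Keeping full precision means two bit-exact runs emit identical metric
    values, so the diagnosis pipeline can flag any divergence.
    """
    metrics = {}
    for step, loss in enumerate(losses, start=1):
        if step % check_frequency == 0:
            metrics[f'deterministic_loss_step{step}'] = float(loss)
    return metrics


assert record_fingerprints([0.9, 0.8, 0.7, 0.6], check_frequency=2) == {
    'deterministic_loss_step2': 0.8,
    'deterministic_loss_step4': 0.6,
}
```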
Changes:
Updated pytorch_base.py to handle deterministic settings and logging.
Added a new example script: pytorch_deterministic_example.py
Added a test file, test_pytorch_determinism_all.py, to verify everything works as expected.
Usage:
Step 1 (Run 1): Run with --enable-determinism; the necessary metrics are recorded in the results-summary.jsonl file.
Step 2: Generate the baseline file from the Run 1 results using sb result generate-baseline.
Step 3 (Run 2): Run with --enable-determinism on a different machine (or the same machine); the metrics are again recorded in results-summary.jsonl.
Step 4: Run diagnosis on the results from the two runs using the sb result diagnosis command.
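The four steps above can be sketched as a command sequence; this is a CLI fragment only, and everything besides the subcommand names (`sb run`, `sb result generate-baseline`, `sb result diagnosis`) is a placeholder rather than verified option syntax:

```shell
# Run 1: benchmark config enables determinism (--enable-determinism)
sb run ...

# Build a baseline from Run 1's results-summary.jsonl
sb result generate-baseline ...

# Run 2: repeat on the same or a different machine
sb run ...

# Compare Run 2's results-summary.jsonl against the baseline
sb result diagnosis ...
```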
Note -