Commit e918374
[tt-train] Add training log comparison plotting script (#37531)
## Changes:
- Add `scripts/plot_training_comparison.py` for comparing training runs
- Add `docs/TRAINING_LOG_COMPARISON.md` with usage documentation
- Supports comparing baseline vs optimized runs with loss/step time plots
- Generates summary statistics with speedup calculations
### Problem description
When developing kernel optimizations (e.g., operation fusion, memory
layout changes), we need to verify two critical properties:
1. **Correctness**: The optimization doesn't degrade training quality
(loss curves should match baseline)
2. **Performance**: The optimization improves execution time (step time
should decrease)
Previously, this analysis required manual log parsing and plotting,
making it difficult to systematically compare multiple optimization
strategies or track improvements across iterations.
### What's changed
Added a general-purpose training log comparison tool in
`tt-train/scripts/`:
**Script features:**
- Parses tt-train binary logs (e.g., `nano_gpt`) for loss values and
step times
- Generates three comparison plots:
  - `losses.png` - overlaid loss curves for all runs
  - `losses_diff.png` - loss difference relative to baseline
  - `step_time.png` - per-step execution time comparison
- Prints summary statistics: mean step time, speedup factors, final loss
- Supports multiple runs with custom labels and configurable warmup
periods
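
The parsing step can be sketched roughly as follows. This is a minimal illustration, not the script's actual implementation: the log line shapes (`Step: N, Loss: X` and `Full step time X ms`), the regular expressions, and the `parse_log` name are all assumptions made for the example. The real format requirements are described in `TRAINING_LOG_COMPARISON.md`.

```python
import re

# ASSUMPTION: these line shapes are hypothetical; adjust to the real log format.
NUMBER = r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?"
LOSS_RE = re.compile(rf"Step:\s*(\d+),\s*Loss:\s*({NUMBER})")
STEP_TIME_RE = re.compile(rf"Full step time\s+({NUMBER})\s*ms")


def parse_log(text):
    """Return (losses, step_times_ms) extracted from raw log text."""
    losses, step_times = [], []
    for line in text.splitlines():
        match = LOSS_RE.search(line)
        if match:
            losses.append(float(match.group(2)))
            continue
        match = STEP_TIME_RE.search(line)
        if match:
            step_times.append(float(match.group(1)))
    return losses, step_times


# Tiny synthetic log, written only to exercise the parser.
sample = """\
Step: 1, Loss: 4.25
Full step time 250.5 ms
Step: 2, Loss: 4.0
Full step time 240.0 ms
"""
losses, step_times = parse_log(sample)
```

Keeping losses and step times in separate lists allows a log to contain a different number of each, as in the example output further down.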
**Usage:**
```bash
python scripts/plot_training_comparison.py \
--baseline run_baseline.txt \
--compare run_optimized.txt run_fused.txt \
--labels baseline optimized fused \
--output-dir ./plots
```
**Documentation:**
- Added `TRAINING_LOG_COMPARISON.md` with:
  - Quick start guide
  - Log format requirements
  - Usage examples for kernel optimization workflows
  - Output interpretation guidelines
### Example: RMSNorm Composite vs Fused (NanoLlama, 1000 steps, Shakespeare)
```
$ python tt-train/scripts/plot_training_comparison.py \
--baseline rmsnorm_composite.txt \
--compare rmsnorm_fused.txt \
--labels "RMSNorm Composite" "RMSNorm Fused" \
--output-dir ./plots --title-prefix "NanoLlama RMSNorm: "
Parsing log files...
RMSNorm Composite: 1000 loss values, 984 step times
RMSNorm Fused: 1000 loss values, 984 step times
============================================================
SUMMARY STATISTICS
============================================================
Mean Step Times:
RMSNorm Composite: 242.87 ms (std: 2.14 ms)
RMSNorm Fused: 217.91 ms (std: 1.35 ms)
Speedup relative to 'RMSNorm Composite':
RMSNorm Fused: 1.115x
Final Loss (last 100 steps average):
RMSNorm Composite: 1.052969
RMSNorm Fused: 1.057578
Generating plots...
Saved: tt-train/docs/example_plots/losses.png
Saved: tt-train/docs/example_plots/step_time.png
Saved: tt-train/docs/example_plots/losses_diff.png
Done!
```
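
The summary numbers above can be reproduced with a few lines of arithmetic. The sketch below is an assumption about how such statistics might be computed, not the script's code: the `summarize` and `speedup` names, the `warmup_steps` and `final_window` parameters, and the use of population standard deviation are all hypothetical choices.

```python
from statistics import mean, pstdev


def summarize(step_times_ms, losses, warmup_steps=0, final_window=100):
    """Mean/std of step time after warmup, and mean loss over the last window."""
    timed = step_times_ms[warmup_steps:]  # drop warmup steps before averaging
    return {
        "mean_ms": mean(timed),
        "std_ms": pstdev(timed),  # ASSUMPTION: population std; the script may differ
        "final_loss": mean(losses[-final_window:]),
    }


def speedup(baseline_mean_ms, run_mean_ms):
    """Ratio of mean step times; values above 1.0 mean the run is faster."""
    return baseline_mean_ms / run_mean_ms


# Mean step times taken from the example output above.
print(f"{speedup(242.87, 217.91):.3f}x")  # prints 1.115x
```

With this definition, the reported 1.115x corresponds to the fused run taking about 10% less time per step than the composite baseline (217.91 ms vs 242.87 ms).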
**Loss Comparison:**
<img width="2457" height="1314" alt="losses"
src="https://github.com/user-attachments/assets/88d0220b-dd0b-41e1-b31a-81a736ff6246"
/>
**Loss Difference (Fused vs Composite):**
<img width="2499" height="1314" alt="losses_diff"
src="https://github.com/user-attachments/assets/79df484e-53c5-4568-a396-640ffa5c3b3f"
/>
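
A loss-difference series like the one plotted above can be derived per step from two parsed runs. The following is a minimal sketch under stated assumptions: the `loss_diff` helper is hypothetical, and the sign convention (run minus baseline) is a guess rather than something the commit message specifies.

```python
def loss_diff(baseline_losses, run_losses):
    """Per-step difference (run minus baseline) over the steps both runs share."""
    shared = min(len(baseline_losses), len(run_losses))
    return [run_losses[i] - baseline_losses[i] for i in range(shared)]


# Final-loss averages from the example output, used as a one-point illustration.
composite_final = 1.052969
fused_final = 1.057578
delta = fused_final - composite_final
relative_pct = 100.0 * delta / composite_final

print(f"{delta:.6f}")         # prints 0.004609
print(f"{relative_pct:.2f}%")  # prints 0.44%
```

Under this convention a positive value means the compared run has a higher loss than the baseline at that step. For the example above, the final-loss gap is about 0.44% of the baseline value.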
**Step Time Comparison:**
<img width="2466" height="1314" alt="step_time"
src="https://github.com/user-attachments/assets/55934c80-c6d0-46bb-a0e3-b7afb7574132"
/>
### Checklist
- [ ] [all-post-commit-workflows](https://github.com/tenstorrent/tt-metal/actions/workflows/all-post-commit-workflows.yaml?query=branch:mdragula/training-log-analysis-plots)
- [ ] [blackhole-post-commit](https://github.com/tenstorrent/tt-metal/actions/workflows/blackhole-post-commit.yaml?query=branch:mdragula/training-log-analysis-plots)
- [ ] [tt-train-nightly](https://github.com/tenstorrent/tt-metal/actions/workflows/tt-train-nightly.yaml?query=branch:mdragula/training-log-analysis-plots)
- [x] Documentation-only change (no functional code changes)
#### Notes
This is a developer tooling/documentation addition. No model tests
required as this only adds analysis scripts for comparing training logs.
---------
Co-authored-by: Cursor <cursoragent@cursor.com>

1 parent cbc4214 commit e918374
2 files changed (+533, -0) under `tt-train/docs` and `tt-train/scripts`.