# [tt-train] Add training log comparison plotting script (#37531)
## Changes

- Add `scripts/plot_training_comparison.py` for comparing training runs
- Add `docs/TRAINING_LOG_COMPARISON.md` with usage documentation
- Supports comparing baseline vs optimized runs with loss/step time plots
- Generates summary statistics with speedup calculations

### Problem description

When developing kernel optimizations (e.g., operation fusion, memory layout changes), we need to verify two critical properties:

1. **Correctness**: The optimization doesn't degrade training quality (loss curves should match baseline)
2. **Performance**: The optimization improves execution time (step time should decrease)

Previously, this analysis required manual log parsing and plotting, making it difficult to systematically compare multiple optimization strategies or track improvements across iterations.

### What's changed

Added a general-purpose training log comparison tool in `tt-train/scripts/`.

**Script features:**
- Parses tt-train binary logs (e.g., `nano_gpt`) for loss values and step times
- Generates three comparison plots:
  - `losses.png` - overlaid loss curves for all runs
  - `losses_diff.png` - loss difference relative to baseline
  - `step_time.png` - per-step execution time comparison
- Prints summary statistics: mean step time, speedup factors, final loss
- Supports multiple runs with custom labels and configurable warmup periods

**Usage:**

```bash
python scripts/plot_training_comparison.py \
    --baseline run_baseline.txt \
    --compare run_optimized.txt run_fused.txt \
    --labels baseline optimized fused \
    --output-dir ./plots
```

**Documentation:**
- Added `TRAINING_LOG_COMPARISON.md` with:
  - Quick start guide
  - Log format requirements
  - Usage examples for kernel optimization workflows
  - Output interpretation guidelines

### Example: RMSNorm Composite vs Fused (NanoLlama, 1000 steps, Shakespeare)

```
$ python tt-train/scripts/plot_training_comparison.py \
    --baseline rmsnorm_composite.txt \
    --compare rmsnorm_fused.txt \
    --labels "RMSNorm Composite" "RMSNorm Fused" \
    --output-dir ./plots --title-prefix "NanoLlama RMSNorm: "

Parsing log files...
  RMSNorm Composite: 1000 loss values, 984 step times
  RMSNorm Fused: 1000 loss values, 984 step times

============================================================
SUMMARY STATISTICS
============================================================

Mean Step Times:
  RMSNorm Composite: 242.87 ms (std: 2.14 ms)
  RMSNorm Fused: 217.91 ms (std: 1.35 ms)

Speedup relative to 'RMSNorm Composite':
  RMSNorm Fused: 1.115x

Final Loss (last 100 steps average):
  RMSNorm Composite: 1.052969
  RMSNorm Fused: 1.057578

Generating plots...
  Saved: tt-train/docs/example_plots/losses.png
  Saved: tt-train/docs/example_plots/step_time.png
  Saved: tt-train/docs/example_plots/losses_diff.png

Done!
```

**Loss Comparison:**

<img width="2457" height="1314" alt="losses" src="https://github.com/user-attachments/assets/88d0220b-dd0b-41e1-b31a-81a736ff6246" />

**Loss Difference (Fused vs Composite):**

<img width="2499" height="1314" alt="losses_diff" src="https://github.com/user-attachments/assets/79df484e-53c5-4568-a396-640ffa5c3b3f" />

**Step Time Comparison:**

<img width="2466" height="1314" alt="step_time" src="https://github.com/user-attachments/assets/55934c80-c6d0-46bb-a0e3-b7afb7574132" />

### Checklist

- [ ] [![All post-commit tests](https://github.com/tenstorrent/tt-metal/actions/workflows/all-post-commit-workflows.yaml/badge.svg?branch=mdragula/training-log-analysis-plots)](https://github.com/tenstorrent/tt-metal/actions/workflows/all-post-commit-workflows.yaml?query=branch:mdragula/training-log-analysis-plots)
- [ ] [![Blackhole Post commit](https://github.com/tenstorrent/tt-metal/actions/workflows/blackhole-post-commit.yaml/badge.svg?branch=mdragula/training-log-analysis-plots)](https://github.com/tenstorrent/tt-metal/actions/workflows/blackhole-post-commit.yaml?query=branch:mdragula/training-log-analysis-plots)
- [ ] [![tt-train-nightly](https://github.com/tenstorrent/tt-metal/actions/workflows/tt-metal-l2-nightly.yaml/badge.svg?branch=mdragula/training-log-analysis-plots)](https://github.com/tenstorrent/tt-metal/actions/workflows/tt-train-nightly.yaml?query=branch:mdragula/training-log-analysis-plots)
- [x] Documentation-only change (no functional code changes)

#### Notes

This is a developer tooling/documentation addition. No model tests required as this only adds analysis scripts for comparing training logs.

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Commit e918374 · 1 parent cbc4214
File tree: 2 files changed (+533, −0 lines)
New file `docs/TRAINING_LOG_COMPARISON.md` (190 additions, 0 deletions):
# Training Log Comparison Analysis

> **Quick start:**
> 1. Run your training binary multiple times with different configurations
> 2. Save the logs to text files (e.g., `baseline.txt`, `optimized.txt`)
> 3. Run `python tt-train/scripts/plot_training_comparison.py --baseline baseline.txt --compare optimized.txt`

---

## Table of Contents

1. [Purpose](#purpose)
2. [Prerequisites](#prerequisites)
3. [Log Format](#log-format)
4. [Usage](#usage)
5. [Output](#output)
6. [Example Workflow](#example-workflow)

---
## Purpose

This tool helps evaluate the impact of kernel optimizations, fusion strategies, or configuration changes in tt-train by:

- **Comparing training loss curves** across multiple runs
- **Visualizing loss differences** relative to a baseline
- **Analyzing step time performance** to measure speedups
- **Computing summary statistics** (mean step time, speedup factors, final loss)

Use this when you've made changes to kernels (e.g., fusing operations) and want to verify:

1. The optimization doesn't degrade training quality (loss should match or improve)
2. The optimization improves performance (step time should decrease)

---
## Prerequisites

| Requirement | Version | Notes |
|-------------|---------|-------|
| Python | 3.10+ | Via `create_venv.sh` |
| NumPy | - | Included in dev dependencies |
| Matplotlib | - | Included in dev dependencies |

> **Note:** All dependencies are installed automatically when you run `create_venv.sh`.

---
## Log Format

The script expects log files from tt-train's main training binary (e.g., `nano_gpt`). The logs should contain lines in the following format:

```
Step: 1, Loss: 11.0234375
Full step time 703.141 ms
Step: 2, Loss: 10.8765432
Full step time 698.234 ms
...
```

The script extracts:
- **Loss values**: From lines matching `Step: \d+, Loss: ([\d.]+)`
- **Step times**: From lines matching `Full step time ([\d.]+) ms`
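
For reference, extraction boils down to two regular-expression scans over the log text. Below is a minimal sketch of that parsing step using the patterns documented above; the function name and return shape are illustrative, not the script's actual API:

```python
import re

# The two patterns documented above.
LOSS_RE = re.compile(r"Step: \d+, Loss: ([\d.]+)")
STEP_TIME_RE = re.compile(r"Full step time ([\d.]+) ms")

def parse_training_log(path: str) -> tuple[list[float], list[float]]:
    """Return (loss values, step times in ms) found in a tt-train log file."""
    with open(path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    losses = [float(v) for v in LOSS_RE.findall(text)]
    step_times_ms = [float(v) for v in STEP_TIME_RE.findall(text)]
    return losses, step_times_ms
```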
---

## Usage

### Basic Comparison

Compare a baseline run against an optimized version:

```bash
python tt-train/scripts/plot_training_comparison.py \
    --baseline run_baseline.txt \
    --compare run_optimized.txt
```

### Multiple Comparisons with Labels

Compare multiple optimization strategies:

```bash
python tt-train/scripts/plot_training_comparison.py \
    --baseline fw_only.txt \
    --compare fw_bw_3_packs.txt fw_bw_4_packs.txt \
    --labels "Forward Only" "FW+BW 3 Packs" "FW+BW 4 Packs"
```

### Customization Options

```bash
python tt-train/scripts/plot_training_comparison.py \
    --baseline baseline.txt \
    --compare optimized.txt \
    --output-dir ./plots \
    --warmup-steps 20 \
    --max-steps 5000 \
    --title-prefix "NanoLlama SiLU "
```

### All Options

| Option | Default | Description |
|--------|---------|-------------|
| `--baseline` | (required) | Path to baseline log file |
| `--compare` | `[]` | Paths to log files to compare |
| `--labels` | filenames | Custom labels for each run |
| `--output-dir` | `.` | Directory for output plots |
| `--warmup-steps` | `15` | Steps to skip for timing analysis |
| `--max-steps` | all | Limit steps in loss plots |
| `--title-prefix` | `""` | Prefix for plot titles |
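
The options above map onto a conventional command-line interface. The following is a rough sketch only, not the script's actual source: a plausible `argparse` shape consistent with the table and the examples, in which `--labels` supplies one label per run, baseline first:

```python
import argparse

parser = argparse.ArgumentParser(description="Compare tt-train training logs.")
parser.add_argument("--baseline", required=True, help="Path to baseline log file")
parser.add_argument("--compare", nargs="*", default=[], help="Log files to compare")
parser.add_argument("--labels", nargs="*", default=None,
                    help="One label per run, baseline first (defaults to filenames)")
parser.add_argument("--output-dir", default=".", help="Directory for output plots")
parser.add_argument("--warmup-steps", type=int, default=15,
                    help="Steps to skip for timing analysis")
parser.add_argument("--max-steps", type=int, default=None,
                    help="Limit steps shown in loss plots")
parser.add_argument("--title-prefix", default="", help="Prefix for plot titles")
args = parser.parse_args()
```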
---

## Output

The script generates three plots:

### 1. `losses.png` - Loss Comparison
Shows training loss curves for all runs overlaid. Useful for verifying that optimizations don't degrade convergence.

### 2. `losses_diff.png` - Loss Difference
Shows the difference in loss between each compared run and the baseline. Values near zero indicate equivalent training quality.

### 3. `step_time.png` - Step Time Comparison
Shows per-step execution time for all runs. Lower is better.
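
To make the loss-difference plot concrete: each compared run's per-step losses are subtracted from the baseline's. A minimal sketch, assuming runs are truncated to their common length; the actual script's plotting details may differ:

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_loss_diff(baseline_losses, compare_losses, label, out_path="losses_diff.png"):
    # Compare only the steps both runs completed.
    n = min(len(baseline_losses), len(compare_losses))
    diff = np.asarray(compare_losses[:n]) - np.asarray(baseline_losses[:n])
    plt.plot(range(1, n + 1), diff, label=label)
    plt.axhline(0.0, color="gray", linestyle="--")  # zero = identical training quality
    plt.xlabel("Step")
    plt.ylabel("Loss difference vs. baseline")
    plt.legend()
    plt.savefig(out_path, dpi=150)
```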

### Console Output

The script also prints summary statistics:

```
SUMMARY STATISTICS
============================================================

Mean Step Times:
  baseline: 703.14 ms (std: 12.34 ms)
  optimized: 650.22 ms (std: 10.56 ms)

Speedup relative to 'baseline':
  optimized: 1.081x

Final Loss (last 100 steps average):
  baseline: 3.456789
  optimized: 3.456123
```
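
The speedup figure is simply the ratio of mean step times: for the numbers above, 703.14 / 650.22 ≈ 1.081x. A self-contained sketch of the same bookkeeping, including the documented warmup skip (toy data stands in for two parsed logs; names are illustrative):

```python
import numpy as np

def summarize(step_times_ms, losses, warmup_steps=15):
    """Mean/std step time after warmup, plus final loss (last-100-step average)."""
    times = np.asarray(step_times_ms[warmup_steps:], dtype=float)
    return times.mean(), times.std(), float(np.mean(losses[-100:]))

baseline_mean, _, _ = summarize([703.14] * 100, [3.456789] * 200)
optimized_mean, _, _ = summarize([650.22] * 100, [3.456123] * 200)
print(f"Speedup: {baseline_mean / optimized_mean:.3f}x")  # -> Speedup: 1.081x
```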
---

## Example Workflow

### Evaluating a SiLU Kernel Fusion

1. **Run baseline training:**
   ```bash
   ./build/tt-train/sources/examples/nano_gpt/nano_gpt > baseline.txt 2>&1
   ```

2. **Apply your kernel optimization and rebuild:**
   ```bash
   ./build_metal.sh -b Release --build-tt-train
   ```

3. **Run optimized training:**
   ```bash
   ./build/tt-train/sources/examples/nano_gpt/nano_gpt > optimized.txt 2>&1
   ```

4. **Compare results:**
   ```bash
   python tt-train/scripts/plot_training_comparison.py \
       --baseline baseline.txt \
       --compare optimized.txt \
       --labels "Baseline" "SiLU Fused" \
       --output-dir ./silu_comparison \
       --title-prefix "SiLU Fusion: "
   ```

5. **Review:**
   - Check `losses.png` - curves should overlap closely
   - Check `losses_diff.png` - differences should be near zero
   - Check `step_time.png` - optimized should be faster
   - Review printed speedup factor
---

## See Also

- [PROFILER.md](PROFILER.md) - Detailed kernel-level profiling
- [MEMORY_TRACKING.md](MEMORY_TRACKING.md) - Memory usage analysis