This tool analyzes performance traces from Metal operations, providing insights into throughput, bottlenecks, and optimization opportunities.
This tool can be installed from PyPI:
pipx install tt-perf-report
Installing with pipx automatically creates a virtual environment and makes the tt-perf-report command available.
- Build Metal with performance tracing (enabled in the default build):
./build_metal
- Run your test in TT-Metal with the tracy module to capture traces:
python -m tracy -r -p -v -m pytest path/to/test.py
This generates a CSV file containing operation timing data.
Tracy signposts mark specific sections of code for analysis. Add signposts to your Python code:
import tracy
# Mark different sections of your code
tracy.signpost("Compilation pass")
model(input_data)
tracy.signpost("Performance pass")
for _ in range(10):
    model(input_data)
By default, the tool analyzes the ops after the last signpost, which is typically the most relevant section for a performance test (e.g., the final iteration after compilation/warmup).
Common signpost usage:
- --start-signpost NAME: Analyze ops after the specified signpost
- --end-signpost NAME: Analyze ops before the specified signpost
- --ignore-signposts: Analyze the entire trace
- --print-signposts: Prints any signposts within the window defined when using the start/end signpost arguments
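The windowing behavior described above can be sketched in a few lines of Python. This is an illustration only, not the tool's implementation: the `(kind, name)` entry layout and the `select_ops` helper are hypothetical, and it assumes that giving only an end signpost starts the window at the beginning of the trace.

```python
from typing import List, Optional, Tuple

# Hypothetical trace model: each entry is (kind, name), where kind is
# "signpost" or "op". The real trace CSV has its own column layout.
Entry = Tuple[str, str]


def _signpost_index(entries: List[Entry], name: str) -> int:
    """Return the position of the first signpost with the given name."""
    for index, (kind, entry_name) in enumerate(entries):
        if kind == "signpost" and entry_name == name:
            return index
    raise ValueError(f"signpost not found: {name!r}")


def select_ops(
    entries: List[Entry],
    start: Optional[str] = None,
    end: Optional[str] = None,
    ignore_signposts: bool = False,
) -> List[str]:
    """Return the names of the ops inside the selected signpost window."""
    begin, stop = 0, len(entries)
    if not ignore_signposts:
        if start is not None:
            # Ops after the named start signpost
            begin = _signpost_index(entries, start) + 1
        if end is not None:
            # Ops before the named end signpost
            stop = _signpost_index(entries, end)
        if start is None and end is None:
            # Default: ops after the last signpost, if there is one
            positions = [i for i, (kind, _) in enumerate(entries) if kind == "signpost"]
            if positions:
                begin = positions[-1] + 1
    return [name for kind, name in entries[begin:stop] if kind == "op"]
```

With a trace that contains a "Compilation pass" signpost followed by a "Performance pass" signpost, calling `select_ops(entries)` with no arguments returns only the ops recorded after "Performance pass".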
The output of the performance report is a table of operations. Each operation is assigned a unique ID starting from 1. You can re-run the tool with different IDs to focus on specific sections of the trace.
Use --id-range to analyze specific sections:
# Analyze ops 5 through 10
tt-perf-report trace.csv --id-range 5-10
# Analyze from op 31 onwards
tt-perf-report trace.csv --id-range 31-
# Analyze up to op 12
tt-perf-report trace.csv --id-range -12
This is particularly useful for:
- Isolating the decode pass in prefill+decode LLM inference
- Analyzing single transformer layers without embeddings/projections
- Focusing on specific model components
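To make the three range forms concrete, here is a minimal sketch of how a string such as `5-10`, `31-`, or `-12` could be interpreted. The `parse_id_range` and `in_range` helpers are hypothetical and are not taken from the tool's source; the sketch assumes both ends of the range are inclusive.

```python
from typing import Optional, Tuple

Bounds = Tuple[Optional[int], Optional[int]]


def parse_id_range(text: str) -> Bounds:
    """Parse 'START-END', 'START-' or '-END' into (start, end); None means open-ended."""
    start_text, separator, end_text = text.partition("-")
    if not separator:
        raise ValueError(f"expected a range such as '5-10', got {text!r}")
    start = int(start_text) if start_text else None
    end = int(end_text) if end_text else None
    return start, end


def in_range(op_id: int, bounds: Bounds) -> bool:
    """Check whether an op ID falls inside the (assumed inclusive) bounds."""
    start, end = bounds
    return (start is None or op_id >= start) and (end is None or op_id <= end)


print(parse_id_range("5-10"))  # (5, 10)
print(parse_id_range("31-"))   # (31, None)
print(parse_id_range("-12"))   # (None, 12)
```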
- --min-percentage value: Hide ops below specified % of total time (default: 0.5)
- --color/--no-color: Force colored/plain output
- --csv FILENAME: Output the table to CSV format for further analysis or inclusion into automated reporting pipelines
- --no-advice: Show only performance table, skip optimization advice
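As a rough illustration of what a `--min-percentage` cutoff means, the sketch below hides ops whose share of the total time falls below the threshold. The `filter_by_min_percentage` helper and the `(name, time_us)` pairs are hypothetical; how the tool itself defines an op's time and the total is not specified here.

```python
from typing import List, Tuple


def filter_by_min_percentage(
    ops: List[Tuple[str, float]], min_percentage: float = 0.5
) -> List[Tuple[str, float]]:
    """Keep only ops that account for at least min_percentage of the total time."""
    total = sum(time_us for _, time_us in ops)
    if total == 0:
        return []
    return [
        (name, time_us)
        for name, time_us in ops
        if 100.0 * time_us / total >= min_percentage
    ]


# Hypothetical timings in microseconds: "reshape" is 0.4% of the total and is hidden
ops = [("matmul", 990.0), ("reshape", 4.0), ("add", 6.0)]
print(filter_by_min_percentage(ops))
```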
The performance report provides several key metrics for analyzing operation performance:
- Device Time: Time spent executing the operation on device (in microseconds)
- Op-to-op Gap: Time between operations, including host overhead and kernel dispatch (in microseconds)
- Total %: Percentage of total execution time spent on this operation
- Cores: Number of cores used by the operation (max 64 on Wormhole)
- DRAM: Memory bandwidth achieved (in GB/s)
- DRAM %: Percentage of theoretical peak DRAM bandwidth (288 GB/s on Wormhole)
- FLOPs: Compute throughput achieved (in TFLOPs)
- FLOPs %: Percentage of theoretical peak compute for the given math fidelity
- Bound: Performance classification of the operation:
  - DRAM: Memory bandwidth bound (>65% of peak DRAM)
  - FLOP: Compute bound (>65% of peak FLOPs)
  - BOTH: Both memory and compute bound
  - SLOW: Neither memory nor compute bound
  - HOST: Operation running on host CPU
- Math Fidelity: Precision configuration used for matrix operations:
  - HiFi4: Highest precision (74 TFLOPs/core)
  - HiFi2: Medium precision (148 TFLOPs/core)
  - LoFi: Lowest precision (262 TFLOPs/core)
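The DRAM % and Bound columns follow from the thresholds listed above, and a small sketch may make the relationship easier to see. The function names below are hypothetical and the logic is reconstructed only from the descriptions in this section (288 GB/s peak on Wormhole, a 65% cutoff), so the tool's actual classification may differ in detail.

```python
PEAK_DRAM_GBPS = 288.0       # Theoretical peak DRAM bandwidth on Wormhole, from the list above
BOUND_THRESHOLD_PCT = 65.0   # Utilization above this counts as "bound"


def dram_percent(achieved_gbps: float) -> float:
    """Achieved DRAM bandwidth as a percentage of the theoretical peak."""
    return 100.0 * achieved_gbps / PEAK_DRAM_GBPS


def classify_bound(dram_pct: float, flops_pct: float, on_host: bool = False) -> str:
    """Map utilization percentages to the Bound labels described above."""
    if on_host:
        return "HOST"
    dram_bound = dram_pct > BOUND_THRESHOLD_PCT
    flop_bound = flops_pct > BOUND_THRESHOLD_PCT
    if dram_bound and flop_bound:
        return "BOTH"
    if dram_bound:
        return "DRAM"
    if flop_bound:
        return "FLOP"
    return "SLOW"


print(dram_percent(144.0))         # 50.0
print(classify_bound(70.0, 10.0))  # DRAM
print(classify_bound(10.0, 10.0))  # SLOW
```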
The tool automatically highlights potential optimization opportunities:
- Red op-to-op times indicate high host or kernel launch overhead (>6.5μs)
- Red core counts indicate underutilization (<10 cores)
- Green metrics indicate good utilization of available resources
- Yellow metrics indicate room for optimization
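The two red thresholds above are simple comparisons, sketched here for reference. The `red_flags` helper is hypothetical; the cutoffs for green and yellow are not stated in this document, so the sketch covers only the red cases.

```python
from typing import List

GAP_RED_THRESHOLD_US = 6.5   # Op-to-op gaps above this are shown in red
CORES_RED_THRESHOLD = 10     # Core counts below this are shown in red


def red_flags(op_to_op_gap_us: float, cores: int) -> List[str]:
    """Return which metrics of an op would be highlighted in red."""
    flags = []
    if op_to_op_gap_us > GAP_RED_THRESHOLD_US:
        flags.append("op-to-op gap")
    if cores < CORES_RED_THRESHOLD:
        flags.append("cores")
    return flags


print(red_flags(7.0, 64))  # ['op-to-op gap']
print(red_flags(2.0, 8))   # ['cores']
```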
Note: trace.csv in the examples below refers to your input CSV file (the performance trace you want to analyze).
Typical use:
tt-perf-report trace.csv
Merge traces captured on multiple machines from the same workload run:
tt-perf-report trace_host0.csv trace_host1.csv trace_host2.csv
Build a table of all ops with no advice:
tt-perf-report trace.csv --no-advice
View ops 100-200 with advice:
tt-perf-report trace.csv --id-range 100-200
Export the table of ops and columns as a CSV file:
tt-perf-report trace.csv --csv my_report.csv
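Once the table has been exported, it can be loaded with Python's standard csv module for further processing. The sketch below makes no assumption about the column names in the exported file and simply reports whatever header it finds; `summarize_report` is a hypothetical helper, not part of tt-perf-report.

```python
import csv
from typing import List, Tuple


def summarize_report(path: str) -> Tuple[List[str], int]:
    """Return the column names and the number of data rows in an exported CSV."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        columns = list(reader.fieldnames or [])
    return columns, len(rows)
```

For example, `columns, count = summarize_report("my_report.csv")` gives the header of the exported table, which is a reasonable first step before writing code that depends on specific columns.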