Add automated performance bisect tools for tt-xla and tt-mlir #2303

rpavlovicTT · 2025-11-24T15:27:18Z

This commit introduces three new scripts for automating performance regression investigation through git bisect:

bisect_perf.sh - Standalone bisect script for tt-xla commits
- Tests individual tt-xla commits for performance regressions
- Supports custom benchmark commands, thresholds, and metric patterns
- Handles build failures and benchmark crashes gracefully (exit 125)
- Can be used directly with: git bisect run ./scripts/bisect_perf.sh
bisect_ttmlir_perf.sh - Standalone bisect script for tt-mlir commits
- Tests individual tt-mlir commits within the tt-mlir submodule
- Dynamically modifies CMakeLists.txt to test specific tt-mlir versions
- Supports incremental builds for faster iteration
- Logs all tests to /tmp/bisect_ttmlir_*.log for later analysis
bisect_perf_auto.sh - Master orchestrator (RECOMMENDED)
- Fully automated two-phase bisect: tt-xla first, then tt-mlir if needed
- Automatically detects tt-mlir uplift commits
- Drills down into tt-mlir submodule to find exact regression commit
- Comprehensive logging and final summary report
- Single command operation: ./scripts/bisect_perf_auto.sh -g -b
cmake_fix.patch - CMakeLists.txt patch for incremental builds
- Removes --target tt-alchemist to enable faster incremental builds
- Applied automatically by bisect scripts

All scripts support:

Custom benchmark commands (-c, --command)
Custom performance thresholds (-t, --threshold)
Custom metric extraction patterns (-p, --pattern)
Comprehensive help documentation (-h, --help)
No hardcoded paths - works from any tt-xla location

Default configuration:

Benchmark: resnet model with batch size 8, bfloat16, 128 loops
Threshold: 680 samples/second
Pattern: "Sample per second:\s*\K[0-9.]+"

Example usage:
./scripts/bisect_perf_auto.sh -g 051ebb2 -b HEAD

Successfully tested and used to identify tt-mlir commit a868fa2a9 ([TTIRFusing] Improve ScaledSumToMeanPattern) as the root cause of a performance regression in resnet (from ~700 to ~574 samples/sec).

🤖 Generated with Claude Code

jameszianxuTT · 2025-11-24T15:40:09Z

Some non-blocking comments:

I think these kind of helper scripts tend to randomly break over time since the surface area for breaking changes is so large, any thoughts about how we can control for that and maintain these tools? Maybe something that can run in weekly CI?
For bisecting into tt-mlir, we have other helper scripts in uplifts tooling to "unroll" tt-mlir commits one by one into tt-xla, so we can just run a bisect in tt-xla without entering & manipulating tt-mlir in-tree. We can even do this recursive uplifting of individual metal commits into tt-xla, but there are some complexities involved. Your way works too!

github-actions · 2025-11-24T15:40:56Z

	Tests	Passed ✅	Skipped ⚠️	Failed
TT-XLA Tests	10 ran	6 passed	4 skipped	0 failed

Test	Result
No test annotations available

View TT-XLA Tests

odjuricicTT · 2025-11-24T15:59:03Z

scripts/cmake_fix.patch

+         BINARY_DIR ${TTMLIR_BUILD_DIR}
+
+         BUILD_COMMAND env ${WITH_METAL_RUNTIME_ROOT_SET} ${CMAKE_COMMAND} --build <BINARY_DIR>
+-        COMMAND ${CMAKE_COMMAND} --build <BINARY_DIR> --target tt-alchemist


@svuckovicTT Can this patch just be part of the changes?

Synced offline, it's okay, will remove soon as there will be no need for it.

rpavlovicTT · 2025-11-24T16:23:07Z

Some non-blocking comments:

I think these kind of helper scripts tend to randomly break over time since the surface area for breaking changes is so large, any thoughts about how we can control for that and maintain these tools? Maybe something that can run in weekly CI?

For bisecting into tt-mlir, we have other helper scripts in uplifts tooling to "unroll" tt-mlir commits one by one into tt-xla, so we can just run a bisect in tt-xla without entering & manipulating tt-mlir in-tree. We can even do this recursive uplifting of individual metal commits into tt-xla, but there are some complexities involved. Your way works too!

Disclaimer - the script was born out of necessity for me to automate time consuming manual work. If other teammates start using it, we can make it better and integrate into Claude directly (via plugins).

I agree this could be hard to maintain. We can definitely create some workflow just to test it over time.
I guess it depends on how you look at it. To me, it makes sense to do bisect level by level. We can add tt-metal recursively too if needed. Not sure what is the advantage of unrolling tt-mlir commits into tt-xla, other than not having to manipulate tt-mlir repo?

Btw thanks for sharing your repo, I wasn't aware of it. If you think some of those scripts are ready for use for other developers who work on uplifting, please share!

AleksKnezevic · 2025-11-24T16:45:19Z

@jameszianxuTT and @brataTT, FYI since you already had some bisect scripts

brataTT

Great stuff. Learned some things from it.

Wasn't aware we could use git bisect view that way.
In uplift scripts, we use it to get a status/summary of long running bisects instead.
Separating logs by commits instead of commands is also a good idea.
I'll check if I can steal these ideas to improve metal uplift scripts too.

Concerns

I don't see any clean-build step.
That'll make running this really fast 👍 .
But if there are unreliable results at any point, it may be useful to add that step.
With that said, I have no specific examples of when this can happen or if this is a valid concern.
Patches: If mlir introduces breaking changes that required changes in tt-xla uplift PR, there will be a larger number of untestable changes.
In metal uplift, we solve this by keeping separate downstream fix patches per breaking upstream commit.
Then, we apply those patches starting from the breaking change onwards.
The unrolling @jameszianxuTT mentioned simplifies this.

scripts/bisect_perf_auto.sh

brataTT · 2025-11-24T17:37:41Z

scripts/bisect_perf_auto.sh

+# Reset bisect before checking files
+git bisect reset
+
+# Check if this is a tt-mlir uplift commit


There might be some AI slop from claude in this part.

The only reliable source of truth for detecting an uplift commit here should be the second case (checking TT_MLIR_VERSION), and even then, changes around that line will also falsely detect an uplift commit since git diff is used.

Better approach would be to extract the mlir version from BAD and BAD^ individually, then compare those strings for equality upto the length of shortest string (to compare between short SHA and long SHA).

Case 1 assumes people use the templated uplift PR (not always true). Case 3 is not how we use tt-mlir, but forge-model repo, so it tries to detect tt-forge-models uplift as tt-mlir uplift. Neither is a huge deal due to the check you added after for if good_mlir = bad_mlir

You are right. I'll rewrite this check to explicitly compare tt-mlir SHAs.

rpavlovicTT · 2025-11-25T14:24:24Z

Great stuff. Learned some things from it.

Wasn't aware we could use git bisect view that way. In uplift scripts, we use it to get a status/summary of long running bisects instead. Separating logs by commits instead of commands is also a good idea. I'll check if I can steal these ideas to improve metal uplift scripts too.

Concerns

I don't see any clean-build step.
That'll make running this really fast 👍 .
But if there are unreliable results at any point, it may be useful to add that step.
With that said, I have no specific examples of when this can happen or if this is a valid concern.

Patches: If mlir introduces breaking changes that required changes in tt-xla uplift PR, there will be a larger number of untestable changes.
In metal uplift, we solve this by keeping separate downstream fix patches per breaking upstream commit.
Then, we apply those patches starting from the breaking change onwards.
The unrolling @jameszianxuTT mentioned simplifies this.

About your concerns --

I intentionally let it build incrementally just to keep progress quicker. Ideally, we should fall back to clean build if build fails.
That is a great pointer. This seems to me as very difficult problem. As I understand it, we could have a tt-mlir uplift commit that contains some tt-xla code changes, and we don't know which tt-mlir commits contained in the uplift work with these changes and which don't.
I'm curious about separate downstream fix patches per breaking upstream commit. -- how do you create these patches?

Anyway as I said, this is a demo version I made quickly with Claude just to help us out.

brataTT · 2025-11-25T14:35:17Z

That is a great pointer. This seems to me as very difficult problem. As I understand it, we could have a tt-mlir uplift commit that contains some tt-xla code changes, and we don't know which tt-mlir commits contained in the uplift work with these changes and which don't.
I'm curious about separate downstream fix patches per breaking upstream commit. -- how do you create these patches?

Anyway as I said, this is a demo version I made quickly with Claude just to help us out.

Agreed, this should not block your PR, more of an improvement.
Creating the patches unfortunately is a manual job.
For metal uplift, we keep a list of them tagged with after metal commit ... in the PR branch([Example PR with patches]).
Idk if there was a need to do this for mlir -> xla uplifts before this script, or if anyone does it for frontend uplifts at all.

This commit introduces three new scripts for automating performance regression investigation through git bisect: 1. **bisect_perf.sh** - Standalone bisect script for tt-xla commits - Tests individual tt-xla commits for performance regressions - Supports custom benchmark commands, thresholds, and metric patterns - Handles build failures and benchmark crashes gracefully (exit 125) - Can be used directly with: git bisect run ./scripts/bisect_perf.sh 2. **bisect_ttmlir_perf.sh** - Standalone bisect script for tt-mlir commits - Tests individual tt-mlir commits within the tt-mlir submodule - Dynamically modifies CMakeLists.txt to test specific tt-mlir versions - Supports incremental builds for faster iteration - Logs all tests to /tmp/bisect_ttmlir_*.log for later analysis 3. **bisect_perf_auto.sh** - Master orchestrator (RECOMMENDED) - Fully automated two-phase bisect: tt-xla first, then tt-mlir if needed - Automatically detects tt-mlir uplift commits - Drills down into tt-mlir submodule to find exact regression commit - Comprehensive logging and final summary report - Single command operation: ./scripts/bisect_perf_auto.sh -g <good> -b <bad> 4. **cmake_fix.patch** - CMakeLists.txt patch for incremental builds - Removes --target tt-alchemist to enable faster incremental builds - Applied automatically by bisect scripts All scripts support: - Custom benchmark commands (-c, --command) - Custom performance thresholds (-t, --threshold) - Custom metric extraction patterns (-p, --pattern) - Comprehensive help documentation (-h, --help) - No hardcoded paths - works from any tt-xla location Default configuration: - Benchmark: resnet model with batch size 8, bfloat16, 128 loops - Threshold: 680 samples/second - Pattern: "Sample per second:\s*\K[0-9.]+" Example usage: ./scripts/bisect_perf_auto.sh -g 051ebb2 -b HEAD Successfully tested and used to identify tt-mlir commit a868fa2a9 ([TTIRFusing] Improve ScaledSumToMeanPattern) as the root cause of a performance regression in resnet (from ~700 to ~574 samples/sec). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

rpavlovicTT · 2025-11-25T18:12:45Z

Ping @AleksKnezevic @mrakitaTT @nvukobratTT as code owners to take a look

rpavlovicTT requested review from AleksKnezevic, mrakitaTT and nvukobratTT as code owners November 24, 2025 15:27

rpavlovicTT requested a review from odjuricicTT November 24, 2025 15:41

odjuricicTT approved these changes Nov 24, 2025

View reviewed changes

rpavlovicTT force-pushed the rpavlovic/perf-bisect-tools branch from 0e40410 to a715f90 Compare November 24, 2025 17:19

rpavlovicTT mentioned this pull request Nov 24, 2025

Incremental tt-mlir build doesn't work #1670

Open

brataTT reviewed Nov 24, 2025

View reviewed changes

rpavlovicTT force-pushed the rpavlovic/perf-bisect-tools branch from a715f90 to 2060e97 Compare November 25, 2025 14:25

rpavlovicTT force-pushed the rpavlovic/perf-bisect-tools branch from 2060e97 to 52c0fa2 Compare November 25, 2025 14:43

Add automated performance bisect tools for tt-xla and tt-mlir #2303

Are you sure you want to change the base?

Add automated performance bisect tools for tt-xla and tt-mlir #2303

Conversation

rpavlovicTT commented Nov 24, 2025

Uh oh!

jameszianxuTT commented Nov 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Nov 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

odjuricicTT Nov 24, 2025

Choose a reason for hiding this comment

Uh oh!

svuckovicTT Nov 24, 2025

Choose a reason for hiding this comment

Uh oh!

rpavlovicTT commented Nov 24, 2025

Uh oh!

AleksKnezevic commented Nov 24, 2025

Uh oh!

brataTT left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

brataTT Nov 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rpavlovicTT Nov 25, 2025

Choose a reason for hiding this comment

Uh oh!

rpavlovicTT commented Nov 25, 2025

Uh oh!

brataTT commented Nov 25, 2025

Uh oh!

rpavlovicTT commented Nov 25, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

jameszianxuTT commented Nov 24, 2025 •

edited

Loading

github-actions bot commented Nov 24, 2025 •

edited

Loading

brataTT Nov 24, 2025 •

edited

Loading