Skip to content

Conversation

@rpavlovicTT
Copy link
Contributor

This commit introduces three new scripts for automating performance regression investigation through git bisect:

  1. bisect_perf.sh - Standalone bisect script for tt-xla commits

    • Tests individual tt-xla commits for performance regressions
    • Supports custom benchmark commands, thresholds, and metric patterns
    • Handles build failures and benchmark crashes gracefully (exit 125)
    • Can be used directly with: git bisect run ./scripts/bisect_perf.sh
  2. bisect_ttmlir_perf.sh - Standalone bisect script for tt-mlir commits

    • Tests individual tt-mlir commits within the tt-mlir submodule
    • Dynamically modifies CMakeLists.txt to test specific tt-mlir versions
    • Supports incremental builds for faster iteration
    • Logs all tests to /tmp/bisect_ttmlir_*.log for later analysis
  3. bisect_perf_auto.sh - Master orchestrator (RECOMMENDED)

    • Fully automated two-phase bisect: tt-xla first, then tt-mlir if needed
    • Automatically detects tt-mlir uplift commits
    • Drills down into tt-mlir submodule to find exact regression commit
    • Comprehensive logging and final summary report
    • Single command operation: ./scripts/bisect_perf_auto.sh -g -b
  4. cmake_fix.patch - CMakeLists.txt patch for incremental builds

    • Removes --target tt-alchemist to enable faster incremental builds
    • Applied automatically by bisect scripts

All scripts support:

  • Custom benchmark commands (-c, --command)
  • Custom performance thresholds (-t, --threshold)
  • Custom metric extraction patterns (-p, --pattern)
  • Comprehensive help documentation (-h, --help)
  • No hardcoded paths - works from any tt-xla location

Default configuration:

  • Benchmark: resnet model with batch size 8, bfloat16, 128 loops
  • Threshold: 680 samples/second
  • Pattern: "Sample per second:\s*\K[0-9.]+"

Example usage:
./scripts/bisect_perf_auto.sh -g 051ebb2 -b HEAD

Successfully tested and used to identify tt-mlir commit a868fa2a9 ([TTIRFusing] Improve ScaledSumToMeanPattern) as the root cause of a performance regression in resnet (from ~700 to ~574 samples/sec).

🤖 Generated with Claude Code

@jameszianxuTT
Copy link
Contributor

jameszianxuTT commented Nov 24, 2025

Some non-blocking comments:

  1. I think these kind of helper scripts tend to randomly break over time since the surface area for breaking changes is so large, any thoughts about how we can control for that and maintain these tools? Maybe something that can run in weekly CI?
  2. For bisecting into tt-mlir, we have other helper scripts in uplifts tooling to "unroll" tt-mlir commits one by one into tt-xla, so we can just run a bisect in tt-xla without entering & manipulating tt-mlir in-tree. We can even do this recursive uplifting of individual metal commits into tt-xla, but there are some complexities involved. Your way works too!
image

@github-actions
Copy link
Contributor

github-actions bot commented Nov 24, 2025

TestsPassed ✅Skipped ⚠️Failed
TT-XLA Tests10 ran6 passed4 skipped0 failed
TestResult
No test annotations available

BINARY_DIR ${TTMLIR_BUILD_DIR}

BUILD_COMMAND env ${WITH_METAL_RUNTIME_ROOT_SET} ${CMAKE_COMMAND} --build <BINARY_DIR>
- COMMAND ${CMAKE_COMMAND} --build <BINARY_DIR> --target tt-alchemist
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@svuckovicTT Can this patch just be part of the changes?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Synced offline, it's okay, will remove soon as there will be no need for it.

@rpavlovicTT
Copy link
Contributor Author

Some non-blocking comments:

  1. I think these kind of helper scripts tend to randomly break over time since the surface area for breaking changes is so large, any thoughts about how we can control for that and maintain these tools? Maybe something that can run in weekly CI?
  2. For bisecting into tt-mlir, we have other helper scripts in uplifts tooling to "unroll" tt-mlir commits one by one into tt-xla, so we can just run a bisect in tt-xla without entering & manipulating tt-mlir in-tree. We can even do this recursive uplifting of individual metal commits into tt-xla, but there are some complexities involved. Your way works too!
image

Disclaimer - the script was born out of necessity for me to automate time consuming manual work. If other teammates start using it, we can make it better and integrate into Claude directly (via plugins).

  1. I agree this could be hard to maintain. We can definitely create some workflow just to test it over time.
  2. I guess it depends on how you look at it. To me, it makes sense to do bisect level by level. We can add tt-metal recursively too if needed. Not sure what is the advantage of unrolling tt-mlir commits into tt-xla, other than not having to manipulate tt-mlir repo?

Btw thanks for sharing your repo, I wasn't aware of it. If you think some of those scripts are ready for use for other developers who work on uplifting, please share!

@AleksKnezevic
Copy link
Contributor

@jameszianxuTT and @brataTT, FYI since you already had some bisect scripts

Copy link
Contributor

@brataTT brataTT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff. Learned some things from it.

Wasn't aware we could use git bisect view that way.
In uplift scripts, we use it to get a status/summary of long running bisects instead.
Separating logs by commits instead of commands is also a good idea.
I'll check if I can steal these ideas to improve metal uplift scripts too.

Concerns

  • I don't see any clean-build step.
    That'll make running this really fast 👍 .
    But if there are unreliable results at any point, it may be useful to add that step.
    With that said, I have no specific examples of when this can happen or if this is a valid concern.
  • Patches: If mlir introduces breaking changes that required changes in tt-xla uplift PR, there will be a larger number of untestable changes.
    In metal uplift, we solve this by keeping separate downstream fix patches per breaking upstream commit.
    Then, we apply those patches starting from the breaking change onwards.
    The unrolling @jameszianxuTT mentioned simplifies this.

# Reset bisect before checking files
git bisect reset

# Check if this is a tt-mlir uplift commit
Copy link
Contributor

@brataTT brataTT Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be some AI slop from claude in this part.

The only reliable source of truth for detecting an uplift commit here should be the second case (checking TT_MLIR_VERSION), and even then, changes around that line will also falsely detect an uplift commit since git diff is used.

Better approach would be to extract the mlir version from BAD and BAD^ individually, then compare those strings for equality upto the length of shortest string (to compare between short SHA and long SHA).

Case 1 assumes people use the templated uplift PR (not always true). Case 3 is not how we use tt-mlir, but forge-model repo, so it tries to detect tt-forge-models uplift as tt-mlir uplift. Neither is a huge deal due to the check you added after for if good_mlir = bad_mlir

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. I'll rewrite this check to explicitly compare tt-mlir SHAs.

@rpavlovicTT
Copy link
Contributor Author

Great stuff. Learned some things from it.

Wasn't aware we could use git bisect view that way. In uplift scripts, we use it to get a status/summary of long running bisects instead. Separating logs by commits instead of commands is also a good idea. I'll check if I can steal these ideas to improve metal uplift scripts too.

Concerns

  • I don't see any clean-build step.
    That'll make running this really fast 👍 .
    But if there are unreliable results at any point, it may be useful to add that step.
    With that said, I have no specific examples of when this can happen or if this is a valid concern.
  • Patches: If mlir introduces breaking changes that required changes in tt-xla uplift PR, there will be a larger number of untestable changes.
    In metal uplift, we solve this by keeping separate downstream fix patches per breaking upstream commit.
    Then, we apply those patches starting from the breaking change onwards.
    The unrolling @jameszianxuTT mentioned simplifies this.

About your concerns --

  1. I intentionally let it build incrementally just to keep progress quicker. Ideally, we should fall back to clean build if build fails.
  2. That is a great pointer. This seems to me as very difficult problem. As I understand it, we could have a tt-mlir uplift commit that contains some tt-xla code changes, and we don't know which tt-mlir commits contained in the uplift work with these changes and which don't.
    I'm curious about separate downstream fix patches per breaking upstream commit. -- how do you create these patches?

Anyway as I said, this is a demo version I made quickly with Claude just to help us out.

@rpavlovicTT rpavlovicTT force-pushed the rpavlovic/perf-bisect-tools branch from a715f90 to 2060e97 Compare November 25, 2025 14:25
@brataTT
Copy link
Contributor

brataTT commented Nov 25, 2025

  1. That is a great pointer. This seems to me as very difficult problem. As I understand it, we could have a tt-mlir uplift commit that contains some tt-xla code changes, and we don't know which tt-mlir commits contained in the uplift work with these changes and which don't.
    I'm curious about separate downstream fix patches per breaking upstream commit. -- how do you create these patches?

Anyway as I said, this is a demo version I made quickly with Claude just to help us out.

Agreed, this should not block your PR, more of an improvement.
Creating the patches unfortunately is a manual job.
For metal uplift, we keep a list of them tagged with after metal commit ... in the PR branch([Example PR with patches]).
Idk if there was a need to do this for mlir -> xla uplifts before this script, or if anyone does it for frontend uplifts at all.

This commit introduces three new scripts for automating performance regression
investigation through git bisect:

1. **bisect_perf.sh** - Standalone bisect script for tt-xla commits
   - Tests individual tt-xla commits for performance regressions
   - Supports custom benchmark commands, thresholds, and metric patterns
   - Handles build failures and benchmark crashes gracefully (exit 125)
   - Can be used directly with: git bisect run ./scripts/bisect_perf.sh

2. **bisect_ttmlir_perf.sh** - Standalone bisect script for tt-mlir commits
   - Tests individual tt-mlir commits within the tt-mlir submodule
   - Dynamically modifies CMakeLists.txt to test specific tt-mlir versions
   - Supports incremental builds for faster iteration
   - Logs all tests to /tmp/bisect_ttmlir_*.log for later analysis

3. **bisect_perf_auto.sh** - Master orchestrator (RECOMMENDED)
   - Fully automated two-phase bisect: tt-xla first, then tt-mlir if needed
   - Automatically detects tt-mlir uplift commits
   - Drills down into tt-mlir submodule to find exact regression commit
   - Comprehensive logging and final summary report
   - Single command operation: ./scripts/bisect_perf_auto.sh -g <good> -b <bad>

4. **cmake_fix.patch** - CMakeLists.txt patch for incremental builds
   - Removes --target tt-alchemist to enable faster incremental builds
   - Applied automatically by bisect scripts

All scripts support:
- Custom benchmark commands (-c, --command)
- Custom performance thresholds (-t, --threshold)
- Custom metric extraction patterns (-p, --pattern)
- Comprehensive help documentation (-h, --help)
- No hardcoded paths - works from any tt-xla location

Default configuration:
- Benchmark: resnet model with batch size 8, bfloat16, 128 loops
- Threshold: 680 samples/second
- Pattern: "Sample per second:\s*\K[0-9.]+"

Example usage:
  ./scripts/bisect_perf_auto.sh -g 051ebb2 -b HEAD

Successfully tested and used to identify tt-mlir commit a868fa2a9
([TTIRFusing] Improve ScaledSumToMeanPattern) as the root cause of
a performance regression in resnet (from ~700 to ~574 samples/sec).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@rpavlovicTT rpavlovicTT force-pushed the rpavlovic/perf-bisect-tools branch from 2060e97 to 52c0fa2 Compare November 25, 2025 14:43
@rpavlovicTT
Copy link
Contributor Author

Ping @AleksKnezevic @mrakitaTT @nvukobratTT as code owners to take a look

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants