Improve BOO benchmark stability for regression detection

BOO benchmark results can be noisy (deviations of up to ~30us have been observed), which makes automated regression bisection unreliable: reruns with the default iteration count sometimes produce times within the threshold even when a real regression exists. Execution times were also reported to fall into two "modes" (one fast, one slow).

@Max191 has a [branch](https://github.com/iree-org/iree-turbine/compare/main...Max191:iree-turbine:more-stable-benchmarks) with two improvements:

1. **`--iter-sleep` flag**: Adds a device sync + configurable sleep between iterations, which reduces variance from thermal/power state effects.
2. **Stddev/min/max stats for multi-dispatch runs**: Currently these stats are only available for single-dispatch cases. The branch adds them for multi-dispatch by reporting stats from the longest-running dispatch, which gives a useful noise indicator for filtering bad runs.
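
To make the two ideas above concrete, here is a minimal Python sketch. It is not the code from the branch: `time_iterations`, `longest_dispatch_stats`, the `sync` callback, and the `iter_sleep_s` parameter are hypothetical names for illustration, and "longest-running dispatch" is assumed to mean the dispatch with the largest mean time.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_iterations(
    run_once: Callable[[], None],
    iterations: int,
    iter_sleep_s: float = 0.0,
    sync: Callable[[], None] = lambda: None,  # hypothetical device-sync hook
) -> List[float]:
    """Time `run_once` repeatedly, syncing and optionally sleeping between iterations."""
    times_us = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()
        sync()  # wait for outstanding work before stopping the clock
        times_us.append((time.perf_counter() - start) * 1e6)
        if iter_sleep_s > 0:
            time.sleep(iter_sleep_s)  # idle gap between iterations
    return times_us


def longest_dispatch_stats(per_dispatch_us: Dict[str, List[float]]) -> Dict[str, object]:
    """Return mean/stddev/min/max for the dispatch with the largest mean time.

    Assumption: "longest-running" is judged by mean time per dispatch.
    """
    name = max(per_dispatch_us, key=lambda k: statistics.mean(per_dispatch_us[k]))
    samples = per_dispatch_us[name]
    return {
        "dispatch": name,
        "mean": statistics.mean(samples),
        "stddev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
        "min": min(samples),
        "max": max(samples),
    }
```

A caller could then discard or rerun a measurement whose `stddev` is large relative to its `mean`, which is the "noise indicator" use described in item 2.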

Plan:
- [ ] Investigate the noise: reproduce the bimodal execution time distribution and characterize the two modes
- [ ] Test whether the proposed changes address the issue
- [ ] Land the changes (or an equivalent fix) based on findings
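
For the first item, one simple way to check a set of per-iteration timings for two modes is to split the sorted samples at their largest gap and compare the two groups. The sketch below is only a heuristic under stated assumptions: the function names and both thresholds (`min_separation_us`, `min_fraction`) are hypothetical and would need tuning against real BOO data.

```python
import statistics
from typing import List, Tuple


def split_two_modes(times_us: List[float]) -> Tuple[List[float], List[float]]:
    """Split samples into (fast, slow) groups at the largest gap between sorted values."""
    ordered = sorted(times_us)
    if len(ordered) < 2:
        raise ValueError("need at least two samples")
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]


def looks_bimodal(
    times_us: List[float],
    min_separation_us: float = 10.0,  # hypothetical threshold
    min_fraction: float = 0.1,  # hypothetical threshold
) -> bool:
    """Heuristic: two well-separated groups, each holding a meaningful share of samples."""
    fast, slow = split_two_modes(times_us)
    separation = statistics.mean(slow) - statistics.mean(fast)
    smaller_share = min(len(fast), len(slow)) / len(times_us)
    return separation >= min_separation_us and smaller_share >= min_fraction
```

Reporting the mean and sample count of each group, with and without the sleep between iterations, would give a direct way to judge whether the proposed changes remove the slow mode.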

Context: https://xilinx.slack.com/archives/C08JKR35LRY/p1772034070912979


Improve BOO benchmark stability for regression detection #2842
