[Do not Review] Add options to skip commands with iree-boo-driver by pravg-amd · Pull Request #1317 · iree-org/iree-turbine

pravg-amd · 2026-03-03T18:58:49Z

- Add an option to specify commands to skip from the list
- Add a timeout option to set a per-command time limit
- Temporary fix for the bfloat16 issue in PyTorch: cast to float32 when comparing numerics

Work around PyTorch/ROCm bug where convolution_backward crashes with bfloat16/float16 on GPU by casting to float32 for PyTorch reference.

The issue occurs when running numeric verification with the --verify-numerics flag on operations that use bfloat16 or float16. PyTorch's convolution_backward operation crashes with a floating-point exception (exit code 136, i.e. 128 + 8 for SIGFPE) on ROCm/GPU for these dtypes.

This fix detects half-precision inputs and temporarily casts them to float32 when running the PyTorch GPU reference, then casts the results back to the original dtype for comparison. It is applied in both collect_error_samples() and run_structured_test(). Tested with bfloat16, float16, and float32; all now work correctly.
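The cast-and-restore step described above can be sketched as follows. This is not the code from the PR: the helper name `run_reference_upcast` and its parameters are hypothetical, and the function is written against any tensor-like object exposing `.dtype` and `.to(dtype)` (which `torch.Tensor` does) so that it stays free of a PyTorch dependency. It also assumes the reference returns a tuple that may contain `None` entries, as `convolution_backward` does for gradients that were not requested.

```python
from typing import Any, Callable, Sequence, Tuple


def run_reference_upcast(
    ref_fn: Callable[..., Sequence[Any]],
    inputs: Sequence[Any],
    half_dtypes: Tuple[Any, ...],
    wide_dtype: Any,
) -> Tuple[Any, ...]:
    """Run ref_fn with half-precision inputs widened, then narrow the outputs.

    Hypothetical helper: inputs without a `.dtype` attribute (strides,
    padding lists, flags) are passed through untouched.
    """

    def is_half(x: Any) -> bool:
        return getattr(x, "dtype", None) in half_dtypes

    # Remember the first half-precision dtype so results can be cast back.
    # Assumption: all half-precision inputs share one dtype.
    orig_dtype = next((x.dtype for x in inputs if is_half(x)), None)
    if orig_dtype is None:
        # Nothing to work around; call the reference directly.
        return tuple(ref_fn(*inputs))

    widened = [x.to(wide_dtype) if is_half(x) else x for x in inputs]
    outputs = ref_fn(*widened)
    # Cast results back so the comparison happens in the original dtype.
    return tuple(o.to(orig_dtype) if o is not None else None for o in outputs)
```

With PyTorch, a caller would presumably pass `half_dtypes=(torch.bfloat16, torch.float16)` and `wide_dtype=torch.float32`. Because the reference result is rounded back to the original dtype, any comparison tolerances chosen for half precision still apply.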

Add support for filtering commands from the main commands file by providing a skip-commands-file that lists commands to exclude. This allows users to skip specific test cases without modifying the main commands file.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
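As a rough illustration of the filtering described above, the sketch below removes every command that also appears in a skip list. The function names are hypothetical, and the file conventions are assumptions rather than documented driver behaviour: one command per line, blank lines and `#` comment lines ignored, and matching by exact text after collapsing whitespace.

```python
from typing import Iterable, List


def parse_command_lines(text: str) -> List[str]:
    """Split file contents into commands.

    Assumed format: one command per line; blank lines and lines starting
    with '#' are ignored; runs of whitespace are collapsed.
    """
    commands = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            commands.append(" ".join(stripped.split()))
    return commands


def filter_commands(commands: Iterable[str], skip: Iterable[str]) -> List[str]:
    """Return the commands not listed in skip, preserving their order."""
    skip_set = set(skip)
    return [cmd for cmd in commands if cmd not in skip_set]
```

Exact matching keeps the behaviour predictable: a skip entry removes only the command it spells out, never a command that merely shares a prefix with it.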

This commit adds timeout functionality to iree-boo-driver to prevent commands from running indefinitely.

Key changes:
- Add --timeout argument to specify timeout in seconds for each command
- Implement _call_with_timeout() using multiprocessing to enforce timeouts
- Extract command execution logic into _execute_single_command() function
- Wrap command execution with timeout handler in main loop
- Gracefully handle timeout errors and record them in CSV output

When a command times out, it is terminated and marked as "timeout" in the CSV output, allowing the driver to continue with remaining commands.

This is useful for:
- Detecting hung or infinite-loop kernels
- Setting time budgets for large benchmark runs
- Preventing single problematic configs from blocking entire test suites

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
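One way a per-command timeout can be enforced with `multiprocessing` is sketched below. It is not the driver's `_call_with_timeout()`, whose body is not shown here. The names `call_with_timeout` and `_worker`, the `(status, payload)` return shape, and the `start_method` parameter are all assumptions made for illustration.

```python
import multiprocessing as mp
from typing import Any, Callable, Optional, Tuple


def _worker(conn, fn: Callable[..., Any], args: tuple) -> None:
    """Run fn in the child process and send back a (status, payload) pair."""
    try:
        conn.send(("ok", fn(*args)))
    except Exception as exc:  # report failures instead of dying silently
        conn.send(("error", repr(exc)))
    finally:
        conn.close()


def call_with_timeout(
    fn: Callable[..., Any],
    args: tuple = (),
    timeout: Optional[float] = None,
    start_method: str = "spawn",
) -> Tuple[str, Any]:
    """Run fn(*args) in a separate process, killing it after `timeout` seconds.

    Returns ("ok", result), ("error", message) or ("timeout", None).
    With "spawn", fn and args must be picklable and the calling script
    needs an `if __name__ == "__main__":` guard.
    """
    ctx = mp.get_context(start_method)
    recv_end, send_end = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_worker, args=(send_end, fn, args))
    proc.start()
    send_end.close()  # parent keeps only the receiving end
    try:
        if recv_end.poll(timeout):
            try:
                return recv_end.recv()
            except EOFError:
                # Child exited (for example on a signal) before sending.
                return ("error", "worker exited without a result")
        proc.kill()  # still running after the deadline
        return ("timeout", None)
    finally:
        proc.join()
        recv_end.close()
```

A main loop could call this once per command and write the returned status into its CSV row, so a `"timeout"` entry is recorded and the loop moves on to the next command.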

…river

Changes:
- Use multiprocessing 'spawn' method instead of 'fork' to avoid CUDA re-initialization errors in subprocesses
- Refactor _execute_single_command to create unpicklable objects (ArgumentParser, torch.device) inside subprocess for spawn compatibility
- Improve timeout handling: use SIGKILL immediately instead of SIGTERM to handle stuck GPU operations more aggressively
- Add GPU state cleanup after timeout to prevent issues in subsequent tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
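The spawn-compatibility refactor listed above follows a common pattern: pass only plain, picklable data to the child and build everything else inside it. The sketch below illustrates that pattern with hypothetical names (`execute_single_command`, `run_spawned`) and invented flags (`--device`, `--iters`); none of these are taken from the driver.

```python
import argparse
import multiprocessing as mp
from typing import Dict, List, Optional


def execute_single_command(argv: List[str]) -> Dict[str, object]:
    """Worker entry point: receives plain strings, builds its own parser.

    Only `argv` (a list of str) crosses the process boundary, so nothing
    unpicklable has to be sent under the 'spawn' start method.
    """
    parser = argparse.ArgumentParser(prog="worker-sketch")
    parser.add_argument("--device", default="cpu")  # hypothetical flag
    parser.add_argument("--iters", type=int, default=1)  # hypothetical flag
    args = parser.parse_args(argv)
    # A real worker would create its device object here, inside the child.
    return {"device": args.device, "iters": args.iters}


def run_spawned(argv: List[str], timeout: Optional[float]) -> str:
    """Run the worker in a spawned process and hard-kill it on timeout.

    Callers need an `if __name__ == "__main__":` guard when using 'spawn'.
    """
    ctx = mp.get_context("spawn")
    proc = ctx.Process(target=execute_single_command, args=(argv,))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.kill()  # SIGKILL on POSIX; cannot be caught or ignored
        proc.join()
        return "timeout"
    return "ok" if proc.exitcode == 0 else "error"
```

`Process.kill()` is used rather than `Process.terminate()` because SIGTERM can be blocked or left pending by a process stuck in a driver call, whereas SIGKILL cannot be handled by the target process.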

praveen-g-ctt and others added 4 commits February 27, 2026 11:02

pravg-amd wants to merge 4 commits into iree-org:main from pravg-amd:add_command_timeout

2 participants
