[Do not Review] Add options to skip commands with iree-boo-driver #1317
Draft
pravg-amd wants to merge 4 commits into
pravg-amd commented Mar 3, 2026
- Add an option to specify commands to skip from the commands list
- Add a timeout option to set a per-command time limit
- Temporary fix for a bfloat16 issue in PyTorch: cast to float32 when comparing numerics
Work around PyTorch/ROCm bug where convolution_backward crashes with bfloat16/float16 on GPU by casting to float32 for the PyTorch reference.

The issue occurs when running numeric verification with the --verify-numerics flag on operations that use bfloat16 or float16. PyTorch's convolution_backward operation crashes with a floating-point exception (exit code 136) on ROCm/GPU for these dtypes.

This fix detects half-precision inputs and temporarily casts them to float32 when running the PyTorch GPU reference, then casts the results back to the original dtype for comparison. It is applied in both collect_error_samples() and run_structured_test(). Tested with bfloat16, float16, and float32; all now work correctly.
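The cast-and-restore step described in this commit message can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: the helper name `run_reference_in_float32` and the `ref_fn` parameter are hypothetical, and the sketch assumes all half-precision inputs share one dtype.

```python
import torch

# Dtypes the commit message reports as crashing in the PyTorch reference.
HALF_DTYPES = (torch.bfloat16, torch.float16)


def run_reference_in_float32(ref_fn, *tensors):
    """Hypothetical helper: run ref_fn with half-precision inputs upcast to
    float32, then cast tensor results back to the original half dtype."""
    half = [t.dtype for t in tensors if t.dtype in HALF_DTYPES]
    if not half:
        # Nothing to work around: call the reference unchanged.
        return ref_fn(*tensors)

    out_dtype = half[0]  # assumption: every half-precision input uses this dtype
    upcast = [t.to(torch.float32) if t.dtype in HALF_DTYPES else t for t in tensors]
    result = ref_fn(*upcast)

    if isinstance(result, torch.Tensor):
        return result.to(out_dtype)
    # Tuple results may contain None entries (for example, gradients that
    # were not requested), so only tensors are cast back.
    return tuple(r.to(out_dtype) if isinstance(r, torch.Tensor) else r for r in result)
```

Casting the reference output back to the original dtype keeps the comparison against the IREE result in a single dtype, at the cost of the reference being computed at higher precision than a native half-precision run would be.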
Add support for filtering commands from the main commands file by providing a skip-commands-file that lists commands to exclude. This allows users to selectively skip specific test cases without modifying the main commands file.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
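The filtering described here might look roughly like the sketch below. The function names, the whitespace normalization, and the handling of blank lines and `#` comments are assumptions made for illustration; the PR's actual matching rules are not shown on this page, and the command strings are placeholders.

```python
def load_commands(lines):
    """Strip each line; drop blank lines and '#' comments.
    (Comment handling is an assumption, not taken from the PR.)"""
    stripped = (line.strip() for line in lines)
    return [s for s in stripped if s and not s.startswith("#")]


def _normalize(command):
    # Collapse runs of whitespace so spacing differences do not defeat a match.
    return " ".join(command.split())


def filter_commands(commands, skip_commands):
    """Return the commands that do not appear in skip_commands, in order."""
    skip = {_normalize(c) for c in skip_commands}
    return [c for c in commands if _normalize(c) not in skip]


if __name__ == "__main__":
    # Placeholder command strings, not real iree-boo-driver commands.
    main_file = ["op_a --n 1\n", "\n", "# comment\n", "op_b  --n 2\n", "op_c --n 3\n"]
    skip_file = ["op_b --n 2\n"]
    commands = load_commands(main_file)
    print(filter_commands(commands, load_commands(skip_file)))
```

Exact string matching after normalization keeps the skip file easy to produce by copying lines from the main commands file.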
This commit adds timeout functionality to iree-boo-driver to prevent commands from running indefinitely.

Key changes:
- Add --timeout argument to specify timeout in seconds for each command
- Implement _call_with_timeout() using multiprocessing to enforce timeouts
- Extract command execution logic into _execute_single_command() function
- Wrap command execution with timeout handler in main loop
- Gracefully handle timeout errors and record them in CSV output

When a command times out, it is terminated and marked as "timeout" in the CSV output, allowing the driver to continue with remaining commands.

This is useful for:
- Detecting hung or infinite-loop kernels
- Setting time budgets for large benchmark runs
- Preventing single problematic configs from blocking entire test suites

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…river

Changes:
- Use multiprocessing 'spawn' method instead of 'fork' to avoid CUDA re-initialization errors in subprocesses
- Refactor _execute_single_command to create unpicklable objects (ArgumentParser, torch.device) inside subprocess for spawn compatibility
- Improve timeout handling: use SIGKILL immediately instead of SIGTERM to handle stuck GPU operations more aggressively
- Add GPU state cleanup after timeout to prevent issues in subsequent tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
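The spawn-related changes listed in this commit can be illustrated with a small sketch. Everything named here is hypothetical (`_worker`, `run_command`, and the `--size` and `--sleep` flags); the point is only that plain lists and strings cross the process boundary, the parser is built inside the child, and a timed-out child is killed outright. GPU state cleanup is omitted.

```python
import argparse
import multiprocessing as mp
import time


def _worker(argv, device_name, conn):
    # Build the parser inside the child: only a list of strings and a device
    # name are passed in, so nothing unpicklable has to be sent under 'spawn'.
    parser = argparse.ArgumentParser()
    parser.add_argument("--size", type=int, default=1)      # hypothetical flag
    parser.add_argument("--sleep", type=float, default=0.0)  # hypothetical flag
    args = parser.parse_args(argv)
    # Per the commit message, the real driver would also create its
    # torch.device here from device_name; this sketch just echoes the name.
    time.sleep(args.sleep)
    conn.send({"device": device_name, "size": args.size})
    conn.close()


def run_command(argv, device_name, timeout_s, start_method="spawn"):
    """Run one command in a child process; kill it if it exceeds timeout_s."""
    ctx = mp.get_context(start_method)
    parent_conn, child_conn = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_worker, args=(argv, device_name, child_conn))
    proc.start()
    child_conn.close()  # the parent keeps only the receiving end
    try:
        if parent_conn.poll(timeout_s):
            try:
                return ("ok", parent_conn.recv())
            except EOFError:
                # The child exited without sending a result.
                return ("error", None)
        proc.kill()  # SIGKILL on POSIX: no chance for a stuck child to ignore it
        return ("timeout", None)
    finally:
        proc.join()
        parent_conn.close()


if __name__ == "__main__":
    print(run_command(["--size", "4"], "cpu", timeout_s=10.0))
    print(run_command(["--sleep", "30"], "cpu", timeout_s=1.0))
```

With 'spawn', the child starts a fresh interpreter and re-imports the main module, so the worker must be a top-level function and any code that starts processes belongs under the `if __name__ == "__main__":` guard.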