[Do not Review] Add options to skip commands with iree-boo-driver #1317

Draft
pravg-amd wants to merge 4 commits into iree-org:main from pravg-amd:add_command_timeout

@pravg-amd

  1. Add an option to specify which commands from the list should be skipped
  2. Add a timeout option that sets a per-command time limit
  3. Temporarily work around a bfloat16 issue in PyTorch by casting to float32 when comparing numerics

praveen-g-ctt and others added 4 commits February 27, 2026 11:02
Work around a PyTorch/ROCm bug where convolution_backward crashes with
bfloat16/float16 on GPU by casting to float32 for the PyTorch reference.

The issue occurs when running numeric verification with --verify-numerics
flag on operations using bfloat16 or float16. PyTorch's convolution_backward
operation crashes with a floating-point exception (exit code 136) on ROCm/GPU
for these dtypes.

This fix detects half-precision inputs and temporarily casts them to float32
when running the PyTorch GPU reference, then casts results back to the
original dtype for comparison. Applied in both collect_error_samples() and
run_structured_test() functions.

Tested with bfloat16, float16, and float32 - all now work correctly.

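The cast-and-restore workaround described in this commit can be sketched roughly as below. This is not the code from the PR: `run_reference_in_float32` and its parameters are hypothetical names, and the helper only assumes tensor-like objects that expose `.dtype` and `.to(dtype)`, as `torch.Tensor` does.

```python
def run_reference_in_float32(reference_fn, inputs, half_dtypes, wide_dtype):
    """Run reference_fn on inputs, upcasting half-precision tensors first.

    Hypothetical helper: `inputs` are tensor-like objects with `.dtype`
    and `.to(dtype)`; `half_dtypes` is the set of dtypes to work around.
    """
    # Find the half-precision dtype to restore afterwards (first one seen).
    half = next((t.dtype for t in inputs if t.dtype in half_dtypes), None)
    if half is None:
        # No half-precision input: call the reference unchanged.
        return reference_fn(*inputs)

    # Upcast only the half-precision inputs; leave everything else alone.
    widened = [t.to(wide_dtype) if t.dtype in half_dtypes else t for t in inputs]
    results = reference_fn(*widened)

    # Cast results back so the comparison happens in the original dtype.
    # Backward ops may return a tuple that contains None for unused gradients.
    if isinstance(results, (tuple, list)):
        return type(results)(None if r is None else r.to(half) for r in results)
    return results.to(half)
```

With PyTorch, a call would presumably look like `run_reference_in_float32(fn, args, {torch.bfloat16, torch.float16}, torch.float32)`.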
Add support for filtering commands from the main commands file by
providing a skip-commands-file that lists commands to exclude. This
allows users to selectively skip specific test cases without modifying
the main commands file.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

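A minimal sketch of the skip-file filtering this commit describes, under assumptions the commit message does not confirm: one command per line, blank lines and `#` comments ignored, and a command skipped only on an exact match after whitespace normalization. The function names are hypothetical.

```python
from pathlib import Path


def read_command_lines(path):
    """Return the commands in a file, one per line (assumed format)."""
    commands = []
    for raw in Path(path).read_text().splitlines():
        line = " ".join(raw.split())  # collapse runs of whitespace
        if line and not line.startswith("#"):
            commands.append(line)
    return commands


def filter_skipped(commands, skip_commands):
    """Drop every command that also appears in skip_commands, keeping order."""
    skip = set(skip_commands)
    return [cmd for cmd in commands if cmd not in skip]
```

The driver would then run `filter_skipped(read_command_lines(main_file), read_command_lines(skip_file))`, leaving the main commands file untouched.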
This commit adds timeout functionality to iree-boo-driver to prevent
commands from running indefinitely. Key changes:

- Add --timeout argument to specify timeout in seconds for each command
- Implement _call_with_timeout() using multiprocessing to enforce timeouts
- Extract command execution logic into _execute_single_command() function
- Wrap command execution with timeout handler in main loop
- Gracefully handle timeout errors and record them in CSV output

When a command times out, it is terminated and marked as "timeout" in
the CSV output, allowing the driver to continue with remaining commands.

This is useful for:
- Detecting hung or infinite-loop kernels
- Setting time budgets for large benchmark runs
- Preventing single problematic configs from blocking entire test suites

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

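The "record the timeout and keep going" behaviour can be illustrated with a small driver loop. This is only a sketch: the column names, the `run_and_record` function, and the convention that the per-command runner raises `TimeoutError` are assumptions for illustration, not details taken from the PR.

```python
import csv


def run_and_record(commands, run_one, out):
    """Run each command and write one CSV row per command to `out`.

    `run_one(cmd)` is a hypothetical callable that returns a status string
    and raises TimeoutError when the command exceeds its time limit.
    """
    writer = csv.writer(out)
    writer.writerow(["command", "status"])  # assumed column names
    for cmd in commands:
        try:
            status = run_one(cmd)
        except TimeoutError:
            # A timed-out command is recorded and the loop continues,
            # so one hung config does not block the rest of the run.
            status = "timeout"
        writer.writerow([cmd, status])
```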
…river

Changes:
- Use multiprocessing 'spawn' method instead of 'fork' to avoid CUDA
  re-initialization errors in subprocesses
- Refactor _execute_single_command to create unpicklable objects
  (ArgumentParser, torch.device) inside subprocess for spawn compatibility
- Improve timeout handling: use SIGKILL immediately instead of SIGTERM
  to handle stuck GPU operations more aggressively
- Add GPU state cleanup after timeout to prevent issues in subsequent tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
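For readers unfamiliar with the pattern, a spawn-based timeout wrapper along the lines described above might look like the following. It is a simplified sketch, not the PR's `_call_with_timeout()`: the name `call_with_timeout` and the returned status strings are hypothetical, it does not pass results back to the parent, and it omits the GPU state cleanup.

```python
import multiprocessing as mp


def call_with_timeout(target, args=(), timeout=None):
    """Run target(*args) in a spawned process.

    Returns "ok", "error", or "timeout" (hypothetical status strings).
    `target` and `args` must be picklable, which is why unpicklable
    objects have to be created inside the child under 'spawn'.
    """
    # 'spawn' starts a fresh interpreter instead of forking the parent,
    # so the child does not inherit an already-initialized GPU runtime.
    ctx = mp.get_context("spawn")
    proc = ctx.Process(target=target, args=args)
    proc.start()
    proc.join(timeout)  # wait at most `timeout` seconds (None waits forever)
    if proc.is_alive():
        proc.kill()  # SIGKILL on POSIX, with no SIGTERM grace period
        proc.join()  # reap the killed child
        return "timeout"
    return "ok" if proc.exitcode == 0 else "error"
```

Because 'spawn' re-imports the main module in the child, any script that calls this helper needs the usual `if __name__ == "__main__":` guard around its entry point.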