[BOO] Limit threads used for numeric verification #1346

Open
keshavvinayak01 wants to merge 1 commit into iree-org:main from keshavvinayak01:verify-numerics-failure

Conversation

@keshavvinayak01

@keshavvinayak01 keshavvinayak01 commented Apr 16, 2026

Contributor

Limit BLAS threads to 1 inside compute_cpu_reference() to prevent OpenBLAS from exceeding its compiled-in thread limit (128) on high-core-count machines, which causes a segfault.

Fixes #1336

Limit BLAS to a single thread inside compute_cpu_reference(), which
exists for correctness not performance.  Without this, OpenBLAS tries
to spawn as many threads as there are CPU cores and exceeds its
compiled-in limit (typically 128), causing a segfault.

Fixes iree-org#1336

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Keshav Vinayak Jha <keshavvinayakjha@gmail.com>
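
The description above does not show how the limit is applied, so the following is only a minimal sketch of the idea: a context manager that caps a thread pool while the reference computation runs and restores the previous size afterwards. `limit_threads` and `FakePool` are hypothetical names used for illustration, not code from this PR. In a real setup the getter/setter pair would come from whatever controls the BLAS pool; `threadpoolctl.threadpool_limits(limits=1, user_api="blas")` is one existing ready-made context manager of this kind.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def limit_threads(
    get_threads: Callable[[], int],
    set_threads: Callable[[int], None],
    limit: int = 1,
) -> Iterator[None]:
    """Temporarily cap a thread pool, restoring the old size on exit."""
    previous = get_threads()
    set_threads(min(previous, limit))
    try:
        yield
    finally:
        # Restore the original size even if the wrapped computation raises.
        set_threads(previous)


class FakePool:
    """Stand-in for a real BLAS thread pool (hypothetical, for illustration)."""

    def __init__(self, size: int) -> None:
        self.size = size

    def get(self) -> int:
        return self.size

    def set(self, n: int) -> None:
        self.size = n


pool = FakePool(192)  # pretend the machine exposes 192 cores
with limit_threads(pool.get, pool.set, limit=1):
    inside = pool.size  # the CPU reference computation would run here
after = pool.size
```

Scoping the limit to the reference computation keeps the rest of the process at its normal thread count.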
@keshavvinayak01 keshavvinayak01 changed the title from "Fix segfault in --verify-numerics on machines with >128 CPU cores" to "[BOO] Limit threads used for numeric verification" Apr 16, 2026
@keshavvinayak01

Contributor Author

This should fix the RDNA failure @yash-amd

@rkayaith

rkayaith commented Apr 16, 2026

Member

1 thread would be pretty slow, can you just use min(num_threads, 128)?
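
A minimal sketch of that suggestion, assuming the 128 figure quoted in the PR description is the relevant build-time limit (it depends on how OpenBLAS was compiled) and that `os.cpu_count()` is an acceptable stand-in for `num_threads`. `capped_thread_count` is a hypothetical helper name.

```python
import os
from typing import Optional

# Limit quoted in the PR description; the real value is build-dependent.
OPENBLAS_THREAD_LIMIT = 128


def capped_thread_count(
    available: Optional[int] = None,
    limit: int = OPENBLAS_THREAD_LIMIT,
) -> int:
    """Return min(available, limit), never less than 1."""
    if available is None:
        # os.cpu_count() may return None on unusual platforms.
        available = os.cpu_count() or 1
    return max(1, min(available, limit))


threads_on_big_host = capped_thread_count(available=192)
threads_on_small_host = capped_thread_count(available=16)
```

Machines below the limit keep all their cores, so only high-core-count hosts are throttled.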

@rkayaith

Member

Also I noticed in the issue it was reported that a nightly TheRock version of pytorch was used, not a stable release. Could you check if stable versions are hitting this as well? If not, I think it'd be better to work around this locally for now, and report a bug to the appropriate repo.

@keshavvinayak01

Contributor Author

Also I noticed in the issue it was reported that a nightly TheRock version of pytorch was used, not a stable release. Could you check if stable versions are hitting this as well? If not, I think it'd be better to work around this locally for now, and report a bug to the appropriate repo.

@yash-amd Please check.

@yash-amd

yash-amd commented Apr 16, 2026

Contributor

Also I noticed in the issue it was reported that a nightly TheRock version of pytorch was used, not a stable release. Could you check if stable versions are hitting this as well? If not, I think it'd be better to work around this locally for now, and report a bug to the appropriate repo.

i have asked @deedongala from ossci team to check for the rocm version installed on the mi355 nod-ai runner, because rocminfo on that runner shows "Marketing Name" as "AMD Radeon Graphics" instead of "AMD Instinct MI355X", which is what we see on the other conductor machines like 10-09.
@deedongala any update on this?

@rkayaith

Member

i have asked @deedongala from ossci team to check for the rocm version installed on the mi355 nod-ai runner

In the meantime, can you try a test run on the CI with pytorch 2.10 installed with --index-url https://download.pytorch.org/whl/rocm7.1, and without this fix, to see if the error still occurs?

@yash-amd

yash-amd commented Apr 17, 2026

Contributor

i have asked @deedongala from ossci team to check for the rocm version installed on the mi355 nod-ai runner

In the meantime, can you try a test run on the CI with pytorch 2.10 installed with --index-url https://download.pytorch.org/whl/rocm7.1, and without this fix, to see if the error still occurs?

yeah, I tested this again and it still reports "AMD Radeon Graphics" instead of "AMD Instinct MI355" in the Setup Environment job linked below, using:

  pip install "torch>=2.5,<=2.10.0" --index-url https://download.pytorch.org/whl/rocm7.1
  python3 -c "import torch; props = torch.cuda.get_device_properties(0); print(props.name)"

https://github.com/nod-ai/amd-shark-ai/actions/runs/24546996337/job/71764661591?pr=2894#step:6:82

@rkayaith

Member

This PR is about addressing a different issue (OpenBLAS warning: precompiled NUM_THREADS exceeded segfault when running tests), do you still see that error with the stable pytorch?

@yash-amd

yash-amd commented Apr 17, 2026

Contributor

This PR is about addressing a different issue (OpenBLAS warning: precompiled NUM_THREADS exceeded segfault when running tests), do you still see that error with the stable pytorch?

yeah, i was not talking about OpenBLAS warning issue, that is solved.
I was talking about this: https://xilinx.slack.com/archives/C08JKR35LRY/p1774394516655799

@rkayaith

rkayaith commented Apr 17, 2026

Member

yeah, i was not talking about OpenBLAS warning issue, that is solved.

so this PR isn't necessary anymore? To clarify, I was asking if stable pytorch without this fix still hits the OpenBLAS issue.

@yash-amd

yash-amd commented Apr 17, 2026

Contributor

yeah, i was not talking about OpenBLAS warning issue, that is solved.

so this PR isn't necessary anymore? To clarify, I was asking if stable pytorch without this fix still hits the OpenBLAS issue.

oh ok, I haven't checked whether this issue (the OpenBLAS warning) occurs with stable pytorch.

@yash-amd

yash-amd commented Apr 17, 2026

Contributor

yeah, i was not talking about OpenBLAS warning issue, that is solved.

so this PR isn't necessary anymore? To clarify, I was asking if stable pytorch without this fix still hits the OpenBLAS issue.

the runs work when using the stable pytorch version on ToM (top-of-main) iree-turbine:
for rdna4: https://github.com/nod-ai/amd-shark-ai-reports/blob/main/boo/boo-custom-runs-gfx120X/2026-04-17_17-51/rdna4_attention_shapes_miopen_iree.csv

for mi355: https://github.com/nod-ai/amd-shark-ai-reports/blob/main/boo/boo-custom-runs/2026-04-17_17-12/attention_shapes_miopen_iree.csv

@rkayaith

Member

okay I think it'll be best to work around this in CI for now by setting the OPENBLAS_NUM_THREADS env var, and report this as an issue to TheRock so they can fix this.
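
A minimal sketch of such a workaround, written as a Python launcher for illustration. `OPENBLAS_NUM_THREADS` is the environment variable named above; OpenBLAS reads it when the library is loaded, so it has to be set before the process that imports numpy or torch starts. The helper name `env_with_openblas_cap`, the 128 cap, and the child command are assumptions, not the project's actual CI configuration.

```python
import os
import subprocess
import sys


def env_with_openblas_cap(limit: int = 128) -> dict:
    """Copy the current environment and cap OPENBLAS_NUM_THREADS."""
    env = dict(os.environ)
    cap = max(1, min(os.cpu_count() or 1, limit))
    env["OPENBLAS_NUM_THREADS"] = str(cap)
    return env


# Hypothetical CI step: the child process stands in for the benchmark run
# and simply reports the value it inherited.
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['OPENBLAS_NUM_THREADS'])"],
    env=env_with_openblas_cap(),
    capture_output=True,
    text=True,
    check=True,
)
seen_by_child = int(result.stdout.strip())
```

In a CI workflow file the same effect would normally come from setting the variable in the job's environment rather than from a wrapper script.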

@yash-amd

yash-amd commented Apr 20, 2026

Copy link
Copy Markdown
Contributor

okay I think it'll be best to work around this in CI for now by setting the OPENBLAS_NUM_THREADS env var, and report this as an issue to TheRock so they can fix this.

also one thing to note: with the stable pytorch version, we are getting N.A. values for six configs with iree as the backend, as can be seen in this file: https://github.com/nod-ai/amd-shark-ai-reports/blob/main/boo/boo-custom-runs-gfx120X/2026-04-17_17-51/rdna4_attention_shapes_miopen_iree.csv

@yash-amd

yash-amd commented Apr 20, 2026

Contributor

okay I think it'll be best to work around this in CI for now by setting the OPENBLAS_NUM_THREADS env var, and report this as an issue to TheRock so they can fix this.

also one thing to note: with the stable pytorch version, we are getting N.A. values for six configs with iree as the backend, as can be seen in this file: https://github.com/nod-ai/amd-shark-ai-reports/blob/main/boo/boo-custom-runs-gfx120X/2026-04-17_17-51/rdna4_attention_shapes_miopen_iree.csv

this might be because of

torch.OutOfMemoryError: HIP out of memory. Tried to allocate 6.00 GiB. GPU 0 has a total capacity of 15.92 GiB of which 3.70 GiB is free. Of the allocated memory 3.19 GiB is allocated by PyTorch, and 1.81 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  

for rdna4 in the logs https://github.com/nod-ai/amd-shark-ai/actions/runs/24578284442/job/71869254564#step:15:120

I tried setting PYTORCH_ALLOC_CONF=expandable_segments:True as well, but I still get the same "torch.OutOfMemoryError: HIP out of memory".
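
A rough check of the numbers in the error message above suggests why the allocator flag may not help here. This is only arithmetic on the quoted figures, under the assumption that "free" plus "reserved but unallocated" is the most memory the allocator could hand out without anything else being released.

```python
# Figures copied from the OOM message above, in GiB.
total = 15.92
free = 3.70
allocated_by_pytorch = 3.19
reserved_unallocated = 1.81
requested = 6.00

# Best case for defragmentation: every reserved-but-unused byte is reclaimed.
best_case_available = free + reserved_unallocated
shortfall = requested - best_case_available

# Memory in use that the message does not attribute to the PyTorch allocator.
unattributed = total - free - allocated_by_pytorch - reserved_unallocated

print(f"best case available: {best_case_available:.2f} GiB")
print(f"shortfall:           {shortfall:.2f} GiB")
print(f"unattributed usage:  {unattributed:.2f} GiB")
```

Under that assumption the 6.00 GiB request exceeds what is available even with no fragmentation, which would be consistent with `expandable_segments:True` making no difference; the remaining usage would have to come from something other than PyTorch's cached blocks.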

Development

Successfully merging this pull request may close these issues.

Getting Segmentation fault while running attention shapes with "--verify-numerics" option

3 participants