The scheduled ci-latest-slang "Slang branch: master" nightly on 2026-05-22 failed
in build (linux, Debug, 3.10) when pytest-xdist worker gw1 aborted
(Fatal Python error: Aborted, SIGABRT) while running slangpy/tests/slangpy_tests/test_textures.py::test_texture_1d_broadcast[DeviceType.vulkan].
A manual re-run of the same workflow on the same slang sha passed, and a
follow-up bisect against the suspect slang commit (shader-slang/slang#11110)
and its parent both passed on the same matrix. So this is non-deterministic;
it's a flake, not a slang regression.
The exact crashing test (test_texture_1d_broadcast[DeviceType.vulkan])
ran and passed on both bisect runs. So three data points now agree that
slang master is not at fault.
Symptoms
3682 passed, 330 skipped, 1 failed in the affected run.
Sole failure: worker gw1 crashed mid-test; pytest-xdist replaced it
and continued. No assertion message — the abort was native.
The actual abort reason is not in the log: the worker was running
with crashpad active, which produced hundreds of [ERROR elf_dynamic_array_reader.h:64] tag not found lines while
symbolizing the crash, and consumed whatever stderr message the
aborting code wrote. The Python stack at the top of the dump is just
the test calling into nanobind; we can't see what the C++ side aborted on.
The python suite is invoked as pytest -n auto --maxprocesses=4, so up
to 4 worker processes are concurrently exercising the same nvrgfx
Vulkan device. The crashing test starts ~1 ms after another worker
finishes a CUDA texture test on the shared device.
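To make the "native abort" symptom concrete, here is a minimal sketch (not slangpy's CI code) of how a process killed by SIGABRT can be told apart from an ordinary failure while keeping its stderr. The helper name `run_and_classify` and the inline child script are hypothetical stand-ins for a crashing worker; POSIX semantics are assumed.

```python
import signal
import subprocess
import sys


def run_and_classify(cmd):
    """Run cmd and return (kind, detail, stderr_text).

    kind is "ok", "failed" (non-zero exit code), or "signal" (killed by a signal).
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode == 0:
        return "ok", 0, proc.stderr
    if proc.returncode < 0:
        # On POSIX, a return code of -N means the child was terminated by signal N.
        return "signal", signal.Signals(-proc.returncode).name, proc.stderr
    return "failed", proc.returncode, proc.stderr


# Hypothetical stand-in for a worker that writes a message and then aborts natively.
child = (
    "import os, sys; "
    "sys.stderr.write('native abort reason\\n'); "
    "sys.stderr.flush(); "
    "os.abort()"
)
kind, detail, err = run_and_classify([sys.executable, "-c", child])
print(kind, detail, err.strip())
```

Running the single crashing test this way, in its own process with stderr captured, is one possible way to check whether the abort message survives when nothing else is writing to the same stream.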
What we need to make this diagnosable
Preserve the crashpad minidump as an artifact (or disable crashpad
in CI) so the next occurrence gives us the actual abort message
instead of elf_dynamic_array_reader: tag not found spam. The crash-reports-linux-x86_64-gcc-Debug upload step already exists —
we just need to confirm minidumps land in it, and ideally print a
short crashpad summary to stderr before the worker dies.
Either isolate Vulkan device per pytest-xdist worker, or serialize
Vulkan-touching tests under a single worker. Concurrent submissions
from 4 workers to one shared device is the most plausible failure
surface and matches the symptom shape — slangpy's test suite already
silently skips several tests for "race condition doesn't reproduce
reliably on CI machines of varying specs" (test_torchbuffers.py:185)
and similar reasons, so worker-vs-worker races are an acknowledged
problem area.
Adopt the retry logic requested in "Add similar retry logic than in Slang SlangPy tests" (#829) so the next single-test
abort doesn't take the whole nightly red while the underlying race
gets investigated. (Until that's in place, every recurrence of this
flake will take CI red and consume on-call attention to triage.)
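As an illustration of the "serialize Vulkan-touching tests" option, the sketch below uses an advisory file lock so that only one process at a time runs a guarded section. It is a sketch under assumptions, not slangpy's implementation: the lock path and the names `exclusive_device` and `is_locked` are hypothetical, and `fcntl.flock` is POSIX-only.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager

# Hypothetical lock-file location; any path visible to all workers on the runner would do.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "slangpy-vulkan-tests.lock")


@contextmanager
def exclusive_device(lock_path=LOCK_PATH):
    """Hold an exclusive advisory lock for the duration of the with-block."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until no other holder remains
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock


def is_locked(lock_path=LOCK_PATH):
    """Return True if another open file description currently holds the lock."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        return True
    else:
        return False
    finally:
        os.close(fd)


with exclusive_device():
    held_inside = is_locked()
held_after = is_locked()
print(held_inside, held_after)
```

A pytest fixture could wrap the body of each Vulkan-parametrized test in `exclusive_device()`. That trades parallelism for isolation, so it is probably best treated as a diagnostic step: if the aborts stop under serialization, that would support the shared-device race hypothesis.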
Related issues (not duplicates)
There isn't enough evidence to call this a duplicate of any of these,
but they overlap in mechanism or environment and a triager should be
aware:
"Memory Leak when using Non-CUDA Backends (test with Vulkan) with PyTorch Interop" (#827): memory leak on non-CUDA backends with PyTorch
interop. Plausibly a contributing environment factor — the slangpy
suite runs many torch+Vulkan tests in the same pytest session, so VRAM
pressure could compound by the time this test runs. Not the same bug.
Test suite already silently skips other Vulkan-related flakes: test_torchbuffers.py:185, test_torchintegration.py:173, test_transforms.py:177 — suggesting this class of intermittent failure
is recognized but not centrally tracked.
Affected runs
https://github.com/shader-slang/slangpy/actions/runs/26263644712/job/77302274498
(linux Debug 3.10; runner 2u1g-b650-0468 in nvrgfx group)
faa91ff2a223d922f0acb9e567bd40b7063e6df6 on the same workflow + matrix.
a2ad34d7a), passed: https://github.com/shader-slang/slangpy/actions/runs/26278283513
9b044ad46, Fix SPIR-V void pointer array stride slang#11110), passed: https://github.com/shader-slang/slangpy/actions/runs/26278287261
"Interop buffer cleanup race: cuImportExternalMemory fails after async memset on Vulkan/D3D12 interop buffers" (#929), a Vulkan/D3D12 ↔ CUDA interop race, is also related but not a duplicate.
It is the same bug class (race in Vulkan-touching code) with a different
mechanism: #929 is a within-process CUDA-imports-Vulkan memory lifecycle
race, whereas the crashing test here is plain Vulkan with no CUDA interop.