Intermittent abort in test_texture_1d_broadcast[DeviceType.vulkan] under pytest-xdist; recovered on retry

## Summary

The scheduled `ci-latest-slang` "Slang branch: master" nightly on 2026-05-22 failed
in `build (linux, Debug, 3.10)` when pytest-xdist worker `gw1` aborted
(`Fatal Python error: Aborted`, SIGABRT) while running
`slangpy/tests/slangpy_tests/test_textures.py::test_texture_1d_broadcast[DeviceType.vulkan]`.
A manual re-run of the same workflow on the same slang sha passed, and a
follow-up bisect against the suspect slang commit (#shader-slang/slang#11110)
and its parent **both passed** on the same matrix. So this is non-deterministic;
it's a flake, not a slang regression.

## Affected runs

- **First attempt — failed:**
  https://github.com/shader-slang/slangpy/actions/runs/26263644712/job/77302274498
  (linux Debug 3.10; runner `2u1g-b650-0468` in `nvrgfx` group)
- **Manual re-run — passed**, same slang sha `faa91ff2a223d922f0acb9e567bd40b7063e6df6`
  on the same workflow + matrix.
- **Bisect, parent of suspect commit (`a2ad34d7a`) — passed:**
  https://github.com/shader-slang/slangpy/actions/runs/26278283513
- **Bisect, suspect commit (`9b044ad46`, shader-slang/slang#11110) — passed:**
  https://github.com/shader-slang/slangpy/actions/runs/26278287261

The exact crashing test (`test_texture_1d_broadcast[DeviceType.vulkan]`)
ran and **passed** on both bisect runs. So three data points now agree that
slang master is not at fault.

## Symptoms

- 3682 passed, 330 skipped, 1 failed in the affected run.
- Sole failure: worker `gw1` crashed mid-test; pytest-xdist replaced it
  and continued. No assertion message — the abort was native.
- The actual abort reason is **not in the log**: the worker was running
  with crashpad active, which produced hundreds of
  `[ERROR elf_dynamic_array_reader.h:64] tag not found` lines while
  symbolizing the crash, and consumed whatever stderr message the
  aborting code wrote. The Python stack at the top of the dump is just
  the test calling into nanobind; we can't see what the C++ side aborted on.
- The python suite is invoked as `pytest -n auto --maxprocesses=4`, so up
  to 4 worker processes are concurrently exercising the same `nvrgfx`
  Vulkan device. The crashing test starts ~1 ms after another worker
  finishes a CUDA texture test on the shared device.

## What we need to make this diagnosable

1. **Preserve the crashpad minidump as an artifact** (or disable crashpad
   in CI) so the next occurrence gives us the actual abort message
   instead of `elf_dynamic_array_reader: tag not found` spam. The
   `crash-reports-linux-x86_64-gcc-Debug` upload step already exists —
   we just need to confirm minidumps land in it, and ideally print a
   short crashpad summary to stderr before the worker dies.
2. **Either isolate Vulkan device per pytest-xdist worker, or serialize
   Vulkan-touching tests under a single worker.** Concurrent submissions
   from 4 workers to one shared device is the most plausible failure
   surface and matches the symptom shape — slangpy's test suite already
   silently skips several tests for "race condition doesn't reproduce
   reliably on CI machines of varying specs" (`test_torchbuffers.py:185`)
   and similar reasons, so worker-vs-worker races are an acknowledged
   problem area.
3. **Adopt the retry logic requested in #829** so the next single-test
   abort doesn't take the whole nightly red while the underlying race
   gets investigated. (Until that's in place, every recurrence of this
   flake will take CI red and consume on-call attention to triage.)

## Related issues (not duplicates)

There isn't enough evidence to call this a duplicate of any of these,
but they overlap in mechanism or environment and a triager should be
aware:

- shader-slang/slangpy#929 — Interop buffer cleanup race
  (Vulkan/D3D12 ↔ CUDA). **Same bug class** (race in Vulkan-touching code)
  but **different mechanism** — #929 is a within-process CUDA-imports-Vulkan
  memory lifecycle race; the crashing test here is plain Vulkan with no
  CUDA interop.
- shader-slang/slangpy#827 — Memory leak on non-CUDA backends with PyTorch
  interop. **Plausibly a contributing environment factor** — the slangpy
  suite runs many torch+Vulkan tests in the same pytest session, so VRAM
  pressure could compound by the time this test runs. Not the same bug.

Test suite already silently skips other Vulkan-related flakes:
`test_torchbuffers.py:185`, `test_torchintegration.py:173`,
`test_transforms.py:177` — suggesting this class of intermittent failure
is recognized but not centrally tracked.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Intermittent abort in test_texture_1d_broadcast[DeviceType.vulkan] under pytest-xdist; recovered on retry #994

Summary

Affected runs

Symptoms

What we need to make this diagnosable

Related issues (not duplicates)

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Intermittent abort in test_texture_1d_broadcast[DeviceType.vulkan] under pytest-xdist; recovered on retry #994

Description

Summary

Affected runs

Symptoms

What we need to make this diagnosable

Related issues (not duplicates)

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions