Intermittent abort in test_texture_1d_broadcast[DeviceType.vulkan] under pytest-xdist; recovered on retry #994

@jvepsalainen-nv

Summary

The scheduled ci-latest-slang "Slang branch: master" nightly on 2026-05-22 failed
in build (linux, Debug, 3.10) when pytest-xdist worker gw1 aborted
(Fatal Python error: Aborted, SIGABRT) while running
slangpy/tests/slangpy_tests/test_textures.py::test_texture_1d_broadcast[DeviceType.vulkan].
A manual re-run of the same workflow on the same slang sha passed, and a
follow-up bisect against the suspect slang commit (shader-slang/slang#11110)
and its parent both passed on the same matrix. So this is non-deterministic;
it's a flake, not a slang regression.

Affected runs

The exact crashing test (test_texture_1d_broadcast[DeviceType.vulkan])
ran and passed on both bisect runs. Together with the passing manual
re-run, that gives three data points agreeing that slang master is not
at fault.

Symptoms

  • 3682 passed, 330 skipped, 1 failed in the affected run.
  • Sole failure: worker gw1 crashed mid-test; pytest-xdist replaced it
    and continued. No assertion message — the abort was native.
  • The actual abort reason is not in the log: the worker was running
    with crashpad active, which produced hundreds of
    [ERROR elf_dynamic_array_reader.h:64] tag not found lines while
    symbolizing the crash, and consumed whatever stderr message the
    aborting code wrote. The Python stack at the top of the dump is just
    the test calling into nanobind; we can't see what the C++ side aborted on.
  • The python suite is invoked as pytest -n auto --maxprocesses=4, so up
    to 4 worker processes are concurrently exercising the same nvrgfx
    Vulkan device. The crashing test starts ~1 ms after another worker
    finishes a CUDA texture test on the shared device.
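One way to test the shared-device hypothesis from the last bullet is to pin every Vulkan-parametrized test to a single pytest-xdist worker and see whether the abort stops recurring. The following is a minimal conftest.py sketch, not slangpy's actual configuration. It assumes the suite is run with `--dist loadgroup` (required for pytest-xdist to honor `xdist_group` marks) and that Vulkan tests can be recognized by `DeviceType.vulkan` in the test id, as in the crashing test above. The group name `vulkan-serial` and the helper `is_vulkan_test` are hypothetical.

```python
# conftest.py (sketch, under the assumptions stated above)

VULKAN_GROUP = "vulkan-serial"  # hypothetical group name


def is_vulkan_test(nodeid: str) -> bool:
    """Return True for ids such as test_texture_1d_broadcast[DeviceType.vulkan]."""
    return "DeviceType.vulkan" in nodeid


def pytest_collection_modifyitems(config, items):
    # Imported lazily so the helper above stays importable without pytest.
    import pytest

    for item in items:
        if is_vulkan_test(item.nodeid):
            # With `--dist loadgroup`, pytest-xdist schedules all tests that
            # share an xdist_group name onto the same worker, so they no
            # longer run concurrently with each other.
            item.add_marker(pytest.mark.xdist_group(name=VULKAN_GROUP))
```

This only serializes Vulkan tests against each other. Tests for other device types still run concurrently on the remaining workers, so it would not rule out an interaction with the CUDA test that finished just before the crash.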

What we need to make this diagnosable

  1. Preserve the crashpad minidump as an artifact (or disable crashpad
    in CI) so the next occurrence gives us the actual abort message
    instead of elf_dynamic_array_reader: tag not found spam. The
    crash-reports-linux-x86_64-gcc-Debug upload step already exists —
    we just need to confirm minidumps land in it, and ideally print a
    short crashpad summary to stderr before the worker dies.
  2. Either isolate the Vulkan device per pytest-xdist worker, or serialize
    Vulkan-touching tests under a single worker. Concurrent submissions
    from 4 workers to one shared device are the most plausible failure
    surface and match the symptom shape. slangpy's test suite already
    silently skips several tests for "race condition doesn't reproduce
    reliably on CI machines of varying specs" (test_torchbuffers.py:185)
    and similar reasons, so worker-vs-worker races are an acknowledged
    problem area.
  3. Adopt the retry logic requested in #829 ("Add similar retry logic than
    in Slang SlangPy tests") so the next single-test abort doesn't take
    the whole nightly red while the underlying race
    gets investigated. (Until that's in place, every recurrence of this
    flake will take CI red and consume on-call attention to triage.)
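As a rough illustration of what item 3 could look like at the process level, the sketch below reruns a command until it exits 0 or an attempt limit is reached. It is not the logic requested in #829, which is not described in this issue. The function name `run_with_retry` and its parameters are hypothetical, and it uses only the Python standard library.

```python
import subprocess
import sys


def run_with_retry(cmd, max_attempts=2):
    """Run cmd until it exits 0 or max_attempts is reached.

    Returns (returncode, attempts_used) for the last attempt. On POSIX a
    negative returncode means the child was killed by a signal, for
    example -6 for SIGABRT.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        returncode = subprocess.run(cmd).returncode
        if returncode == 0:
            break
        # Log every failed attempt so a flake that recovers stays visible.
        print(
            f"attempt {attempt}/{max_attempts} exited with {returncode}",
            file=sys.stderr,
        )
    return returncode, attempt
```

A CI step could call it as `run_with_retry([sys.executable, "-m", "pytest", "<node id>"])` with the node id of the failed test. Logging each failed attempt matters here: a retry that hides the first failure would also hide how often the race actually fires.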

Related issues (not duplicates)

There isn't enough evidence to call this a duplicate of any of these,
but they overlap in mechanism or environment and a triager should be
aware:

Test suite already silently skips other Vulkan-related flakes:
test_torchbuffers.py:185, test_torchintegration.py:173,
test_transforms.py:177 — suggesting this class of intermittent failure
is recognized but not centrally tracked.
