Skip to content

ESIMD DPAS kernel produces wrong results when compiled as part of a large SYCL project (Battlemage/Xe2) #21741

@PMZFX

Description

@PMZFX

ESIMD DPAS kernel produces wrong results when compiled as part of a large SYCL project (Battlemage/Xe2)

Summary

An ESIMD kernel using xmx::dpas intrinsics produces correct results when compiled in a standalone binary or small shared library, but produces incorrect results when the same kernel code is compiled as part of a larger SYCL project with many other SYCL translation units.

The failure pattern is deterministic and stable: the first work-item in the nd_range dimension 0 produces correct output, while all subsequent work-items produce incorrect output. The kernel algorithm itself has been validated extensively in standalone configurations.

Environment

  • GPU: Intel Arc Pro B70 (BMG/Xe2, device ID 0xe223)
  • Driver: libze-intel-gpu1 26.09.37435.1
  • IGC: intel-igc-core-2 2.30.1
  • Compiler: Intel oneAPI DPC++/C++ 2025.3.3 (2025.3.3.20260319)
  • Runtime: Level-Zero V2, driver version 1.14.37435+1
  • OS: Ubuntu 26.04, kernel 7.0.0-12-generic

Kernel description

The kernel implements flash attention using ESIMD DPAS intrinsics. Key characteristics:

  • Uses [[intel::sycl_explicit_simd]]
  • Uses xmx::dpas<8, 8, float> for two matrix multiplications per iteration (QK^T and PV)
  • Template parameter: HEAD_DIM (tested with 64 and 128)
  • Each work-item computes 8 rows of output (DPAS RepeatCount = 8)
  • Uses esimd::block_load, esimd::gather, esimd::block_store, esimd::exp, esimd::convert
  • nd_range<3> grid with dimension 0 = head/batch, dimension 2 = query tiles

Failure pattern

  • Deterministic: work-item 0 in global_id(0) always produces correct output
  • Deterministic: work-items 1+ in global_id(0) always produce incorrect output
  • The incorrect output is not garbage — it looks like plausible but wrong attention values (ERR ~1.3-1.5 vs expected <0.0005)
  • Failure is consistent across runs

What passes

All of the following produce correct results with the same kernel source code:

  1. Standalone test binary — kernel compiled into a small single-purpose executable
  2. Standalone with integrated-style compile flags — same -O3, -fPIC, -DNDEBUG, GGML_BACKEND_BUILD, GGML_BACKEND_SHARED, etc.
  3. Small shared library — kernel in a shared .so, called from a separate executable
  4. Small shared library + 9 additional SYCL translation units — each submitting their own kernels
  5. Small shared library + oneMKL SYCL BLAS linkage
  6. Small shared library with enlarged offload payload — 64+ additional kernel call sites to grow the device image
  7. Standalone with -O0
  8. Standalone with Large GRF mode

What fails

The kernel produces incorrect results in all of these configurations:

  1. Shared library build — kernel compiled as part of a large SYCL backend with ~50+ other SYCL source files
  2. Static library build — same large project, BUILD_SHARED_LIBS=OFF
  3. Dynamic module loading — large SYCL backend loaded via dlopen at runtime
  4. Separate helper shared library — kernel moved into its own .so with its own device image, linked against the large SYCL backend. The ESIMD device image is verified to be in a separate ELF section from the main project's device image.
  5. Queue drain before launch — explicit queue.wait() before submitting the ESIMD kernel, ruling out interaction with prior kernel submissions
  6. Per-file device code split-fsycl-device-code-split=per_kernel and per_source
  7. -O0 in the integrated build
  8. Large GRF in the integrated build
  9. SYCL_ESIMD_FUNCTION annotation
  10. Head-size template splitting — separate translation units per template instantiation

Key observations

  • The failure boundary is not shared-vs-static, not device-image composition (tested with fully separate .so), not queue state, not compile flags alone, and not offload bundle size alone.
  • The failure appears to require the kernel to be part of a build that also compiles a large number of other SYCL translation units (~50+), even if the ESIMD kernel's device image is in a completely separate shared library.
  • We were unable to reproduce the failure in a synthetic mini reproducer with up to ~10 SYCL TUs and enlarged offload payloads. The trigger appears to require the specific real-world project's source composition.
  • The joint_matrix API reports "no matrix hardware" on this device (ext_intel_matrix aspect is false), despite DPAS hardware being present and functional via ESIMD. We used ext_intel_esimd / xmx::dpas as the workaround.

Minimal description of failure

Standalone binary with ESIMD DPAS kernel → PASS (all work-items correct)
Same kernel source, compiled as part of large SYCL project → FAIL (work-item 0 correct, 1+ wrong)
Same kernel source, in separate .so with own device image, loaded by large SYCL project → FAIL

What we cannot provide yet

We have not been able to produce a self-contained minimal reproducer that triggers the failure. The failure only appears when the kernel is built alongside (or loaded from) a process that also uses a large real-world SYCL backend. We are happy to provide more details about the project structure or help test potential fixes if that would be useful.

Versions tested

All testing was done with the versions listed above. We have not yet tested with other compiler or driver versions.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions