Skip to content

[SYCL] NULL-deref in getSortedImages comparator on dynamically-linked SYCL kernel libraries #21972

@pshirshov

Description

@pshirshov

NULL-deref in sycl::detail::getSortedImagesstrcmp on entry name without null-check

Summary

sycl::detail::getSortedImages() in persistent_device_code_cache.cpp:99-107 crashes with
SIGSEGV (strcmp(NULL, ...)) when one of the RTDeviceBinaryImage entries it sorts has a
NULL EntriesBegin->GetName(). Triggered reliably when an application loads a SYCL kernel via
a dynamically-linked .so and requests JIT compilation with SYCL_CACHE_PERSISTENT=1 (the
default in many downstreams).

The same code, statically linked, works fine — confirming that the NULL Name is produced
when an RTDeviceBinaryImage is registered from a dlopen'd shared library. The cache
sort lambda assumes every image has a non-empty entry table with a named first entry, which
is not invariant.

Affected source

sycl/source/detail/persistent_device_code_cache.cpp (master branch, also at
ab3dc98de0fd1ada9df12b138de1e1f8b715cc27):

static std::vector<const RTDeviceBinaryImage *>
getSortedImages(const std::vector<const RTDeviceBinaryImage *> &Imgs) {
  std::vector<const RTDeviceBinaryImage *> SortedImgs = Imgs;
  std::sort(SortedImgs.begin(), SortedImgs.end(),
            [](const RTDeviceBinaryImage *A, const RTDeviceBinaryImage *B) {
              // All entry names are unique among these images, so comparing the
              // first ones is enough.
              return std::strcmp(A->getRawData().EntriesBegin->GetName(),
                                 B->getRawData().EntriesBegin->GetName()) < 0;
            });
  return SortedImgs;
}

The comment "All entry names are unique among these images, so comparing the first ones is
enough" relies on an unverified invariant. Encountered in real workloads where
EntriesBegin->GetName() is NULL.

Backtrace (libunwind, from a real crash)

Captured via LD_PRELOAD SIGSEGV handler with libunwind on Intel Arc Pro B70 (BMG-G31,
PCI 8086:e223), GGML SYCL backend at first ggml_sycl_op_get_rows JIT:

=== SIGSEGV pid=63802 addr=(nil) ===
RIP=74aea4eb6bab RDI=8000000000000000 RSI=74ae4d29c490 ...
  #0   _ZSt16__insertion_sort<...>+0xcc                  in libsycl.so.8+0x33816c
  #1   sycl::_V1::detail::getSortedImages(...)+0x17c     in libsycl.so.8+0x33909c
  #2   sycl::_V1::detail::PersistentDeviceCodeCache::
       getItemFromDisc(...)+0xb1                         in libsycl.so.8+0x345751
  #3   ProgramManager::getOrCreateURProgram(...)+0x6b    in libsycl.so.8+0x3075eb
  #4   ProgramManager::getBuiltURProgram::lambda+0x25b   in libsycl.so.8+0x30945b
  #5   ProgramManager::getBuiltURProgram(...)+0xc2c      in libsycl.so.8+0x30b56c
  #6   ProgramManager::getBuiltURProgram(...)+0x3af      in libsycl.so.8+0x313dbf
  #7   ProgramManager::getOrCreateKernel(...)+0x1dc      in libsycl.so.8+0x31459c
  #8-12 sycl::_V1::detail::queue_impl::submit_kernel_*  in libsycl.so.8
  #13  ggml_sycl_op_get_rows+0x2714                      in libggml-sycl.so
  #14  ggml_sycl_get_rows+0x7f                           in libggml-sycl.so
  #15  ggml_backend_sycl_graph_compute+0x767             in libggml-sycl.so
  ...

%rdi and %rdx (strcmp args) are visible in the register dump as 0x0 and a heap pointer —
i.e., strcmp(NULL, valid_string).

The crash address vmovdqu (%rdi),%ymm0 inside __strcmp_avx2 confirms the first-arg
NULL deref.

Reproduction

Environment:

  • libsycl rev ab3dc98de0fd1ada9df12b138de1e1f8b715cc27 (intel-llvm, 2025-11-14)
  • Intel Arc Pro B70 with intel-compute-runtime 26.09, OpenCL backend
    (ONEAPI_DEVICE_SELECTOR=opencl:gpu)
  • GGML SYCL backend from llama.cpp@073bb2c20 (ml/backend/ggml/ggml/src/ggml-sycl/)

Reproducer:

  1. Build llama.cpp's SYCL backend with BUILD_SHARED_LIBS=ON so that libggml-sycl.so is a
    dynamic library (this is what ollama does).
  2. dlopen libggml-sycl.so from a host program (or link a Go runner that
    LD_LIBRARY_PATH-loads it).
  3. Run any inference that triggers GGML_OP_GET_ROWS via SYCL on first decode (e.g. ollama
    running qwen2.5:3b or any model on Battlemage).
  4. Crash on first kernel JIT, before any token output.

Static-link of the same libggml-sycl (e.g., pkg/llama-cpp-sycl/llama-cli with
BUILD_SHARED_LIBS=false) does not crash — same source, same env, same model, same SYCL
runtime. The dynamic-link path is what produces an RTDeviceBinaryImage with
EntriesBegin->GetName() == nullptr.

Workaround

Set SYCL_CACHE_PERSISTENT=0. This bypasses the getItemFromDisc codepath entirely, so
getSortedImages is never called. Verified working: the same workload that crashed at
first decode now generates correct text.

Suggested fix

Make the sort comparator NULL-safe (one approach):

static std::vector<const RTDeviceBinaryImage *>
getSortedImages(const std::vector<const RTDeviceBinaryImage *> &Imgs) {
  std::vector<const RTDeviceBinaryImage *> SortedImgs = Imgs;
  std::sort(SortedImgs.begin(), SortedImgs.end(),
            [](const RTDeviceBinaryImage *A, const RTDeviceBinaryImage *B) {
              const auto *NameA = A->getRawData().EntriesBegin
                                ? A->getRawData().EntriesBegin->GetName()
                                : nullptr;
              const auto *NameB = B->getRawData().EntriesBegin
                                ? B->getRawData().EntriesBegin->GetName()
                                : nullptr;
              if (NameA == NameB) return false;          // both NULL or same ptr
              if (NameA == nullptr) return true;         // NULLs sort first
              if (NameB == nullptr) return false;
              return std::strcmp(NameA, NameB) < 0;
            });
  return SortedImgs;
}

Better still: assert at registration time that every RTDeviceBinaryImage registered via
__sycl_register_lib has a non-empty entry table with a named first entry, OR explicitly
filter out images with EntriesBegin == EntriesEnd before passing to
getSortedImages so the cache key is well-defined.

Ideally, also identify why __sycl_register_lib accepts an image with a NULL entry name in
the first place — that's likely the root invariant violation produced by the
-fsycl driver pipeline when generating SPIR-V images for a shared object. The downstream
crash signature is the symptom; the upstream codegen path (which produces the empty/null
entry image inside a .so) is the root cause.

Reproducibility notes

  • Crash is 100% reliable with SYCL_CACHE_PERSISTENT=1 (default).
  • Crash does NOT happen with static link, even with the same kernel sources, same flags, and
    same SYCL runtime.
  • Wiping the on-disk cache directory (~/.cache/libsycl_cache) does not help — the bug is in
    the in-memory vector<RTDeviceBinaryImage*> sort, not in cached files.
  • The SYCL runtime first calls getItemFromDisc (cache lookup) on the image vector before
    doing fresh JIT; the lookup needs the sorted vector to construct the cache key. So the
    crash hits BEFORE any cache I/O, regardless of cache contents.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions