[SYCL][Graph] Initial implementation of graph-owned device allocations #385

Bensuo · 2025-04-10T15:23:42Z

Allocations managed via new graph_mem_pool
Device allocations use virtual memory
Intercept async_alloc calls when adding nodes to graph
New tests for functionality

In `ProgramManager::removeImages`, we cleanup our KernelName2KernelID mapping, before using that exact mapping retrieve KernelIDs in order to clean up our KernelIDs2BinImage mapping. This PR cleans up `m_KernelID2BinImage` mapping before cleaning up `m_KernelName2KernelIDs` maping. This is inteded to fix a hit raised by Coverity.

…-g is passed. (intel#17987) Reverts intel#16408

This fixes a regression after intel#17876

This patch fixes up and enable memory pools for the HIP adapter, it is based on oneapi-src/unified-runtime#1689 and on the CUDA adapter implementation. The initial patch had segmentation faults in the CI that we couldn't reproduce locally. That happened as well in this patch and I couldn't reproduce the segfaults locally either. However I noticed that it failed in `urUSMHostAlloc`, and that entry point was different from the CUDA adapter version, in that the HIP adapter was using a "helper" function. It turns out that the helper function was using a device pool instead of a host pool to do the allocation, which seemed obviously wrong. Replacing the helper by similar code used in the CUDA adapter fixes the crash in the CI.

Follow up from intel#17931, additional fix for URT-903.

…l#17850) This ensures that functions have the right linkage. Several functions are marked as used to prevent them from being removed as dead code before the work item loop pass and `PrepareSYCLNativeCPUPass` run.

`urDeviceRetain` and `urDeviceRelease` are no-ops, so this patch removes all uses. Uses of it in the context will be removed in intel#17571

Pool tracking is now always enabled for UMF and cannot be disabled, so we can remove these ifdefs.

EwanC · 2025-04-14T13:41:04Z

sycl/source/detail/async_alloc.cpp

+  if (auto Graph = h.getCommandGraph(); Graph) {
+    // Check if the pointer to be freed has an associated allocation node, and
+    // error if not
+    if (!Graph->getMemPool().hasAllocation(ptr)) {


Can we add a section to the design doc on the async malloc/free impl. Doesn't need to be a lot, but could note things like free being a no-op as we don't do memory reallocation yet.

I've added some wording for these now, thanks!

sycl/source/detail/graph_impl.hpp

EwanC · 2025-04-14T13:46:49Z

sycl/source/detail/graph_impl.hpp

+
+  // The number of live executable graphs that have been created from this
+  // modifiable graph
+  size_t MExecGraphCount = 0;


should this be atomic?

Also, It's also interesting that the spec for finalize error says to throw if "the graph has previously been finalized."
not just to throw if you finalize while there is already an alive executable graph.

Also, It's also interesting that the spec for finalize error says to throw if "the graph has previously been finalized." not just to throw if you finalize while there is already an alive executable graph.

Ah good catch, that was some old wording, I've updated the spec now to reflect the correct behaviour as shown here.

should this be atomic?

Hmmm, maybe it should be yeah. I was thinking the locks we have would be enough to avoid issues but actually because the decrement of this count is called from a constructor we can't use a lock. I think making it atomic will make it thread safe but with the usual caveats that ordering is not guaranteed. AKA if a user is destroying an executable graph and simultaneously trying to finalize a new one from different threads we can't guarantee a correct ordering in that case.

sycl/source/detail/graph_memory_pool.hpp

This patch should essentially be NFC. The way it's currently setup is still very hacky and could use further refactoring, but this patch just isolates the hack a little bit and makes it more obvious what's going on, as well as removing some code duplication.

@Pennycook

… add respective warning (intel#17947) Currently some of our docs suggest using `get(properties_tag) {...}` member in kernel functors to specify kernel properties while some suggest using `get(properties_tag) const {...}`. But actually `get(properties_tag) {...}` might not work under certain conditions, so we should update the incorrect docs that they suggest readers to use `get(properties_tag) const {...}` rather than `get(properties_tag) {...}`. Also as @Pennycook suggested, added a diagnostic warning to indicate the user when: -they have declared a `get(properties_tag)` member in the kernel functor AND -they declared `get(properties_tag){...}` rather than `get(properties_tag) const {...}` Updated the tests containing the non-const version of `get(properties_tag)` as well. --------- Signed-off-by: Hu, Peisen <[email protected]>

The workflow was renamed.

The other change related to increasing the sizes is due to the fact that stride has to be > 64 bytes in order for 2d block read/write is correctly generated.

We need to set `cuda-path`/`hip-path` on Windows as the autodetection isn't implemented upstream, it seems only implemented for Linux. Also fix a double assign and a missed update in `E2EExpr` This was tested with some other pending changes [here](https://github.com/intel/llvm/actions/runs/14389689524/job/40353640119?pr=14114) and [here](https://github.com/intel/llvm/actions/runs/14389689487/job/40353627543?pr=14114). Signed-off-by: Sarnie, Nick <[email protected]>

) Several functions were being defined both in libclc and in libdevice, but had different definitions between the two. This commit changes it so that they are only defined once, in libclc, with the definition that we had in libdevice that we were relying on taking precedence.

- Fix barrier wait cleanup of events given an out event such that an additional refcnt release is done and with multi command lists and in order, the active barrier is properly cleared before reassignment. Signed-off-by: Neil R. Spruit <[email protected]>

cleanup UMF build flags: UMF_ENABLE_POOL_TRACKING and UMF_BUILD_LIBUMF_POOL_DISJOINT have been removed from the latest UMF. The features that were previously enabled by these flags are now always enabled.

I fixed the LIT infra so this works now. AMDGPU one is coming later. Signed-off-by: Sarnie, Nick <[email protected]>

…k dashboard via sycl-docs.yml (intel#17522) This PR: - Introduces new CI to replace existing nightly benchmarking workflow - Adds a preliminary implementation of a comparison script alongside the existing benchmarking scripts: This comparison script is used by the CI to decide whether or not a benchmark result is over a (predefined) tolerance above historical median of benchmark results This is not a permanent solution: I am looking into exponential moving averages in the future - Adds code in sycl-docs.yml for bringing up new benchmark results dashboard on https://intel.github.io --------- Co-authored-by: Piotr Balcer <[email protected]> Co-authored-by: Łukasz Stolarczuk <[email protected]>

On the application side, kernel names can be retrieved as a const char* from the integration header or built-ins. On the library side, they are retrieved from the offload entries. With the recent introduction of the __sycl_unregister_lib implementation, there shouldn't be any need to store copies of those strings anymore.

…est (intel#17691) Replace the following e2e tests with unit test: ``` ProgramManager/multi_device_bundle/device_libs_and_caching.cpp ProgramManager/multi_device_bundle/build_twice.cpp KernelAndProgram/persistent-cache-multi-device.cpp ``` The reason is that the checks performed in those e2e tests (number of UR calls, etc.) can be done more effificently using unit tests, such checks in e2e tests are expensive. Another reason is that e2e tests use NEO-specific `CreateMultipleRootDevices=[num]` environment variable which is not guaranteed to work, in unit tests we can emulate as many devices as we want.

Follows what I started in intel#17838.

Logically, it affects subsequent `%{run}` lines so must be done in run-mode, not in build-mode.

Neil suggested we do this, and the prerelease versions probably won't have binaries anyway so trying to use the artifact will fail. Formatter made me fix some unrelated stuff too. --------- Signed-off-by: Sarnie, Nick <[email protected]>

* Move copyright notices out of file `__doc__` strings and into comments * Move function `__doc__` strings into functions

After intel#18015 `wrap_kernel` is powerful enough to substitute it with just a minor tweak.

Remove the legacy implementation of emulation for OoO queues. This code predated the PI and UR layers in DPC++ because fpga OpenCL did not support OoO queues. Today, this type of thing should be done by the backend compiler or UR layer. --------- Signed-off-by: James Brodman <[email protected]>

Level Zero had a new full-quality release and it has binaries so we can use it now. Also, we've had this issue forever: ``` The following packages have unmet dependencies: intel-igc-cm : Depends: intel-igc-core (>= 1.0.9995) but it is not installabl ``` The problem was the prerelease version of CM has bad dependencies, use the latest released version which works fine, and we don't really use CM anyway. Formatter made me fix some unrelated areas again. --------- Signed-off-by: Sarnie, Nick <[email protected]>

Sigh Signed-off-by: Sarnie, Nick <[email protected]>

`std::ignore` seems to have been adopted as some sort of ersatz `(void)` cast or `[[maybe_unused]]` annotation on unused parameters. Since the following are true: 1. The behavior of `std::ignore` outside of `std::tie` is not formally specified; 2. C++ already has builtin ways to express ignorance of a value; 3. Compiling `std::tie` is much more expensive than using the builtin alternatives [1]; 4. It's confusing to read; 5. It requires an extra header inclusion: I decided to simply omit the argument name in all cases where this is possible. This is a mostly completely mechanical transformation, using the following script: ``` find unified-runtime -name "*.[ch]pp" \ | xargs sed -i '/^\s*std::ignore\s*=\s*[a-zA-Z_][a-zA-Z0-9_]*;/d' \ && git ls-files -m | parallel clang-tidy \ '--checks="-*,misc-unused-parameters"' --fix --fix-errors -p build ``` Followed by manually introducting `[[maybe_unused]]` in `NDEBUG` blocks in a few places. [1]: Both clang and gcc with libstdc++ and libc++ show similar slowdowns compiling the std::ignore vs void. I didn't try with MSVC but I see no reason it wouldn't be similar ``` $ ipython3 In [1]: def ignore(n): ...: ignores = 'std::ignore = x;\n' * n ...: return f'''#include <tuple>\nint foo(int x) {{ {ignores} \n return x;}}''' ...: In [2]: def void(n): ...: voids = '(void)x;\n' * n ...: return f'''int foo(int x) {{ {voids} \n return x;\n}}''' ...: In [3]: with open("ignore.cpp", "w") as f: ...: f.write(ignore(100000)) ...: In [4]: with open("voids.cpp", "w") as f: ...: f.write(void(100000)) ...: $ time c++ -S voids.cpp c++ -S voids.cpp 0.18s user 0.03s system 99% cpu 0.211 total $ time c++ -S ignore.cpp c++ -S ignore.cpp 4.50s user 0.34s system 99% cpu 4.836 total ```

I manually tested it and it finally works this time. Signed-off-by: Sarnie, Nick <[email protected]>

handler's secondary queue is not used for any meaningful work.

…ion (intel#18023) As agreement in intel#16729 , In general we discourage the use of the extension strings as it is a remnant of the OpenCL days. Instead we want to use urDeviceGetInfo when possible. This is a follow-up pr to replace checking OpenCL extension string with urGetDeviceInfo for native bfloat16 conversion extension. --------- Signed-off-by: jinge90 <[email protected]> Co-authored-by: Steffen Larsen <[email protected]>

…ed diagnostics if kernel violates restrictions (intel#17954) This PR adds test for free function kernels extension that checks emitted diagnostics if kernel violates restrictions specified in spec for declaration of free function kernel. https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/proposed/sycl_ext_oneapi_free_function_kernels.asciidoc#defining-a-free-function-kernel Test introduced based on test plan: intel#17886

…tel#18026) * Disable inserting unreachable instruction since we have another mechanism to exit current kernel execution. * Cleanup clean shadow after it's allocated. * Remove redzone maximum limitation.

Change intel#17095 introduced a failure in the logger_validation test due to an incorrect merging of the default and "current" log level. This change reverts to the old behavior where an invalid `output` value causes the logger to use the default `level`.

…ntel#18049) The latest asan libdevice source triggers a compiler warning: llvm/libdevice/sanitizer/asan_rtl.cpp:342:3: warning: label at end of compound statement is a C++23 extension [-Wc++23-extensions] This trivial pr removes it. Signed-off-by: jinge90 <[email protected]>

…18029) Linear IDs (the `0` in `opencl:0`) did not function as intended when there was multiple platforms; each platform started counting at zero. This patch changes it so that they start counting at an appropriate number that should match `urinfo`. `urinfo` itself has also been updated to not display linear IDs when the device list is being filtered, as they will not be correct. This matches the behavior of sycl-ls and is different to the `--no-linear-ids` which changes the format of the output.

And also disable aws-stop in sycl-rel nightly if aws-start was skipped.

… message (intel#17990) Based on the user's feedback in intel#17957 and intel#17914 --------- Co-authored-by: Steffen Larsen <[email protected]>

Follow up of intel#17967 Secondary submission queue is optional in SYCL Spec and it has been decided to remove it.

…ntel#18044) None of the E2E tests fail in "preview" mode so the functionality hasn't even been tested for awhile. Separate the code into a separate legacy helper to avoid touching in future refactorings for the up-to-date code.

…#18065) No value in thin wrappers requiring extra instantiation for every kernel.

Fix cts segment fault on accessing shared usm

Also set default name to nameless globals since sanitizer cannot handle nameless globals.

…tel#17929) Removes the remaining fusion implementation and refactors the `sycl-jit` library to better capture the current use-cases: SYCL runtime compilation (RTC), and specialization constant materialization (SCM). High-level notes: - Split the entrypoints for RTC, SCM and option handling, formerly in `KernelFusion[.h|.cpp]`, into separate files - A significant amount of infrastructure became unused after removing the fusion implementation. The remaining facilities are: - CMake setup to link LLVM libraries and produce a pass plugin library - Translation from LLVM module to SPIRV and PTX/AMDGCN (loading and linking multiple SPIRV or LLVM modules was unused and removed) - `JITContext` class owning "binaries" (could be SPIRV blob or PTX/AMDGCN text). At the interface, we use `JITBinaryInfo` (fka `SYCLKernelBinaryInfo`), which is a non-owning descriptor type - Option handling - Dropped "fusion" terminology, and mostly dropped "kernel" terminology, because RTC and SCM deal with device images (runtime-side) respectively LLVM modules (jit-side). One exception is the translation helper, which optionally takes a kernel name, which is in turned used by the PTX/AMDGCN translation to obtain target attributes from the corresponding kernel function. - The `sycl-jit-common` library (in `sycl-jit/common` directory) was removed because it became header-only and the headers were no longer used by the pass library (so could be moved to `sycl-jit/jit-compiler/include`) --------- Signed-off-by: Julian Oppermann <[email protected]> Co-authored-by: Steffen Larsen <[email protected]>

…8055) This fixes debug builds of UR with CUDA enabled; in that configuration the `test-adapter-cuda` emits a call to `umf::umf2urResult` which requires an implementaiton of `umf::getProviderNativeError`.

…efore calling func (intel#17681) Its PR migrated from oneapi-src/unified-runtime#2746 It looks like a common typo when writing functions. Fixed functions: - urEnqueueMemBufferWriteRect() - urEnqueueMemBufferReadRect() Co-authored-by: Nicolas Miller <[email protected]>

- Allocations managed via new graph_mem_pool - Device allocations use virtual memory - Intercept async_alloc calls when adding nodes to graph - New tests for functionality - Design doc information about implementation

ianayl and others added 7 commits April 11, 2025 12:55

[SYCL][Driver] Revert -O0 as the default SYCL device optimization if …

92121c9

…-g is passed. (intel#17987) Reverts intel#16408

[CI][Benchmarks] fix compute-benchmarks compilation (intel#17993)

b1051b6

This fixes a regression after intel#17876

[UR] Retain and release device handles for sub-devices (intel#17977)

02d4a8d

Follow up from intel#17931, additional fix for URT-903.

[UR] Move native cpu device code cts skip to LoadSource. (intel#17973)

f531ef3

Bensuo force-pushed the ben/graph_async_alloc_impl branch from 47c4a78 to 8b8f12e Compare April 14, 2025 13:08

[UR][CUDA][HIP] Remove unnecessary device retain/release (intel#17937)

c11bb9c

`urDeviceRetain` and `urDeviceRelease` are no-ops, so this patch removes all uses. Uses of it in the context will be removed in intel#17571

Bensuo force-pushed the ben/graph_async_alloc_impl branch from 8b8f12e to 0d47482 Compare April 14, 2025 13:23

[UR][CUDA][HIP] Remove umf pool tracking ifdefs (intel#17997)

81b09d0

Pool tracking is now always enabled for UMF and cannot be disabled, so we can remove these ifdefs.

EwanC reviewed Apr 14, 2025

View reviewed changes

npmiller and others added 5 commits April 14, 2025 15:04

[CI] Fix the workflow name (intel#17994)

3543603

The workflow was renamed.

[SYCL][E2E][JM] Enable tests passing on PVC (intel#17932)

3758172

The other change related to increasing the sizes is due to the fact that stride has to be > 64 bytes in order for 2d block read/write is correctly generated.

Bensuo force-pushed the ben/graph_async_alloc_impl branch 2 times, most recently from f22eba9 to 39caa11 Compare April 14, 2025 16:55

hvdijk and others added 11 commits April 14, 2025 18:01

[UR][UMF] Cleanup UMF build flags (intel#18003)

2cbb3da

cleanup UMF build flags: UMF_ENABLE_POOL_TRACKING and UMF_BUILD_LIBUMF_POOL_DISJOINT have been removed from the latest UMF. The features that were previously enabled by these flags are now always enabled.

[SYCL][E2E] Enable is_compatible_nvptx64.cpp on Windows (intel#18008)

384025f

I fixed the LIT infra so this works now. AMDGPU one is coming later. Signed-off-by: Sarnie, Nick <[email protected]>

[SYCL][NFCI] More refactoring around "kernel wrapping" (intel#18015)

db8fda2

Follows what I started in intel#17838.

[SYCL][E2E] Use %{run-aux} when removing %t/cache_dir (intel#18014)

858539a

Logically, it affects subsequent `%{run}` lines so must be done in run-mode, not in build-mode.

[NFC][UR] Cleanup Python generator scripts (intel#18006)

6c291d2

* Move copyright notices out of file `__doc__` strings and into comments * Move function `__doc__` strings into functions

Bensuo force-pushed the ben/graph_async_alloc_impl branch from fe51208 to f93ce75 Compare April 15, 2025 16:09

[SYCL][NFCI] Remove single_task_lambda_impl helper (intel#18020)

97d56c1

After intel#18015 `wrap_kernel` is powerful enough to substitute it with just a minor tweak.

Bensuo force-pushed the ben/graph_async_alloc_impl branch from f93ce75 to 5638e1b Compare April 15, 2025 16:34

jbrodman and others added 13 commits April 15, 2025 18:36

[CI] Revert Level Zero bump yet again (intel#18040)

2c7aa07

Sigh Signed-off-by: Sarnie, Nick <[email protected]>

[CI] Hopefully successfully update Level Zero to 1.21.9 (intel#18043)

51ce490

I manually tested it and it finally works this time. Signed-off-by: Sarnie, Nick <[email protected]>

[SYCL] Remove secondary submission queue from handler (intel#17967)

08d11bc

handler's secondary queue is not used for any meaningful work.

[SYCL] Optimize handler::setNDRangeDescriptor (intel#17969)

a7f5e93

[DevSanitizer] Fix several misc issues in device memory sanitizer (in…

c71d4ca

…tel#18026) * Disable inserting unreachable instruction since we have another mechanism to exit current kernel execution. * Cleanup clean shadow after it's allocated. * Remove redzone maximum limitation.

Bensuo force-pushed the ben/graph_async_alloc_impl branch from 5638e1b to 306b5a2 Compare April 16, 2025 10:57

KornevNikita and others added 11 commits April 16, 2025 16:21

[CI] Disable some workflows in forks (intel#18038)

b6af150

And also disable aws-stop in sycl-rel nightly if aws-start was skipped.

[Clang][Doc][NFC] Improve -offload-compress documentation and error…

d466092

… message (intel#17990) Based on the user's feedback in intel#17957 and intel#17914 --------- Co-authored-by: Steffen Larsen <[email protected]>

[SYCL] Remove secondary queue from submit info DS (intel#18045)

37fd1cc

Follow up of intel#17967 Secondary submission queue is optional in SYCL Spec and it has been decided to remove it.

[SYCL][NFCI] Use wrap_kernel directly in the user-facing API (intel…

6d64961

…#18065) No value in thin wrappers requiring extra instantiation for every kernel.

[DeviceMSAN] Use device usm to allocate LaunchInfo (intel#17948)

8f7792b

Fix cts segment fault on accessing shared usm

[DevTSAN] Disable checks for vtable and atomics (intel#18074)

e1f523c

Also set default name to nameless globals since sanitizer cannot handle nameless globals.

[UR] Implement getProviderNativeError in CUDA adptr. tests (intel#1…

018679f

…8055) This fixes debug builds of UR with CUDA enabled; in that configuration the `test-adapter-cuda` emits a call to `umf::umf2urResult` which requires an implementaiton of `umf::getProviderNativeError`.

Bensuo force-pushed the ben/graph_async_alloc_impl branch from 306b5a2 to 3d6204f Compare April 17, 2025 11:13

Bensuo closed this Apr 18, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SYCL][Graph] Initial implementation of graph-owned device allocations #385

[SYCL][Graph] Initial implementation of graph-owned device allocations #385

Bensuo commented Apr 10, 2025

EwanC Apr 14, 2025

Bensuo Apr 17, 2025

EwanC Apr 14, 2025

Bensuo Apr 15, 2025

[SYCL][Graph] Initial implementation of graph-owned device allocations #385

[SYCL][Graph] Initial implementation of graph-owned device allocations #385

Conversation

Bensuo commented Apr 10, 2025

EwanC Apr 14, 2025

Choose a reason for hiding this comment

Bensuo Apr 17, 2025

Choose a reason for hiding this comment

EwanC Apr 14, 2025

Choose a reason for hiding this comment

Bensuo Apr 15, 2025

Choose a reason for hiding this comment