[SYCL][Graph] Support for native-command #383

EwanC · 2025-01-31T15:04:27Z

Support sycl_ext_codeplay_enqueue_native_command with SYCL-Graph. Requires oneapi-src/unified-runtime#2656

Introduces interop_handle::ext_oneapi_get_native_graph<backend>() to give the user access to the native graph object which native commands can be appended to.

To use CUDA as an example, code using ext_codeplay_enqueue_native_command eagerly can be updated from:

 CGH.ext_codeplay_enqueue_native_command([=](interop_handle IH) {
       auto NativeStream = IH.get_native_queue<cuda>();
       myNativeLibraryCall(NativeStream);
}

To

CGH.ext_codeplay_enqueue_native_command([=](interop_handle IH) {
     if (IH.ext_oneapi_has_graph())  {
       auto NativeGraph = IH.ext_oneapi_get_native_graph<cuda>();
       auto NativeStream = IH.get_native_queue<cuda>();

       // Start capture stream calls into graph
       cuStreamBeginCaptureToGraph(NativeStream, NativeGraph, nullptr,
                                             nullptr, 0,
                                             CU_STREAM_CAPTURE_MODE_GLOBAL);

       myNativeLibraryCall(NativeStream);

       // Stop capturing stream calls into graph
       cuStreamEndCapture(NativeStream, &NativeGraph);
     } else {
       auto NativeStream = IH.get_native_queue<cuda>();
       myNativeLibraryCall(NativeStream );
    }
}

Example of how this integration could work in GROMACS https://gitlab.com/gromacs/gromacs/-/merge_requests/4954

sycl/test-e2e/Graph/NativeCommand/native_cuda_record_usm.cpp

sycl/doc/extensions/experimental/sycl_ext_codeplay_enqueue_native_command.asciidoc

Adds limited support for device globals in runtime-compiled SYCL code. The application interacts with the globals via three new methods on `kernel_bundle`: ```c++ bool ext_oneapi_has_device_global(const std::string &name); void *ext_oneapi_get_device_global_address(const std::string &name, const device &dev); // return a USM pointer suitable for queue::memcpy etc. size_t ext_oneapi_get_device_global_size(const std::string &name); ``` This PR uses the same trick as intel#16316, i.e. prepending a kernel-bundle-specific prefix to the names of device globals to make them distinguishable for the program manager. Limitations: - Device globals inside a namespace are unsupported due to insufficient name mangling. - Device globals with the `device_image_scope` property cannot be read/written from the host, because the runtime currently cannot expose USM pointers for them. A workaround is using explicit kernels to read/write the global's value into a USM buffer. --------- Signed-off-by: Julian Oppermann <[email protected]>

This is one patch in a series intended to re-organize `add_sycl_rt_library` helper function to make it shorter and simpler. Dropped unnecessary `target_include_directories` calls: - `LIB_NAME` is a link-only target, no need to set any include directories for it - SYCL RT should not be using `unified-runtime/source` or `unified-runtime/source/common` includes

intel#17446) The description of the `build` shorthand uses `ctxt`, despite there being no such argument associated. This commit fixes this by using the devices from the `sourceBundle` argument. Signed-off-by: Larsen, Steffen <[email protected]>

Also remove a cfi exclusive KNOWN_FAILURE as it couldn't be reproduced across a few platforms. closes oneapi-src/unified-runtime#2647

…l#17430) Cherry-pick of: llvm/llvm-project#128513 Fixes: intel#16903 ------ This diagnostic is disabled for device compilation as float128 is not supported on the device side. Other diagnostics are already covering the cases where float128 is actually used in the kernel code, and it's already tested for in the existing test. This is expanding on the patch 318bff6 that handled this for cuda compilation.

Latest distros don't allow global pip installations.

Consolidated all common (i.e. non-conditional) include directories settings for SYCL RT build in a single place.

…#17462)

…tel#17297) Changes: Command_list_manager no longer synchronize its calls, instead the responsibility to ensure exclusivity belongs to the caller. To add synchronization I implemented the mechanism similar to rust lock as suggested in intel#17061 (comment).

Move cts_exclude_filter from `devops` dir to `sycl` and check it out from testing branch.

The SYCL spec mandates that kernel IDs for kernels generated through RTC shouldn't be accessible from the user application. This PR makes it so we filter out the output of `get_kernel_ids` so that we behave according to the spec. Removes some tests that were relying on kernel_ids being accessible from the user application, updates `test_lifetime` to check on debug prints. Refactors `kernel_compiler_sycl_jit_lt.cpp` to remove unnecessary device capability checks and use a common `sycl::queue`.

Add support for timeline semaphores on linux and windows for CUDA backend and preliminary support in L0 backend. Update vulkan_interop/sampled_images.cpp to remove redundant wait when using semaphores.

…ntel#17366) Caller might set CMAKE_MSVC_RUNTIME_LIBRARY to MultiThreaded to do static linking of MSVC runtime. However, UMF is not yet ready for static linking. Override the CMAKE_MSVC_RUNTIME_LIBRARY to fix lnking failures.

Bump UMF to v0.11.0-dev4. From now on disjoint_pool is part of libumf, instead of being a separate library. --------- Signed-off-by: Lukasz Dorau <[email protected]>

`urCommandBufferEnqueueExp` was changed in intel#16984 to `urEnqueueCommandBufferExp` but added back accidentally to the v2 L0 adapter in intel#17297

Support [sycl_ext_codeplay_enqueue_native_command](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_codeplay_enqueue_native_command.asciidoc) with SYCL-Graph. Introduces `interop_handle::ext_codeplay_get_native_graph<backend>()` to give the user access to the native graph object which native commands can be appended to. To use CUDA as an example, code using `ext_codeplay_enqueue_native_command` eagerly can be updated from: ```cpp CGH.ext_codeplay_enqueue_native_command([=](interop_handle IH) { auto NativeStream = IH.get_native_queue<cuda>(); myNativeLibraryCall(NativeStream); } ``` To ```cpp CGH.ext_codeplay_enqueue_native_command([=](interop_handle IH) { if (IH.ext_codeplay_has_graph()) { auto NativeGraph = IH.ext_codeplay_get_native_graph<cuda>(); auto NativeStream = IH.get_native_queue<cuda>(); // Start capture stream calls into graph cuStreamBeginCaptureToGraph(NativeStream, NativeGraph, nullptr, nullptr, 0, CU_STREAM_CAPTURE_MODE_GLOBAL); myNativeLibraryCall(NativeStream); // Stop capturing stream calls into graph cuStreamEndCapture(NativeStream, &NativeGraph); } else { auto NativeStream = IH.get_native_queue<cuda>(); myNativeLibraryCall(NativeStream ); } } ``` Example of how this integration could work in GROMACS https://gitlab.com/gromacs/gromacs/-/merge_requests/4954

EwanC · 2025-03-17T08:54:58Z

Closing this, upstream PR has now merged intel#16871

EwanC force-pushed the graph_native_enqueue branch from 9836c3c to 5a5aef2 Compare January 31, 2025 16:00

EwanC commented Feb 3, 2025

View reviewed changes

sycl/test-e2e/Graph/NativeCommand/native_cuda_record_usm.cpp Outdated Show resolved Hide resolved

EwanC force-pushed the graph_native_enqueue branch 10 times, most recently from 764aebc to 4e62db9 Compare February 5, 2025 14:13

reble reviewed Feb 6, 2025

View reviewed changes

sycl/doc/extensions/experimental/sycl_ext_codeplay_enqueue_native_command.asciidoc Outdated Show resolved Hide resolved

EwanC force-pushed the graph_native_enqueue branch 4 times, most recently from c07c67a to f499268 Compare February 12, 2025 13:17

EwanC force-pushed the graph_native_enqueue branch 3 times, most recently from b934d72 to c085dc2 Compare February 20, 2025 09:41

EwanC force-pushed the graph_native_enqueue branch 10 times, most recently from a710e4c to d9628c0 Compare March 11, 2025 09:37

EwanC force-pushed the graph_native_enqueue branch 2 times, most recently from 01d273a to c29c1c4 Compare March 14, 2025 08:36

EwanC force-pushed the graph_native_enqueue branch from c29c1c4 to ac21cad Compare March 14, 2025 09:04

AlexeySachkov and others added 2 commits March 14, 2025 10:20

EwanC force-pushed the graph_native_enqueue branch from ac21cad to 3661bfe Compare March 14, 2025 11:27

aarongreig and others added 3 commits March 14, 2025 11:38

[UR] Fix typo in cfi flags handling. (intel#17452)

b2db12a

Also remove a cfi exclusive KNOWN_FAILURE as it couldn't be reproduced across a few platforms. closes oneapi-src/unified-runtime#2647

[UR][CI] Install python deps in venv (intel#17463)

93e147e

Latest distros don't allow global pip installations.

EwanC force-pushed the graph_native_enqueue branch from 3661bfe to 41687a3 Compare March 14, 2025 11:55

AlexeySachkov and others added 15 commits March 14, 2025 15:30

[SYCL][NFC] SYCL RT CMakeLists cleanup [3/N] (intel#17413)

acaa59c

Consolidated all common (i.e. non-conditional) include directories settings for SYCL RT build in a single place.

[CI] Run L0 v2 adapter tests as part of L0 and OCL testing job (intel…

dc2d29b

…#17462)

[CI][CTS] Checkout cts filter from testing branch (intel#17228)

a7b50b2

Move cts_exclude_filter from `devops` dir to `sycl` and check it out from testing branch.

[SYCL][Bindless][UR] Add support for timeline semaphores (intel#17395)

829b30d

Add support for timeline semaphores on linux and windows for CUDA backend and preliminary support in L0 backend. Update vulkan_interop/sampled_images.cpp to remove redundant wait when using semaphores.

[UR] Bump UMF to v0.11.0-dev4 (intel#17468)

a9cb8f1

Bump UMF to v0.11.0-dev4. From now on disjoint_pool is part of libumf, instead of being a separate library. --------- Signed-off-by: Lukasz Dorau <[email protected]>

[UR][NFC] Remove unused function from v2 adapter (intel#17476)

db30646

`urCommandBufferEnqueueExp` was changed in intel#16984 to `urEnqueueCommandBufferExp` but added back accidentally to the v2 L0 adapter in intel#17297

Add preview-breaking-changes macro to ABI breaks

b340233

Iterate on spec and tests

fd132cd

Address PR feedback

ebab034

Move E2E test folder

4e93de3

Address editorial feedback from Gordon

d82bb30

EwanC force-pushed the graph_native_enqueue branch from 41687a3 to d82bb30 Compare March 14, 2025 20:07

Fix v2 code after rebase on tip

8a237a7

EwanC closed this Mar 17, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SYCL][Graph] Support for native-command #383

[SYCL][Graph] Support for native-command #383

Uh oh!

EwanC commented Jan 31, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

EwanC commented Mar 17, 2025

Uh oh!

Uh oh!

[SYCL][Graph] Support for native-command #383

[SYCL][Graph] Support for native-command #383

Uh oh!

Conversation

EwanC commented Jan 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

EwanC commented Mar 17, 2025

Uh oh!

Uh oh!

EwanC commented Jan 31, 2025 •

edited

Loading