Skip to content

[SYCL][Graph] Support for native-command #383

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 27 commits into from
Closed

Conversation

EwanC
Copy link
Collaborator

@EwanC EwanC commented Jan 31, 2025

Support sycl_ext_codeplay_enqueue_native_command with SYCL-Graph. Requires oneapi-src/unified-runtime#2656

Introduces interop_handle::ext_oneapi_get_native_graph<backend>() to give the user access to the native graph object which native commands can be appended to.

To use CUDA as an example, code using ext_codeplay_enqueue_native_command eagerly can be updated from:

 CGH.ext_codeplay_enqueue_native_command([=](interop_handle IH) {
       auto NativeStream = IH.get_native_queue<cuda>();
       myNativeLibraryCall(NativeStream);
}

To

CGH.ext_codeplay_enqueue_native_command([=](interop_handle IH) {
     if (IH.ext_oneapi_has_graph())  {
       auto NativeGraph = IH.ext_oneapi_get_native_graph<cuda>();
       auto NativeStream = IH.get_native_queue<cuda>();

       // Start capture stream calls into graph
       cuStreamBeginCaptureToGraph(NativeStream, NativeGraph, nullptr,
                                             nullptr, 0,
                                             CU_STREAM_CAPTURE_MODE_GLOBAL);

       myNativeLibraryCall(NativeStream);

       // Stop capturing stream calls into graph
       cuStreamEndCapture(NativeStream, &NativeGraph);
     } else {
       auto NativeStream = IH.get_native_queue<cuda>();
       myNativeLibraryCall(NativeStream );
    }
}

Example of how this integration could work in GROMACS https://gitlab.com/gromacs/gromacs/-/merge_requests/4954

@EwanC EwanC force-pushed the graph_native_enqueue branch from 9836c3c to 5a5aef2 Compare January 31, 2025 16:00
@EwanC EwanC force-pushed the graph_native_enqueue branch 10 times, most recently from 764aebc to 4e62db9 Compare February 5, 2025 14:13
@EwanC EwanC force-pushed the graph_native_enqueue branch 4 times, most recently from c07c67a to f499268 Compare February 12, 2025 13:17
@EwanC EwanC force-pushed the graph_native_enqueue branch 3 times, most recently from b934d72 to c085dc2 Compare February 20, 2025 09:41
@EwanC EwanC force-pushed the graph_native_enqueue branch 10 times, most recently from a710e4c to d9628c0 Compare March 11, 2025 09:37
@EwanC EwanC force-pushed the graph_native_enqueue branch 2 times, most recently from 01d273a to c29c1c4 Compare March 14, 2025 08:36
Adds limited support for device globals in runtime-compiled SYCL code.
The application interacts with the globals via three new methods on
`kernel_bundle`:

```c++
bool ext_oneapi_has_device_global(const std::string &name);
void *ext_oneapi_get_device_global_address(const std::string &name, const device &dev); // return a USM pointer suitable for queue::memcpy etc.
size_t ext_oneapi_get_device_global_size(const std::string &name);
```

This PR uses the same trick as intel#16316, i.e. prepending a
kernel-bundle-specific prefix to the names of device globals to make
them distinguishable for the program manager.

Limitations:
- Device globals inside a namespace are unsupported due to insufficient
name mangling.
- Device globals with the `device_image_scope` property cannot be
read/written from the host, because the runtime currently cannot expose
USM pointers for them. A workaround is using explicit kernels to
read/write the global's value into a USM buffer.

---------

Signed-off-by: Julian Oppermann <[email protected]>
@EwanC EwanC force-pushed the graph_native_enqueue branch from c29c1c4 to ac21cad Compare March 14, 2025 09:04
AlexeySachkov and others added 2 commits March 14, 2025 10:20
This is one patch in a series intended to re-organize
`add_sycl_rt_library` helper function to make it shorter and simpler.

Dropped unnecessary `target_include_directories` calls:
- `LIB_NAME` is a link-only target, no need to set any include
directories for it
- SYCL RT should not be using `unified-runtime/source` or
`unified-runtime/source/common` includes
intel#17446)

The description of the `build` shorthand uses `ctxt`, despite there
being no such argument associated. This commit fixes this by using the
devices from the `sourceBundle` argument.

Signed-off-by: Larsen, Steffen <[email protected]>
@EwanC EwanC force-pushed the graph_native_enqueue branch from ac21cad to 3661bfe Compare March 14, 2025 11:27
aarongreig and others added 3 commits March 14, 2025 11:38
Also remove a cfi exclusive KNOWN_FAILURE as it couldn't be reproduced
across a few platforms.

closes oneapi-src/unified-runtime#2647
…l#17430)

Cherry-pick of: llvm/llvm-project#128513

Fixes: intel#16903

------
This diagnostic is disabled for device compilation as float128 is not
supported on the device side.

Other diagnostics are already covering the cases where float128 is
actually used in the kernel code, and it's already tested for in the
existing test.

This is expanding on the patch 318bff6 that handled this for cuda
compilation.
Latest distros don't allow global pip installations.
@EwanC EwanC force-pushed the graph_native_enqueue branch from 3661bfe to 41687a3 Compare March 14, 2025 11:55
AlexeySachkov and others added 15 commits March 14, 2025 15:30
Consolidated all common (i.e. non-conditional) include directories
settings for SYCL RT build in a single place.
…tel#17297)

Changes:
Command_list_manager no longer synchronize its calls, instead the
responsibility to ensure exclusivity belongs to the caller.
To add synchronization I implemented the mechanism similar to rust lock
as suggested in
intel#17061 (comment).
Move cts_exclude_filter from `devops` dir to `sycl` and check it out
from testing branch.
The SYCL spec mandates that kernel IDs for kernels generated through RTC
shouldn't be accessible from the user application. This PR makes it so
we filter out the output of `get_kernel_ids` so that we behave according
to the spec.
Removes some tests that were relying on kernel_ids being accessible from
the user application, updates `test_lifetime` to check on debug prints.
Refactors `kernel_compiler_sycl_jit_lt.cpp` to remove unnecessary device
capability checks and use a common `sycl::queue`.
Add support for timeline semaphores on linux and windows for CUDA
backend and preliminary support in L0 backend. Update
vulkan_interop/sampled_images.cpp to remove redundant wait when using
semaphores.
…ntel#17366)

Caller might set CMAKE_MSVC_RUNTIME_LIBRARY to MultiThreaded to do
static linking of MSVC runtime. However, UMF is not yet ready for static
linking. Override the CMAKE_MSVC_RUNTIME_LIBRARY to fix lnking failures.
Bump UMF to v0.11.0-dev4.
From now on disjoint_pool is part of libumf,
instead of being a separate library.

---------

Signed-off-by: Lukasz Dorau <[email protected]>
`urCommandBufferEnqueueExp` was changed in
intel#16984 to `urEnqueueCommandBufferExp`
but added back accidentally to the v2 L0 adapter in
intel#17297
Support [sycl_ext_codeplay_enqueue_native_command](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_codeplay_enqueue_native_command.asciidoc) with SYCL-Graph.

Introduces `interop_handle::ext_codeplay_get_native_graph<backend>()` to
give the user access to the native graph object which native commands
can be appended to.

To use CUDA as an example, code using `ext_codeplay_enqueue_native_command`
eagerly can be updated from:

```cpp
 CGH.ext_codeplay_enqueue_native_command([=](interop_handle IH) {
       auto NativeStream = IH.get_native_queue<cuda>();
       myNativeLibraryCall(NativeStream);
}
```

To

```cpp
CGH.ext_codeplay_enqueue_native_command([=](interop_handle IH) {
     if (IH.ext_codeplay_has_graph())  {
       auto NativeGraph = IH.ext_codeplay_get_native_graph<cuda>();
       auto NativeStream = IH.get_native_queue<cuda>();

       // Start capture stream calls into graph
       cuStreamBeginCaptureToGraph(NativeStream, NativeGraph, nullptr,
                                             nullptr, 0,
                                             CU_STREAM_CAPTURE_MODE_GLOBAL);

       myNativeLibraryCall(NativeStream);

       // Stop capturing stream calls into graph
       cuStreamEndCapture(NativeStream, &NativeGraph);
     } else {
       auto NativeStream = IH.get_native_queue<cuda>();
       myNativeLibraryCall(NativeStream );
    }
}
```

Example of how this integration could work in GROMACS https://gitlab.com/gromacs/gromacs/-/merge_requests/4954
@EwanC EwanC force-pushed the graph_native_enqueue branch from 41687a3 to d82bb30 Compare March 14, 2025 20:07
@EwanC
Copy link
Collaborator Author

EwanC commented Mar 17, 2025

Closing this, upstream PR has now merged intel#16871

@EwanC EwanC closed this Mar 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.