Skip to content

[SYCL][Graph] Initial implementation of graph-owned device allocations #385

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 62 commits into from

Conversation

Bensuo
Copy link
Collaborator

@Bensuo Bensuo commented Apr 10, 2025

  • Allocations managed via new graph_mem_pool
  • Device allocations use virtual memory
  • Intercept async_alloc calls when adding nodes to graph
  • New tests for functionality

ianayl and others added 7 commits April 11, 2025 12:55
In `ProgramManager::removeImages`, we cleanup our KernelName2KernelID
mapping, before using that exact mapping retrieve KernelIDs in order to
clean up our KernelIDs2BinImage mapping. This PR cleans up
`m_KernelID2BinImage` mapping before cleaning up
`m_KernelName2KernelIDs` maping.

This is inteded to fix a hit raised by Coverity.
This patch fixes up and enable memory pools for the HIP adapter, it is
based on oneapi-src/unified-runtime#1689 and on
the CUDA adapter implementation.

The initial patch had segmentation faults in the CI that we couldn't
reproduce locally. That happened as well in this patch and I couldn't
reproduce the segfaults locally either.

However I noticed that it failed in `urUSMHostAlloc`, and that entry
point was different from the CUDA adapter version, in that the HIP
adapter was using a "helper" function. It turns out that the helper
function was using a device pool instead of a host pool to do the
allocation, which seemed obviously wrong. Replacing the helper by
similar code used in the CUDA adapter fixes the crash in the CI.
…l#17850)

This ensures that functions have the right linkage.

Several functions are marked as used to prevent them from being removed
as dead code before the work item loop pass and
`PrepareSYCLNativeCPUPass` run.
@Bensuo Bensuo force-pushed the ben/graph_async_alloc_impl branch from 47c4a78 to 8b8f12e Compare April 14, 2025 13:08
`urDeviceRetain` and `urDeviceRelease` are no-ops, so this patch removes
all uses.

Uses of it in the context will be removed in
intel#17571
@Bensuo Bensuo force-pushed the ben/graph_async_alloc_impl branch from 8b8f12e to 0d47482 Compare April 14, 2025 13:23
Pool tracking is now always enabled for UMF and cannot be disabled, so
we can remove these ifdefs.
if (auto Graph = h.getCommandGraph(); Graph) {
// Check if the pointer to be freed has an associated allocation node, and
// error if not
if (!Graph->getMemPool().hasAllocation(ptr)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a section to the design doc on the async malloc/free impl. Doesn't need to be a lot, but could note things like free being a no-op as we don't do memory reallocation yet.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some wording for these now, thanks!


// The number of live executable graphs that have been created from this
// modifiable graph
size_t MExecGraphCount = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be atomic?

Also, It's also interesting that the spec for finalize error says to throw if "the graph has previously been finalized."
not just to throw if you finalize while there is already an alive executable graph.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, It's also interesting that the spec for finalize error says to throw if "the graph has previously been finalized." not just to throw if you finalize while there is already an alive executable graph.

Ah good catch, that was some old wording, I've updated the spec now to reflect the correct behaviour as shown here.

should this be atomic?

Hmmm, maybe it should be yeah. I was thinking the locks we have would be enough to avoid issues but actually because the decrement of this count is called from a constructor we can't use a lock. I think making it atomic will make it thread safe but with the usual caveats that ordering is not guaranteed. AKA if a user is destroying an executable graph and simultaneously trying to finalize a new one from different threads we can't guarantee a correct ordering in that case.

npmiller and others added 5 commits April 14, 2025 15:04
This patch should essentially be NFC. The way it's currently setup is
still very hacky and could use further refactoring, but this patch just
isolates the hack a little bit and makes it more obvious what's going
on, as well as removing some code duplication.
… add respective warning (intel#17947)

Currently some of our docs suggest using `get(properties_tag) {...}`
member in kernel functors to specify kernel properties while some
suggest using `get(properties_tag) const {...}`. But actually
`get(properties_tag) {...}` might not work under certain conditions, so
we should update the incorrect docs that they suggest readers to use
`get(properties_tag) const {...}` rather than `get(properties_tag)
{...}`.

Also as @Pennycook suggested, added a diagnostic warning to indicate the
user when:
-they have declared a `get(properties_tag)` member in the kernel functor
AND
-they declared `get(properties_tag){...}` rather than
`get(properties_tag) const {...}`

Updated the tests containing the non-const version of
`get(properties_tag)` as well.

---------

Signed-off-by: Hu, Peisen <[email protected]>
The workflow was renamed.
The other change related to increasing the sizes is due to the fact that
stride has to be > 64 bytes in order for 2d block read/write is
correctly generated.
We need to set `cuda-path`/`hip-path` on Windows as the autodetection
isn't implemented upstream, it seems only implemented for Linux.

Also fix a double assign and a missed update in `E2EExpr`

This was tested with some other pending changes
[here](https://github.com/intel/llvm/actions/runs/14389689524/job/40353640119?pr=14114)
and
[here](https://github.com/intel/llvm/actions/runs/14389689487/job/40353627543?pr=14114).

Signed-off-by: Sarnie, Nick <[email protected]>
@Bensuo Bensuo force-pushed the ben/graph_async_alloc_impl branch 2 times, most recently from f22eba9 to 39caa11 Compare April 14, 2025 16:55
hvdijk and others added 11 commits April 14, 2025 18:01
)

Several functions were being defined both in libclc and in libdevice,
but had different definitions between the two. This commit changes it so
that they are only defined once, in libclc, with the definition that we
had in libdevice that we were relying on taking precedence.
- Fix barrier wait cleanup of events given an out event such that an
additional refcnt release is done and with multi command lists and in
order, the active barrier is properly cleared before reassignment.

Signed-off-by: Neil R. Spruit <[email protected]>
cleanup UMF build flags: UMF_ENABLE_POOL_TRACKING and
UMF_BUILD_LIBUMF_POOL_DISJOINT have been removed from the latest UMF.
The features that were previously enabled by these flags are now always
enabled.
I fixed the LIT infra so this works now. AMDGPU one is coming later.

Signed-off-by: Sarnie, Nick <[email protected]>
…k dashboard via sycl-docs.yml (intel#17522)

This PR:
- Introduces new CI to replace existing nightly benchmarking workflow
- Adds a preliminary implementation of a comparison script alongside the
existing benchmarking scripts: This comparison script is used by the CI
to decide whether or not a benchmark result is over a (predefined)
tolerance above historical median of benchmark results
  
This is not a permanent solution: I am looking into exponential moving
averages in the future
- Adds code in sycl-docs.yml for bringing up new benchmark results
dashboard on https://intel.github.io

---------

Co-authored-by: Piotr Balcer <[email protected]>
Co-authored-by: Łukasz Stolarczuk <[email protected]>
On the application side, kernel names can be retrieved as a const char*
from the integration header or built-ins. On the library side, they are
retrieved from the offload entries. With the recent introduction of the
__sycl_unregister_lib implementation, there shouldn't be any need to
store copies of those strings anymore.
…est (intel#17691)

Replace the following e2e tests with unit test:
```
    ProgramManager/multi_device_bundle/device_libs_and_caching.cpp
    ProgramManager/multi_device_bundle/build_twice.cpp
    KernelAndProgram/persistent-cache-multi-device.cpp
```
    
The reason is that the checks performed in those e2e tests (number of UR
calls, etc.) can be done more effificently using unit tests, such checks
in e2e tests are expensive. Another reason is that e2e tests use
NEO-specific `CreateMultipleRootDevices=[num]` environment variable
which is not guaranteed to work, in unit tests we can emulate as many
devices as we want.
Logically, it affects subsequent `%{run}` lines so must be done in
run-mode, not in build-mode.
Neil suggested we do this, and the prerelease versions probably won't
have binaries anyway so trying to use the artifact will fail.

Formatter made me fix some unrelated stuff too.

---------

Signed-off-by: Sarnie, Nick <[email protected]>
* Move copyright notices out of file `__doc__` strings and into comments
* Move function `__doc__` strings into functions
@Bensuo Bensuo force-pushed the ben/graph_async_alloc_impl branch from fe51208 to f93ce75 Compare April 15, 2025 16:09
After intel#18015 `wrap_kernel` is powerful
enough to substitute it with just a minor tweak.
@Bensuo Bensuo force-pushed the ben/graph_async_alloc_impl branch from f93ce75 to 5638e1b Compare April 15, 2025 16:34
jbrodman and others added 13 commits April 15, 2025 18:36
Remove the legacy implementation of emulation for OoO queues. This code
predated the PI and UR layers in DPC++ because fpga OpenCL did not
support OoO queues. Today, this type of thing should be done by the
backend compiler or UR layer.

---------

Signed-off-by: James Brodman <[email protected]>
Level Zero had a new full-quality release and it has binaries so we can
use it now.

Also, we've had this issue forever:
```
The following packages have unmet dependencies:
 intel-igc-cm : Depends: intel-igc-core (>= 1.0.9995) but it is not installabl
```

The problem was the prerelease version of CM has bad dependencies, use
the latest released version which works fine, and we don't really use CM
anyway.

Formatter made me fix some unrelated areas again.

---------

Signed-off-by: Sarnie, Nick <[email protected]>
`std::ignore` seems to have been adopted as some sort of ersatz `(void)`
cast or `[[maybe_unused]]` annotation on unused parameters. Since the
following are true:
1. The behavior of `std::ignore` outside of `std::tie` is not formally
specified;
  2. C++ already has builtin ways to express ignorance of a value;
3. Compiling `std::tie` is much more expensive than using the builtin
alternatives [1];
  4. It's confusing to read;
5. It requires an extra header inclusion: I decided to simply omit the
argument name in all cases where this is possible.

This is a mostly completely mechanical transformation, using the
following script:
```
find unified-runtime -name "*.[ch]pp" \
  | xargs sed -i '/^\s*std::ignore\s*=\s*[a-zA-Z_][a-zA-Z0-9_]*;/d' \
  &&  git ls-files -m | parallel clang-tidy \
    '--checks="-*,misc-unused-parameters"' --fix --fix-errors -p build
```
Followed by manually introducting `[[maybe_unused]]` in `NDEBUG` blocks
in a few places.

[1]: Both clang and gcc with libstdc++ and libc++ show similar slowdowns
compiling the std::ignore vs void.
I didn't try with MSVC but I see no reason it wouldn't be similar
```
$ ipython3
In [1]: def ignore(n):
   ...:     ignores = 'std::ignore = x;\n' * n
   ...:     return f'''#include <tuple>\nint foo(int x) {{ {ignores} \n return x;}}'''
   ...:

In [2]: def void(n):
   ...:     voids = '(void)x;\n' * n
   ...:     return f'''int foo(int x) {{ {voids} \n return x;\n}}'''
   ...:

In [3]: with open("ignore.cpp", "w") as f:
   ...:     f.write(ignore(100000))
   ...:

In [4]: with open("voids.cpp", "w") as f:
   ...:     f.write(void(100000))
   ...:
$ time c++ -S voids.cpp
c++ -S voids.cpp  0.18s user 0.03s system 99% cpu 0.211 total
$ time c++ -S ignore.cpp
c++ -S ignore.cpp  4.50s user 0.34s system 99% cpu 4.836 total
```
I manually tested it and it finally works this time.

Signed-off-by: Sarnie, Nick <[email protected]>
handler's secondary queue is not used for any meaningful work.
…ion (intel#18023)

As agreement in intel#16729 , In general we
discourage the use of the extension strings as it is a remnant of the
OpenCL days. Instead we want to use urDeviceGetInfo when possible. This
is a follow-up pr to replace checking OpenCL extension string with
urGetDeviceInfo for native bfloat16 conversion extension.

---------

Signed-off-by: jinge90 <[email protected]>
Co-authored-by: Steffen Larsen <[email protected]>
…ed diagnostics if kernel violates restrictions (intel#17954)

This PR adds test for free function kernels extension that checks
emitted diagnostics if kernel violates restrictions specified in spec
for declaration of free function kernel.

https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/proposed/sycl_ext_oneapi_free_function_kernels.asciidoc#defining-a-free-function-kernel

Test introduced based on test plan:
intel#17886
…tel#18026)

* Disable inserting unreachable instruction since we have another
  mechanism to exit current kernel execution.
* Cleanup clean shadow after it's allocated.
* Remove redzone maximum limitation.
Change intel#17095 introduced a failure in the logger_validation test due to
an incorrect merging of the default and "current" log level. This
change reverts to the old behavior where an invalid `output` value
causes
the logger to use the default `level`.
…ntel#18049)

The latest asan libdevice source triggers a compiler warning:
llvm/libdevice/sanitizer/asan_rtl.cpp:342:3: warning: label at end of
compound statement is a C++23 extension [-Wc++23-extensions]
This trivial pr removes it.

Signed-off-by: jinge90 <[email protected]>
…18029)

Linear IDs (the `0` in `opencl:0`) did not function as intended when
there was multiple platforms; each platform started counting at zero.
This patch changes it so that they start counting at an appropriate
number that should match `urinfo`.

`urinfo` itself has also been updated to not display linear IDs when
the device list is being filtered, as they will not be correct. This
matches the behavior of sycl-ls and is different to the
`--no-linear-ids` which changes the format of the output.
@Bensuo Bensuo force-pushed the ben/graph_async_alloc_impl branch from 5638e1b to 306b5a2 Compare April 16, 2025 10:57
KornevNikita and others added 11 commits April 16, 2025 16:21
And also disable aws-stop in sycl-rel nightly if aws-start was skipped.
… message (intel#17990)

Based on the user's feedback in intel#17957
and intel#17914

---------

Co-authored-by: Steffen Larsen <[email protected]>
Follow up of intel#17967
Secondary submission queue is optional in SYCL Spec and it has been
decided to remove it.
…ntel#18044)

None of the E2E tests fail in "preview" mode so the functionality hasn't
even been tested for awhile. Separate the code into a separate legacy
helper to avoid touching in future refactorings for the up-to-date code.
…#18065)

No value in thin wrappers requiring extra instantiation for every
kernel.
Fix cts segment fault on accessing shared usm
Also set default name to nameless globals since sanitizer cannot handle
nameless globals.
…tel#17929)

Removes the remaining fusion implementation and refactors the `sycl-jit`
library to better capture the current use-cases: SYCL runtime
compilation (RTC), and specialization constant materialization (SCM).

High-level notes:
- Split the entrypoints for RTC, SCM and option handling, formerly in
`KernelFusion[.h|.cpp]`, into separate files
- A significant amount of infrastructure became unused after removing
the fusion implementation. The remaining facilities are:
  - CMake setup to link LLVM libraries and produce a pass plugin library
- Translation from LLVM module to SPIRV and PTX/AMDGCN (loading and
linking multiple SPIRV or LLVM modules was unused and removed)
- `JITContext` class owning "binaries" (could be SPIRV blob or
PTX/AMDGCN text). At the interface, we use `JITBinaryInfo` (fka
`SYCLKernelBinaryInfo`), which is a non-owning descriptor type
  - Option handling
- Dropped "fusion" terminology, and mostly dropped "kernel" terminology,
because RTC and SCM deal with device images (runtime-side) respectively
LLVM modules (jit-side). One exception is the translation helper, which
optionally takes a kernel name, which is in turned used by the
PTX/AMDGCN translation to obtain target attributes from the
corresponding kernel function.
- The `sycl-jit-common` library (in `sycl-jit/common` directory) was
removed because it became header-only and the headers were no longer
used by the pass library (so could be moved to
`sycl-jit/jit-compiler/include`)

---------

Signed-off-by: Julian Oppermann <[email protected]>
Co-authored-by: Steffen Larsen <[email protected]>
…8055)

This fixes debug builds of UR with CUDA enabled; in that configuration
the `test-adapter-cuda` emits a call to `umf::umf2urResult` which
requires an implementaiton of `umf::getProviderNativeError`.
…efore calling func (intel#17681)

Its PR migrated from
oneapi-src/unified-runtime#2746
It looks like a common typo when writing functions.

Fixed functions:

- urEnqueueMemBufferWriteRect()
- urEnqueueMemBufferReadRect()

Co-authored-by: Nicolas Miller <[email protected]>
- Allocations managed via new graph_mem_pool
- Device allocations use virtual memory
- Intercept async_alloc calls when adding nodes to graph
- New tests for functionality
- Design doc information about implementation
@Bensuo Bensuo force-pushed the ben/graph_async_alloc_impl branch from 306b5a2 to 3d6204f Compare April 17, 2025 11:13
@Bensuo Bensuo closed this Apr 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.