-
Notifications
You must be signed in to change notification settings - Fork 3
[SYCL][Graph] Initial implementation of graph-owned device allocations #385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Bensuo
commented
Apr 10, 2025
- Allocations managed via new graph_mem_pool
- Device allocations use virtual memory
- Intercept async_alloc calls when adding nodes to graph
- New tests for functionality
In `ProgramManager::removeImages`, we cleanup our KernelName2KernelID mapping, before using that exact mapping retrieve KernelIDs in order to clean up our KernelIDs2BinImage mapping. This PR cleans up `m_KernelID2BinImage` mapping before cleaning up `m_KernelName2KernelIDs` maping. This is inteded to fix a hit raised by Coverity.
…-g is passed. (intel#17987) Reverts intel#16408
This fixes a regression after intel#17876
This patch fixes up and enable memory pools for the HIP adapter, it is based on oneapi-src/unified-runtime#1689 and on the CUDA adapter implementation. The initial patch had segmentation faults in the CI that we couldn't reproduce locally. That happened as well in this patch and I couldn't reproduce the segfaults locally either. However I noticed that it failed in `urUSMHostAlloc`, and that entry point was different from the CUDA adapter version, in that the HIP adapter was using a "helper" function. It turns out that the helper function was using a device pool instead of a host pool to do the allocation, which seemed obviously wrong. Replacing the helper by similar code used in the CUDA adapter fixes the crash in the CI.
Follow up from intel#17931, additional fix for URT-903.
…l#17850) This ensures that functions have the right linkage. Several functions are marked as used to prevent them from being removed as dead code before the work item loop pass and `PrepareSYCLNativeCPUPass` run.
47c4a78
to
8b8f12e
Compare
`urDeviceRetain` and `urDeviceRelease` are no-ops, so this patch removes all uses. Uses of it in the context will be removed in intel#17571
8b8f12e
to
0d47482
Compare
Pool tracking is now always enabled for UMF and cannot be disabled, so we can remove these ifdefs.
if (auto Graph = h.getCommandGraph(); Graph) { | ||
// Check if the pointer to be freed has an associated allocation node, and | ||
// error if not | ||
if (!Graph->getMemPool().hasAllocation(ptr)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a section to the design doc on the async malloc/free impl. Doesn't need to be a lot, but could note things like free being a no-op as we don't do memory reallocation yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added some wording for these now, thanks!
sycl/source/detail/graph_impl.hpp
Outdated
|
||
// The number of live executable graphs that have been created from this | ||
// modifiable graph | ||
size_t MExecGraphCount = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be atomic?
Also, It's also interesting that the spec for finalize error says to throw if "the graph has previously been finalized."
not just to throw if you finalize while there is already an alive executable graph.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, It's also interesting that the spec for finalize error says to throw if "the graph has previously been finalized." not just to throw if you finalize while there is already an alive executable graph.
Ah good catch, that was some old wording, I've updated the spec now to reflect the correct behaviour as shown here.
should this be atomic?
Hmmm, maybe it should be yeah. I was thinking the locks we have would be enough to avoid issues but actually because the decrement of this count is called from a constructor we can't use a lock. I think making it atomic will make it thread safe but with the usual caveats that ordering is not guaranteed. AKA if a user is destroying an executable graph and simultaneously trying to finalize a new one from different threads we can't guarantee a correct ordering in that case.
This patch should essentially be NFC. The way it's currently setup is still very hacky and could use further refactoring, but this patch just isolates the hack a little bit and makes it more obvious what's going on, as well as removing some code duplication.
… add respective warning (intel#17947) Currently some of our docs suggest using `get(properties_tag) {...}` member in kernel functors to specify kernel properties while some suggest using `get(properties_tag) const {...}`. But actually `get(properties_tag) {...}` might not work under certain conditions, so we should update the incorrect docs that they suggest readers to use `get(properties_tag) const {...}` rather than `get(properties_tag) {...}`. Also as @Pennycook suggested, added a diagnostic warning to indicate the user when: -they have declared a `get(properties_tag)` member in the kernel functor AND -they declared `get(properties_tag){...}` rather than `get(properties_tag) const {...}` Updated the tests containing the non-const version of `get(properties_tag)` as well. --------- Signed-off-by: Hu, Peisen <[email protected]>
The workflow was renamed.
The other change related to increasing the sizes is due to the fact that stride has to be > 64 bytes in order for 2d block read/write is correctly generated.
We need to set `cuda-path`/`hip-path` on Windows as the autodetection isn't implemented upstream, it seems only implemented for Linux. Also fix a double assign and a missed update in `E2EExpr` This was tested with some other pending changes [here](https://github.com/intel/llvm/actions/runs/14389689524/job/40353640119?pr=14114) and [here](https://github.com/intel/llvm/actions/runs/14389689487/job/40353627543?pr=14114). Signed-off-by: Sarnie, Nick <[email protected]>
f22eba9
to
39caa11
Compare
- Fix barrier wait cleanup of events given an out event such that an additional refcnt release is done and with multi command lists and in order, the active barrier is properly cleared before reassignment. Signed-off-by: Neil R. Spruit <[email protected]>
cleanup UMF build flags: UMF_ENABLE_POOL_TRACKING and UMF_BUILD_LIBUMF_POOL_DISJOINT have been removed from the latest UMF. The features that were previously enabled by these flags are now always enabled.
I fixed the LIT infra so this works now. AMDGPU one is coming later. Signed-off-by: Sarnie, Nick <[email protected]>
…k dashboard via sycl-docs.yml (intel#17522) This PR: - Introduces new CI to replace existing nightly benchmarking workflow - Adds a preliminary implementation of a comparison script alongside the existing benchmarking scripts: This comparison script is used by the CI to decide whether or not a benchmark result is over a (predefined) tolerance above historical median of benchmark results This is not a permanent solution: I am looking into exponential moving averages in the future - Adds code in sycl-docs.yml for bringing up new benchmark results dashboard on https://intel.github.io --------- Co-authored-by: Piotr Balcer <[email protected]> Co-authored-by: Łukasz Stolarczuk <[email protected]>
On the application side, kernel names can be retrieved as a const char* from the integration header or built-ins. On the library side, they are retrieved from the offload entries. With the recent introduction of the __sycl_unregister_lib implementation, there shouldn't be any need to store copies of those strings anymore.
…est (intel#17691) Replace the following e2e tests with unit test: ``` ProgramManager/multi_device_bundle/device_libs_and_caching.cpp ProgramManager/multi_device_bundle/build_twice.cpp KernelAndProgram/persistent-cache-multi-device.cpp ``` The reason is that the checks performed in those e2e tests (number of UR calls, etc.) can be done more effificently using unit tests, such checks in e2e tests are expensive. Another reason is that e2e tests use NEO-specific `CreateMultipleRootDevices=[num]` environment variable which is not guaranteed to work, in unit tests we can emulate as many devices as we want.
Follows what I started in intel#17838.
Logically, it affects subsequent `%{run}` lines so must be done in run-mode, not in build-mode.
Neil suggested we do this, and the prerelease versions probably won't have binaries anyway so trying to use the artifact will fail. Formatter made me fix some unrelated stuff too. --------- Signed-off-by: Sarnie, Nick <[email protected]>
* Move copyright notices out of file `__doc__` strings and into comments * Move function `__doc__` strings into functions
fe51208
to
f93ce75
Compare
After intel#18015 `wrap_kernel` is powerful enough to substitute it with just a minor tweak.
f93ce75
to
5638e1b
Compare
Remove the legacy implementation of emulation for OoO queues. This code predated the PI and UR layers in DPC++ because fpga OpenCL did not support OoO queues. Today, this type of thing should be done by the backend compiler or UR layer. --------- Signed-off-by: James Brodman <[email protected]>
Level Zero had a new full-quality release and it has binaries so we can use it now. Also, we've had this issue forever: ``` The following packages have unmet dependencies: intel-igc-cm : Depends: intel-igc-core (>= 1.0.9995) but it is not installabl ``` The problem was the prerelease version of CM has bad dependencies, use the latest released version which works fine, and we don't really use CM anyway. Formatter made me fix some unrelated areas again. --------- Signed-off-by: Sarnie, Nick <[email protected]>
Sigh Signed-off-by: Sarnie, Nick <[email protected]>
`std::ignore` seems to have been adopted as some sort of ersatz `(void)` cast or `[[maybe_unused]]` annotation on unused parameters. Since the following are true: 1. The behavior of `std::ignore` outside of `std::tie` is not formally specified; 2. C++ already has builtin ways to express ignorance of a value; 3. Compiling `std::tie` is much more expensive than using the builtin alternatives [1]; 4. It's confusing to read; 5. It requires an extra header inclusion: I decided to simply omit the argument name in all cases where this is possible. This is a mostly completely mechanical transformation, using the following script: ``` find unified-runtime -name "*.[ch]pp" \ | xargs sed -i '/^\s*std::ignore\s*=\s*[a-zA-Z_][a-zA-Z0-9_]*;/d' \ && git ls-files -m | parallel clang-tidy \ '--checks="-*,misc-unused-parameters"' --fix --fix-errors -p build ``` Followed by manually introducting `[[maybe_unused]]` in `NDEBUG` blocks in a few places. [1]: Both clang and gcc with libstdc++ and libc++ show similar slowdowns compiling the std::ignore vs void. I didn't try with MSVC but I see no reason it wouldn't be similar ``` $ ipython3 In [1]: def ignore(n): ...: ignores = 'std::ignore = x;\n' * n ...: return f'''#include <tuple>\nint foo(int x) {{ {ignores} \n return x;}}''' ...: In [2]: def void(n): ...: voids = '(void)x;\n' * n ...: return f'''int foo(int x) {{ {voids} \n return x;\n}}''' ...: In [3]: with open("ignore.cpp", "w") as f: ...: f.write(ignore(100000)) ...: In [4]: with open("voids.cpp", "w") as f: ...: f.write(void(100000)) ...: $ time c++ -S voids.cpp c++ -S voids.cpp 0.18s user 0.03s system 99% cpu 0.211 total $ time c++ -S ignore.cpp c++ -S ignore.cpp 4.50s user 0.34s system 99% cpu 4.836 total ```
I manually tested it and it finally works this time. Signed-off-by: Sarnie, Nick <[email protected]>
handler's secondary queue is not used for any meaningful work.
…ion (intel#18023) As agreement in intel#16729 , In general we discourage the use of the extension strings as it is a remnant of the OpenCL days. Instead we want to use urDeviceGetInfo when possible. This is a follow-up pr to replace checking OpenCL extension string with urGetDeviceInfo for native bfloat16 conversion extension. --------- Signed-off-by: jinge90 <[email protected]> Co-authored-by: Steffen Larsen <[email protected]>
…ed diagnostics if kernel violates restrictions (intel#17954) This PR adds test for free function kernels extension that checks emitted diagnostics if kernel violates restrictions specified in spec for declaration of free function kernel. https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/proposed/sycl_ext_oneapi_free_function_kernels.asciidoc#defining-a-free-function-kernel Test introduced based on test plan: intel#17886
…tel#18026) * Disable inserting unreachable instruction since we have another mechanism to exit current kernel execution. * Cleanup clean shadow after it's allocated. * Remove redzone maximum limitation.
Change intel#17095 introduced a failure in the logger_validation test due to an incorrect merging of the default and "current" log level. This change reverts to the old behavior where an invalid `output` value causes the logger to use the default `level`.
…ntel#18049) The latest asan libdevice source triggers a compiler warning: llvm/libdevice/sanitizer/asan_rtl.cpp:342:3: warning: label at end of compound statement is a C++23 extension [-Wc++23-extensions] This trivial pr removes it. Signed-off-by: jinge90 <[email protected]>
…18029) Linear IDs (the `0` in `opencl:0`) did not function as intended when there was multiple platforms; each platform started counting at zero. This patch changes it so that they start counting at an appropriate number that should match `urinfo`. `urinfo` itself has also been updated to not display linear IDs when the device list is being filtered, as they will not be correct. This matches the behavior of sycl-ls and is different to the `--no-linear-ids` which changes the format of the output.
5638e1b
to
306b5a2
Compare
And also disable aws-stop in sycl-rel nightly if aws-start was skipped.
… message (intel#17990) Based on the user's feedback in intel#17957 and intel#17914 --------- Co-authored-by: Steffen Larsen <[email protected]>
Follow up of intel#17967 Secondary submission queue is optional in SYCL Spec and it has been decided to remove it.
…ntel#18044) None of the E2E tests fail in "preview" mode so the functionality hasn't even been tested for awhile. Separate the code into a separate legacy helper to avoid touching in future refactorings for the up-to-date code.
…#18065) No value in thin wrappers requiring extra instantiation for every kernel.
Fix cts segment fault on accessing shared usm
Also set default name to nameless globals since sanitizer cannot handle nameless globals.
…tel#17929) Removes the remaining fusion implementation and refactors the `sycl-jit` library to better capture the current use-cases: SYCL runtime compilation (RTC), and specialization constant materialization (SCM). High-level notes: - Split the entrypoints for RTC, SCM and option handling, formerly in `KernelFusion[.h|.cpp]`, into separate files - A significant amount of infrastructure became unused after removing the fusion implementation. The remaining facilities are: - CMake setup to link LLVM libraries and produce a pass plugin library - Translation from LLVM module to SPIRV and PTX/AMDGCN (loading and linking multiple SPIRV or LLVM modules was unused and removed) - `JITContext` class owning "binaries" (could be SPIRV blob or PTX/AMDGCN text). At the interface, we use `JITBinaryInfo` (fka `SYCLKernelBinaryInfo`), which is a non-owning descriptor type - Option handling - Dropped "fusion" terminology, and mostly dropped "kernel" terminology, because RTC and SCM deal with device images (runtime-side) respectively LLVM modules (jit-side). One exception is the translation helper, which optionally takes a kernel name, which is in turned used by the PTX/AMDGCN translation to obtain target attributes from the corresponding kernel function. - The `sycl-jit-common` library (in `sycl-jit/common` directory) was removed because it became header-only and the headers were no longer used by the pass library (so could be moved to `sycl-jit/jit-compiler/include`) --------- Signed-off-by: Julian Oppermann <[email protected]> Co-authored-by: Steffen Larsen <[email protected]>
…8055) This fixes debug builds of UR with CUDA enabled; in that configuration the `test-adapter-cuda` emits a call to `umf::umf2urResult` which requires an implementaiton of `umf::getProviderNativeError`.
…efore calling func (intel#17681) Its PR migrated from oneapi-src/unified-runtime#2746 It looks like a common typo when writing functions. Fixed functions: - urEnqueueMemBufferWriteRect() - urEnqueueMemBufferReadRect() Co-authored-by: Nicolas Miller <[email protected]>
- Allocations managed via new graph_mem_pool - Device allocations use virtual memory - Intercept async_alloc calls when adding nodes to graph - New tests for functionality - Design doc information about implementation
306b5a2
to
3d6204f
Compare