common: verbose: asynchronous verbose mode for execution time tracking #3055
base: main
src/gpu/intel/ocl/stream.cpp
Outdated
        return status::success;

    } else {
        cl_int err = clWaitForEvents(1, &async_tracked_event_);
Isn't this call synchronous? We enqueue a kernel, record an event and execution blocks here, until the kernel finishes. Am I missing something?
Yes. This was a fallback for failure cases, where the verbose info is then printed with the default stream.wait() calls. The implementation has been updated to avoid repetition.
Description
This PR proposes a PoC for an asynchronous verbose mode that tracks kernel execution times in a non-blocking manner with minimal synchronization latency. In the existing verbose mode, retrieving kernel timings incurs significant overhead for two reasons: GPU kernel execution must be synchronized, and the time is measured on the host.
The asynchronous mode removes the synchronization overhead by using event callbacks to query execution timings.
The prototype targets the OpenCL GPU API, which provides kernel execution statistics for profiling.
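The callback pattern described above can be sketched with the C++ standard library alone. This is a simulation under stated assumptions, not code from this PR: `FakeEvent`, `FakeQueue`, `TimingLog`, and `run_demo` are hypothetical names, and a worker thread stands in for the device. In OpenCL itself, the corresponding calls would be `clSetEventCallback` to register the completion callback and `clGetEventProfilingInfo` to read the device-side start and end timestamps.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a device event that carries profiling timestamps.
struct FakeEvent {
    std::chrono::steady_clock::time_point start;
    std::chrono::steady_clock::time_point end;
};

using Callback = std::function<void(const std::string &, const FakeEvent &)>;

// Hypothetical stand-in for a GPU queue: each "kernel" runs on a worker
// thread, and the completion callback fires on that thread, not on the host.
class FakeQueue {
public:
    // Returns immediately; the host thread never waits for the kernel here.
    void enqueue(const std::string &name, std::chrono::milliseconds work,
            Callback on_complete) {
        workers_.emplace_back([name, work, on_complete] {
            FakeEvent ev;
            ev.start = std::chrono::steady_clock::now();
            std::this_thread::sleep_for(work); // simulated kernel execution
            ev.end = std::chrono::steady_clock::now();
            on_complete(name, ev); // timing is reported from the callback
        });
    }

    // Drains all outstanding work; needed once, at the end.
    void finish() {
        for (auto &t : workers_)
            if (t.joinable()) t.join();
        workers_.clear();
    }

    ~FakeQueue() { finish(); }

private:
    std::vector<std::thread> workers_;
};

// Thread-safe sink for (kernel name, execution time in ms) records.
struct TimingLog {
    std::mutex m;
    std::vector<std::pair<std::string, double>> entries;

    void record(const std::string &name, const FakeEvent &ev) {
        assert(ev.end >= ev.start);
        double ms = std::chrono::duration<double, std::milli>(
                ev.end - ev.start).count();
        std::lock_guard<std::mutex> lock(m);
        entries.emplace_back(name, ms);
    }
};

// Enqueues two simulated kernels and collects their timings via callbacks.
std::vector<std::pair<std::string, double>> run_demo() {
    TimingLog log;
    {
        FakeQueue q;
        auto cb = [&log](const std::string &n, const FakeEvent &e) {
            log.record(n, e);
        };
        q.enqueue("kernel_a", std::chrono::milliseconds(20), cb);
        q.enqueue("kernel_b", std::chrono::milliseconds(10), cb);
        // The host could keep submitting work here without blocking.
        q.finish();
    }
    return log.entries;
}
```

Calling `run_demo()` from `main` and printing each entry gives one record per kernel. The order of the records depends on which simulated kernel finishes first, so consumers of such a log should not assume submission order.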
The implementation is enabled at run time with DNNL_ASYNC_VERBOSE=1.
Related RFC: [link]
Addresses MFDNN-13603.
Checklist