Add synchronization calls after hipMemcpyDeviceToDevice #1850
Conversation
Can you check if CUDA needs the same?
I looked and honestly I have no idea. I can check more next week.
This thread has some insight; the answer is that we probably don't need to sync, but the copy isn't host-synchronous. I'm not sure if HIP behaves the same way, i.e., whether it's okay that it's not synchronous on the host. https://forums.developer.nvidia.com/t/are-cudamemcpy-and-cudamalloc-blocking-synchronous/308368/2
Did you have a use case for HIP where you saw a problem from this? I believe we only ever use one queue for CUDA/HIP, right? Then I don't think it should matter if it's asynchronous with respect to the host -- the host can move on and issue other commands without waiting for the call to finish, but anything else using that memory on the device will be submitted to the same queue and have to "wait its turn" before starting. The same goes for the beginning of the copy: it will wait for previous kernels to finish because they're all using the same queue, right?
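For concreteness, here's the in-order behavior I'm describing, as a standalone HIP sketch (illustrative only, not libCEED code): the kernel and the copy go to the same stream, so the copy can't start on the device until the kernel finishes, even though neither call blocks the host.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

#define HIP_CHECK(call)                                                     \
  do {                                                                      \
    hipError_t err_ = (call);                                               \
    if (err_ != hipSuccess) {                                               \
      fprintf(stderr, "HIP error %s at %s:%d\n", hipGetErrorString(err_),   \
              __FILE__, __LINE__);                                          \
      return 1;                                                             \
    }                                                                       \
  } while (0)

// Stand-in kernel that scales a buffer in place
__global__ void scale(double *x, double alpha, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= alpha;
}

int main() {
  const int n = 1 << 20;
  double *a, *b;
  HIP_CHECK(hipMalloc(&a, n * sizeof(double)));
  HIP_CHECK(hipMalloc(&b, n * sizeof(double)));
  HIP_CHECK(hipMemset(a, 0, n * sizeof(double)));

  hipStream_t stream;
  HIP_CHECK(hipStreamCreate(&stream));

  // Both operations go to the same stream: the copy cannot start on the
  // device until the kernel has finished, regardless of when the host
  // returns from these calls.
  hipLaunchKernelGGL(scale, dim3((n + 255) / 256), dim3(256), 0, stream, a, 2.0, n);
  HIP_CHECK(hipMemcpyAsync(b, a, n * sizeof(double), hipMemcpyDeviceToDevice, stream));

  // A host-side sync is only needed before the *host* consumes the result.
  HIP_CHECK(hipStreamSynchronize(stream));

  HIP_CHECK(hipStreamDestroy(stream));
  HIP_CHECK(hipFree(a));
  HIP_CHECK(hipFree(b));
  return 0;
}
```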
We run the suboperators of a composite operator on separate streams. What would happen if we
I don't entirely understand null stream semantics, but I think that null stream operations block all other streams?
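A minimal sketch of my understanding, in plain HIP rather than anything from libCEED: with the legacy default-stream semantics, NULL-stream work is implicitly ordered against work in ordinary ("blocking") streams, and only streams created with hipStreamNonBlocking opt out of that.

```cpp
#include <hip/hip_runtime.h>

int main() {
  const size_t bytes = 1 << 20;
  void *a, *b, *c, *d;
  hipMalloc(&a, bytes);
  hipMalloc(&b, bytes);
  hipMalloc(&c, bytes);
  hipMalloc(&d, bytes);

  hipStream_t blocking, non_blocking;
  // Ordinary stream: implicitly synchronized with the NULL stream under the
  // legacy default-stream semantics.
  hipStreamCreate(&blocking);
  // Non-blocking stream: opts out of that implicit synchronization.
  hipStreamCreateWithFlags(&non_blocking, hipStreamNonBlocking);

  // 1) Work submitted to an ordinary stream.
  hipMemcpyAsync(b, a, bytes, hipMemcpyDeviceToDevice, blocking);
  // 2) Work on the NULL stream: it waits for (1) to finish, and later work
  //    submitted to `blocking` will wait for it in turn.
  hipMemcpyAsync(c, b, bytes, hipMemcpyDeviceToDevice, 0);
  // 3) Work on the non-blocking stream: not implicitly ordered against (2).
  hipMemcpyAsync(d, a, bytes, hipMemcpyDeviceToDevice, non_blocking);

  hipDeviceSynchronize();
  hipStreamDestroy(blocking);
  hipStreamDestroy(non_blocking);
  hipFree(a); hipFree(b); hipFree(c); hipFree(d);
  return 0;
}
```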
Wait, how does that work? I thought only the SYCL backends implemented |
Ah, I see. So in general, a Looking at the code in 6eee1ff, it looks like we're already syncing each stream before moving on, so I don't see a problem. I assume there's no "crossing the streams" between the suboperators, which is why we can put them in separate streams to start? Even if there were dependencies, it'd be better to have a way to handle that other than putting a whole device sync in
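For reference, the "sync each stream before moving on" pattern looks roughly like this; an illustrative HIP sketch with made-up names, not the actual backend code.

```cpp
#include <hip/hip_runtime.h>

#define NUM_SUB 4  // hypothetical number of suboperators

__global__ void sub_operator_kernel(double *x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] += 1.0;  // stand-in for a suboperator apply
}

int main() {
  const int n = 1 << 18;
  double *buffers[NUM_SUB];
  hipStream_t streams[NUM_SUB];

  for (int i = 0; i < NUM_SUB; i++) {
    hipMalloc(&buffers[i], n * sizeof(double));
    hipMemset(buffers[i], 0, n * sizeof(double));
    hipStreamCreate(&streams[i]);
  }

  // Launch each "suboperator" in its own stream; they may run concurrently
  // because they touch disjoint buffers.
  for (int i = 0; i < NUM_SUB; i++)
    hipLaunchKernelGGL(sub_operator_kernel, dim3((n + 255) / 256), dim3(256), 0,
                       streams[i], buffers[i], n);

  // Sync each stream before returning to the caller, so later work on the
  // NULL stream sees completed results without a full device sync elsewhere.
  for (int i = 0; i < NUM_SUB; i++)
    hipStreamSynchronize(streams[i]);

  for (int i = 0; i < NUM_SUB; i++) {
    hipStreamDestroy(streams[i]);
    hipFree(buffers[i]);
  }
  return 0;
}
```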
Oh wait, you mean on the "other side", i.e., before we even start the suboperators in their streams?
Right, we could theoretically copy data from a user's device pointer to our own before executing a suboperator with that memory |
From the user's perspective, the
We're the ones doing the operating
By "we", do you mean libCEED itself -- but as called by some other code/application (e.g., could be previous calls to libCEED that haven't finished yet)? Right now, the operator apply with suboperators is the outlier that behaves differently than the other backend functions for CUDA/HIP, so I think ideally, it should be that function's "responsibility" to handle it, rather than forcing a device sync on every devicetodevice copy call (which might mean we end up waiting on another queue that has no impact on us). But to make sure that previous libCEED calls are finished before we split into the other streams for the suboperators, we'd have to add a sync on the NULL stream (which all the other libCEED calls use)...which has the same effect, since it forces any other streams to wait, too. And that would probably be worse in terms of performance since I don't know how often the devicetodevice copy is actually getting called in a typical workflow... Unless you want to implement |
This change is supposed to guard the situation where
|
What about the same scenario, but where we replace steps 1 and 2 with "user launches another operation on the NULL stream that affects the input data of the operator"? We'd have the same problem of not waiting for it to be done before starting to apply the suboperators in their own streams, but the sync in the device-to-device copy wouldn't help us (because we're not calling that). I guess we could've always had this problem if the user was doing something else in a non-NULL stream to the input data of a libCEED function. (But then I would also say it's the user's responsibility to add the sync, because they're the ones using other streams, when libCEED was consistent about always using NULL.) Now, by using other streams for the suboperators, have we added the possibility that even if they're using the NULL stream only, there could be ordering problems when applying a libCEED Operator? I guess the
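If we did want the suboperator streams to respect prior NULL-stream work without blocking the host, the usual pattern is to record an event on the NULL stream and have each suboperator stream wait on it on the device. A sketch with made-up names, applicable whether or not the streams are created as blocking streams:

```cpp
#include <hip/hip_runtime.h>

#define NUM_SUB 4  // hypothetical number of suboperators

int main() {
  hipStream_t streams[NUM_SUB];
  for (int i = 0; i < NUM_SUB; i++) hipStreamCreate(&streams[i]);

  hipEvent_t prior_work_done;
  hipEventCreate(&prior_work_done);

  // Mark the point in the NULL stream that the suboperators must not
  // overtake (e.g., a user's earlier operation on the operator's input).
  hipEventRecord(prior_work_done, 0 /* NULL stream */);

  // Each suboperator stream waits on the event *on the device*; the host is
  // not blocked, unlike hipStreamSynchronize or hipDeviceSynchronize.
  for (int i = 0; i < NUM_SUB; i++)
    hipStreamWaitEvent(streams[i], prior_work_done, 0);

  // ... launch suboperator kernels in streams[i] here ...

  for (int i = 0; i < NUM_SUB; i++) hipStreamSynchronize(streams[i]);
  hipEventDestroy(prior_work_done);
  for (int i = 0; i < NUM_SUB; i++) hipStreamDestroy(streams[i]);
  return 0;
}
```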
If a user is bypassing our access model, there's nothing we can do about that. It will always be a bug.
No, we synchronize before returning |
I know that; the question was just a rephrasing of my point about things not finishing before the Operator applies (i.e., that's where we introduce the ordering problem). In fact, I take back what I said about the Vector state monitoring, because
I'm going to say that this is probably fine; if we later observe any strange behavior, we can reopen.
Turns out that hipMemcpy with hipMemcpyDeviceToDevice is asynchronous on the host. This MR adds synchronization calls to prevent any potential race conditions.
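For readers landing here later, the kind of guard being described looks roughly like the sketch below; the exact call sites in the patch, and whether it uses hipStreamSynchronize or hipDeviceSynchronize, may differ.

```cpp
#include <hip/hip_runtime.h>

// Copy between two device buffers and make sure the copy has completed on the
// device before anyone else (another stream, or the host) relies on `dst`.
static int copy_device_to_device(void *dst, const void *src, size_t bytes) {
  hipError_t err = hipMemcpy(dst, src, bytes, hipMemcpyDeviceToDevice);
  if (err != hipSuccess) return (int)err;
  // hipMemcpy with hipMemcpyDeviceToDevice can return before the copy has
  // finished on the device, so add an explicit synchronization here.
  err = hipStreamSynchronize(0 /* NULL stream */);
  return (int)err;
}

int main() {
  const size_t bytes = 1 << 20;
  void *src, *dst;
  hipMalloc(&src, bytes);
  hipMalloc(&dst, bytes);
  hipMemset(src, 0, bytes);

  copy_device_to_device(dst, src, bytes);

  hipFree(src);
  hipFree(dst);
  return 0;
}
```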