
Conversation

@zatkins-dev
Collaborator

It turns out that hipMemcpy with hipMemcpyDeviceToDevice is asynchronous with respect to the host. This PR adds synchronization calls to prevent any potential race conditions.
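For context, a minimal sketch of the pattern this fix implies; the helper name is made up and this is not the actual PR code:

```c
// Minimal sketch (hypothetical helper, not the actual PR code).
// hipMemcpy with hipMemcpyDeviceToDevice can return to the host before the
// copy has executed on the device, so a stream sync restores host-visible
// ordering.
#include <hip/hip_runtime.h>

static int CopyDeviceToDeviceSync(void *dst, const void *src, size_t bytes) {
  // Enqueued on the NULL stream; may still be pending when this call returns
  if (hipMemcpy(dst, src, bytes, hipMemcpyDeviceToDevice) != hipSuccess) return 1;
  // Block the host until the NULL stream (and thus the copy) has completed
  if (hipStreamSynchronize(NULL) != hipSuccess) return 1;
  return 0;
}
```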

@zatkins-dev force-pushed the zach/hip-sync-dev-to-dev branch from ec70275 to ced1c5d on July 2, 2025 at 16:45
@jeremylt
Member

jeremylt commented Jul 2, 2025

Can you check if CUDA needs the same?

@zatkins-dev
Collaborator Author

I looked and honestly I have no idea. I can check more next week.

@zatkins-dev
Collaborator Author

This thread has some insight: cudaMemcpy with cudaMemcpyDeviceToDevice isn't host-synchronous either, but the answer is that we probably don't need to sync. I'm not sure if HIP works the same way, i.e., whether it's okay that it's not synchronous on the host.

https://forums.developer.nvidia.com/t/are-cudamemcpy-and-cudamalloc-blocking-synchronous/308368/2

@nbeams
Contributor

nbeams commented Jul 3, 2025

Did you have a use case for HIP where you saw a problem from this?

I believe we only ever use one queue for CUDA/HIP, right? Then I don't think it should matter if it's asynchronous with respect to the host: the host can move on and issue other commands without waiting on the call to finish, but anything else using that memory on the device will be submitted to the same queue and have to "wait its turn" before starting. The same goes for the beginning of the copy: it will wait for previous kernels to finish because they're all using the same queue, right?
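For illustration, a stand-alone sketch of the single-queue argument (toy kernel, not libCEED code): work submitted to one stream executes in submission order, so host-side asynchrony alone can't reorder it.

```c
#include <hip/hip_runtime.h>

__global__ void scale(double *x, double a, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= a;
}

void same_queue_ordering(double *d_a, double *d_b, size_t n, hipStream_t s) {
  // The kernel is enqueued first...
  hipLaunchKernelGGL(scale, dim3((n + 255) / 256), dim3(256), 0, s, d_a, 2.0, n);
  // ...so this copy cannot begin until the kernel finishes: same queue, FIFO
  // order, even though the host returns from both calls immediately.
  hipMemcpyAsync(d_b, d_a, n * sizeof(double), hipMemcpyDeviceToDevice, s);
}
```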

@jeremylt
Member

jeremylt commented Jul 3, 2025

We run the suboperators of a composite operator on separate streams.

What would happen if we

  1. do a device-to-device transfer that doesn't immediately resolve
  2. get the address of the (stale) memory, believing it's up to date
  3. launch an operator kernel with this stale memory

@zatkins-dev
Collaborator Author

I don't entirely understand NULL stream semantics, but I think that NULL stream operations block all other streams?
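For reference, the legacy default-stream rules as I understand them from the CUDA/HIP documentation: the NULL stream synchronizes with ordinary ("blocking") streams, but not with streams created with hipStreamNonBlocking, so the answer depends on how the other streams were created.

```c
#include <hip/hip_runtime.h>

void null_stream_semantics(void) {
  hipStream_t blocking, non_blocking;
  // Default-created streams synchronize with the NULL stream...
  hipStreamCreate(&blocking);
  // ...but streams created with hipStreamNonBlocking opt out of that ordering
  hipStreamCreateWithFlags(&non_blocking, hipStreamNonBlocking);

  // Work on the NULL stream is implicitly ordered against "blocking", while
  // work on "non_blocking" may overlap with NULL-stream work freely.

  hipStreamDestroy(blocking);
  hipStreamDestroy(non_blocking);
}
```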

@nbeams
Contributor

nbeams commented Jul 3, 2025

We run the suboperators of a composite operator on separate streams.

Wait, how does that work? I thought only the SYCL backends implemented CeedSetStream (since there's no such thing as a NULL stream for SYCL)?

@jeremylt
Member

jeremylt commented Jul 3, 2025

6eee1ff

@nbeams
Contributor

nbeams commented Jul 3, 2025

Ah, I see. So in general, a Ceed object is associated with one stream only, and everything is submitted there, with the suboperators being the exception? But then for each suboperator, everything happens in its own stream.

Looking at the code in 6eee1ff, it looks like we're already syncing each stream before moving on, so I don't see a problem. I assume there's no "crossing the streams" between the suboperators, which is why we can put them in separate streams in the first place? Even if there were dependencies, it'd be better to have a way to handle that other than putting a whole device sync in CeedSetDeviceGenericArray, which would be more heavy-handed than we really need.
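A rough, hypothetical sketch of the per-suboperator stream pattern being described (made-up names; not the actual code in 6eee1ff):

```c
#include <hip/hip_runtime.h>

#define NUM_SUB 4

__global__ void suboperator_kernel(double *x, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] += 1.0;  // stand-in for the real suboperator work
}

void apply_composite_sketch(double *d_data[NUM_SUB], size_t n,
                            hipStream_t streams[NUM_SUB]) {
  // Launch each suboperator on its own stream; they may run concurrently
  for (int i = 0; i < NUM_SUB; i++)
    hipLaunchKernelGGL(suboperator_kernel, dim3((n + 255) / 256), dim3(256), 0,
                       streams[i], d_data[i], n);
  // Sync every stream before returning, so callers on the NULL stream only
  // see completed results; this is the "already syncing" referred to above
  for (int i = 0; i < NUM_SUB; i++) hipStreamSynchronize(streams[i]);
}
```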

@nbeams
Contributor

nbeams commented Jul 3, 2025

Oh wait, you mean on the "other side", i.e., before we even start the suboperators in their streams?

@jeremylt
Member

jeremylt commented Jul 3, 2025

Right, we could theoretically copy data from a user's device pointer to our own before executing a suboperator with that memory

@nbeams
Contributor

nbeams commented Jul 3, 2025

From the user's perspective, the Ceed object is always/only using the NULL stream (for CUDA/HIP). If they are trying to give us input memory that's being operated on by another stream, I'd say it's their responsibility to put the sync command before calling a libCEED operation on it, no?
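Concretely, the user-side responsibility might look like this (vec, user_stream, and d_data are placeholders; CeedVectorSetArray is the real libCEED call):

```c
#include <ceed.h>
#include <hip/hip_runtime.h>

void hand_off_to_libceed(CeedVector vec, hipStream_t user_stream,
                         CeedScalar *d_data) {
  // Finish the user's in-flight work on d_data...
  hipStreamSynchronize(user_stream);
  // ...before libCEED copies from it on the NULL stream
  CeedVectorSetArray(vec, CEED_MEM_DEVICE, CEED_COPY_VALUES, d_data);
}
```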

@jeremylt
Member

jeremylt commented Jul 3, 2025

We're the ones doing the operating

@nbeams
Contributor

nbeams commented Jul 3, 2025

By "we", do you mean libCEED itself -- but as called by some other code/application (e.g., could be previous calls to libCEED that haven't finished yet)?

Right now, the operator apply with suboperators is the outlier that behaves differently than the other backend functions for CUDA/HIP, so I think ideally it should be that function's "responsibility" to handle this, rather than forcing a device sync on every device-to-device copy call (which might mean we end up waiting on another queue that has no impact on us). But to make sure that previous libCEED calls are finished before we split into the other streams for the suboperators, we'd have to add a sync on the NULL stream (which all the other libCEED calls use)... which has the same effect, since it forces any other streams to wait, too. And that would probably be worse in terms of performance, since I don't know how often the device-to-device copy is actually getting called in a typical workflow...

Unless you want to implement CeedSetStream for CUDA/HIP and say that using composite operators requires using libCEED with a non-NULL stream. Then we could wait on that stream at the beginning of CeedOperatorApplyAddComposite_[Hip/Cuda]_gen, before splitting into the separate streams for the suboperators. But that's a big change.

@jeremylt
Member

jeremylt commented Jul 3, 2025

This change is supposed to guard the situation where

  1. User hands us a device memory pointer and tells us to copy the data into a CeedVector
  2. We request hipMemcpyDeviceToDevice on the NULL stream
  3. The user then requests us to apply a composite operator
  4. We launch the suboperators on their own streams so they can run in parallel
  5. We have no idea if the hipMemcpyDeviceToDevice actually executed at this point or not

hipMemcpyDeviceToDevice is a rather infrequent call (we don't really use it internally), so I'd rather put the performance impact on the infrequent call than slow down composite operator application, since that is the core thing libCEED has to do fast.
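To make the hazard concrete, a stand-alone sketch of steps 1-5 above, assuming the suboperator streams are created with hipStreamNonBlocking so they have no implicit ordering against the NULL stream (names are made up):

```c
#include <hip/hip_runtime.h>

__global__ void suboperator(const double *in, double *out, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = 2.0 * in[i];
}

void race_sketch(double *d_vec, const double *d_user, double *d_out, size_t n) {
  hipStream_t sub_stream;
  hipStreamCreateWithFlags(&sub_stream, hipStreamNonBlocking);

  // Steps 1-2: copy the user's data into our storage on the NULL stream;
  // the call can return before the copy has executed on the device
  hipMemcpy(d_vec, d_user, n * sizeof(double), hipMemcpyDeviceToDevice);

  // Steps 3-5: without hipStreamSynchronize(NULL) here, this kernel on a
  // non-blocking stream may read d_vec before the copy has landed
  hipLaunchKernelGGL(suboperator, dim3((n + 255) / 256), dim3(256), 0,
                     sub_stream, d_vec, d_out, n);

  hipStreamSynchronize(sub_stream);
  hipStreamDestroy(sub_stream);
}
```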

@nbeams
Contributor

nbeams commented Jul 3, 2025

What about the same scenario, but where we replace steps 1 and 2 with "user launches another operation on the NULL stream that affects the input data of the operator"? We'd have the same problem of not waiting for it to be done before starting to apply the suboperators in their own streams, but the sync in the device-to-device copy wouldn't help us (because we're not calling it).

I guess we could've always had this problem if the user was doing something else in a non-NULL stream to input data for a libCEED function. (But then I would also say it's the user's responsibility to add the sync, because they're the ones using other streams when libCEED was consistent about always using NULL.) Now, by using other streams for the suboperators, have we added the possibility that, even if they're using the NULL stream only, there could be ordering problems when applying a libCEED Operator? I guess the CeedVector state checking would cause an error if we were limiting ourselves solely to libCEED calls (e.g., the output of one operator becomes the input to another). But what if a libCEED Vector is using a data array that other code/libraries also have access to and could be modifying prior to the libCEED Operator apply?

@jeremylt
Member

jeremylt commented Jul 3, 2025

If a user is bypassing our access model, there's nothing we can do about that. It will always be a bug.

have we added the possibility that even if they're using the NULL stream only, there could be ordering problems when applying a libCEED Operator?

No, we synchronize before returning

@nbeams
Contributor

nbeams commented Jul 4, 2025

No, we synchronize before returning

I know that; the question was just a rephrasing of my point about things not finishing before the Operator applies (i.e., that's where we introduce the ordering problem). In fact, I take back what I said about the Vector state monitoring, because CeedVectorRestoreArray also doesn't wait on any kernels to finish; it's just a host operation, right? So a previous CeedOperator could have launched its last/only kernel, writing output into a Vector, and then the composite operator starts with the same Vector as input, but its data isn't finished being updated on the device yet. When everyone uses the same stream, it doesn't matter; if we introduce other streams, it (theoretically) could?
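The pure-libCEED sequence being described might look like this (operator and vector names are placeholders; CeedOperatorApply and CEED_REQUEST_IMMEDIATE are the real API):

```c
#include <ceed.h>

void ordering_sketch(CeedOperator op_first, CeedOperator op_composite,
                     CeedVector x, CeedVector u, CeedVector y) {
  // op_first launches a kernel that writes into u; the apply returns to the
  // host without waiting for that kernel to finish
  CeedOperatorApply(op_first, x, u, CEED_REQUEST_IMMEDIATE);
  // The composite operator then reads u from its suboperators' streams; with
  // a single shared queue this is safe, but with separate non-blocking
  // streams it could in principle start before u is fully written
  CeedOperatorApply(op_composite, u, y, CEED_REQUEST_IMMEDIATE);
}
```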

@zatkins-dev
Collaborator Author

I'm going to say that this is probably fine; if we later observe any strange behavior, we can reopen.
