
Conversation

@zatkins-dev
Collaborator

It turns out that hipMemcpy with hipMemcpyDeviceToDevice is asynchronous with respect to the host. This PR adds synchronization calls to prevent any potential race conditions.
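For context, a minimal sketch of the pattern this fix implies; the helper name is made up and this is not the actual PR code:

```c
// Minimal sketch (hypothetical helper, not the actual PR code).
// hipMemcpy with hipMemcpyDeviceToDevice can return to the host before the
// copy has executed on the device, so a stream sync restores host-visible
// ordering.
#include <hip/hip_runtime.h>

static int CopyDeviceToDeviceSync(void *dst, const void *src, size_t bytes) {
  // Enqueued on the NULL stream; may still be pending when this call returns
  if (hipMemcpy(dst, src, bytes, hipMemcpyDeviceToDevice) != hipSuccess) return 1;
  // Block the host until the NULL stream (and thus the copy) has completed
  if (hipStreamSynchronize(NULL) != hipSuccess) return 1;
  return 0;
}
```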

@zatkins-dev force-pushed the zach/hip-sync-dev-to-dev branch from ec70275 to ced1c5d on July 2, 2025 at 16:45
@jeremylt
Member

jeremylt commented Jul 2, 2025

Can you check if CUDA needs the same?

@zatkins-dev
Collaborator Author

I looked and honestly I have no idea. I can check more next week.

@zatkins-dev
Collaborator Author

This thread has some insight: cudaMemcpy with cudaMemcpyDeviceToDevice isn't host-synchronous either, but the answer is that we probably don't need to sync. I'm not sure if HIP works the same way, i.e., whether it's okay that it's not synchronous on the host.

https://forums.developer.nvidia.com/t/are-cudamemcpy-and-cudamalloc-blocking-synchronous/308368/2

@nbeams
Contributor

nbeams commented Jul 3, 2025

Did you have a use case for HIP where you saw a problem from this?

I believe we only ever use one queue for CUDA/HIP, right? Then I don't think it should matter if it's asynchronous with respect to the host: the host can move on and issue other commands without waiting on the call to finish, but anything else using that memory on the device will be submitted to the same queue and have to "wait its turn" before starting. The same goes for the beginning of the copy: it will wait for previous kernels to finish because they're all using the same queue, right?
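For illustration, a stand-alone sketch of the single-queue argument (toy kernel, not libCEED code): work submitted to one stream executes in submission order, so host-side asynchrony alone can't reorder it.

```c
#include <hip/hip_runtime.h>

__global__ void scale(double *x, double a, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= a;
}

void same_queue_ordering(double *d_a, double *d_b, size_t n, hipStream_t s) {
  // The kernel is enqueued first...
  hipLaunchKernelGGL(scale, dim3((n + 255) / 256), dim3(256), 0, s, d_a, 2.0, n);
  // ...so this copy cannot begin until the kernel finishes: same queue, FIFO
  // order, even though the host returns from both calls immediately.
  hipMemcpyAsync(d_b, d_a, n * sizeof(double), hipMemcpyDeviceToDevice, s);
}
```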

@jeremylt
Member

jeremylt commented Jul 3, 2025

We run the suboperators of a composite operator on separate streams.

What would happen if we

  1. do a device-to-device transfer that doesn't immediately resolve
  2. get the address of the (stale) memory, believing it's up to date
  3. launch an operator kernel with this stale memory

@zatkins-dev
Collaborator Author

I don't entirely understand NULL stream semantics, but I think that NULL stream operations block all other streams?
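For reference, the legacy default-stream rules as I understand them from the CUDA/HIP documentation: the NULL stream synchronizes with ordinary ("blocking") streams, but not with streams created with hipStreamNonBlocking, so the answer depends on how the other streams were created.

```c
#include <hip/hip_runtime.h>

void null_stream_semantics(void) {
  hipStream_t blocking, non_blocking;
  // Default-created streams synchronize with the NULL stream...
  hipStreamCreate(&blocking);
  // ...but streams created with hipStreamNonBlocking opt out of that ordering
  hipStreamCreateWithFlags(&non_blocking, hipStreamNonBlocking);

  // Work on the NULL stream is implicitly ordered against "blocking", while
  // work on "non_blocking" may overlap with NULL-stream work freely.

  hipStreamDestroy(blocking);
  hipStreamDestroy(non_blocking);
}
```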

@nbeams
Contributor

nbeams commented Jul 3, 2025

We run the suboperators of a composite operator on separate streams.

Wait, how does that work? I thought only the SYCL backends implemented CeedSetStream (since there's no such thing as a NULL stream for SYCL)?

@jeremylt
Member

jeremylt commented Jul 3, 2025

6eee1ff

@nbeams
Contributor

nbeams commented Jul 3, 2025

Ah, I see. So in general, a Ceed object is associated with one stream only, and everything is submitted there, with the suboperators being the exception? But then for each suboperator, everything happens in its own stream.

Looking at the code in 6eee1ff, it looks like we're already syncing each stream before moving on, so I don't see a problem. I assume there's no "crossing the streams" between the suboperators, which is why we can put them in separate streams in the first place? Even if there were dependencies, it'd be better to have a way to handle that other than putting a whole device sync in CeedSetDeviceGenericArray, which would be more heavy-handed than we really need.
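A rough, hypothetical sketch of the per-suboperator stream pattern being described (made-up names; not the actual code in 6eee1ff):

```c
#include <hip/hip_runtime.h>

#define NUM_SUB 4

__global__ void suboperator_kernel(double *x, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] += 1.0;  // stand-in for the real suboperator work
}

void apply_composite_sketch(double *d_data[NUM_SUB], size_t n,
                            hipStream_t streams[NUM_SUB]) {
  // Launch each suboperator on its own stream; they may run concurrently
  for (int i = 0; i < NUM_SUB; i++)
    hipLaunchKernelGGL(suboperator_kernel, dim3((n + 255) / 256), dim3(256), 0,
                       streams[i], d_data[i], n);
  // Sync every stream before returning, so callers on the NULL stream only
  // see completed results; this is the "already syncing" referred to above
  for (int i = 0; i < NUM_SUB; i++) hipStreamSynchronize(streams[i]);
}
```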

@nbeams
Contributor

nbeams commented Jul 3, 2025

Oh wait, you mean on the "other side", i.e., before we even start the suboperators in their streams?

@jeremylt
Member

jeremylt commented Jul 3, 2025

Right, we could theoretically copy data from a user's device pointer to our own before executing a suboperator with that memory

@nbeams
Contributor

nbeams commented Jul 3, 2025

From the user's perspective, the Ceed object is always/only using the NULL stream (for CUDA/HIP). If they are trying to give us input memory that's being operated on by another stream, I'd say it's their responsibility to put the sync command before calling a libCEED operation on it, no?
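Concretely, the user-side responsibility might look like this (vec, user_stream, and d_data are placeholders; CeedVectorSetArray is the real libCEED call):

```c
#include <ceed.h>
#include <hip/hip_runtime.h>

void hand_off_to_libceed(CeedVector vec, hipStream_t user_stream,
                         CeedScalar *d_data) {
  // Finish the user's in-flight work on d_data...
  hipStreamSynchronize(user_stream);
  // ...before libCEED copies from it on the NULL stream
  CeedVectorSetArray(vec, CEED_MEM_DEVICE, CEED_COPY_VALUES, d_data);
}
```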

@jeremylt
Member

jeremylt commented Jul 3, 2025

We're the ones doing the operating

@nbeams
Contributor

nbeams commented Jul 3, 2025

By "we", do you mean libCEED itself -- but as called by some other code/application (e.g., could be previous calls to libCEED that haven't finished yet)?

Right now, the operator apply with suboperators is the outlier that behaves differently than the other backend functions for CUDA/HIP, so I think ideally it should be that function's "responsibility" to handle this, rather than forcing a device sync on every device-to-device copy call (which might mean we end up waiting on another queue that has no impact on us). But to make sure that previous libCEED calls are finished before we split into the other streams for the suboperators, we'd have to add a sync on the NULL stream (which all the other libCEED calls use)... which has the same effect, since it forces any other streams to wait, too. And that would probably be worse in terms of performance, since I don't know how often the device-to-device copy is actually getting called in a typical workflow...

Unless you want to implement CeedSetStream for CUDA/HIP and say that using composite operators requires using libCEED with a non-NULL stream. Then we could wait on that stream at the beginning of CeedOperatorApplyAddComposite_[Hip/Cuda]_gen, before splitting into the separate streams for the suboperators. But that's a big change.

@jeremylt
Member

jeremylt commented Jul 3, 2025

This change is supposed to guard the situation where

  1. User hands us a device memory pointer and tells us to copy the data into a CeedVector
  2. We request hipMemcpyDeviceToDevice on the NULL stream
  3. The user then requests us to apply a composite operator
  4. We launch the suboperators on their own streams so they can run in parallel
  5. We have no idea if the hipMemcpyDeviceToDevice actually executed at this point or not

hipMemcpyDeviceToDevice is a rather infrequent call (we don't really use it internally), so I'd rather put the performance impact on the infrequent call than slow down composite operator application, since that is the core thing libCEED has to do fast.
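To make the hazard concrete, a stand-alone sketch of steps 1-5 above, assuming the suboperator streams are created with hipStreamNonBlocking so they have no implicit ordering against the NULL stream (names are made up):

```c
#include <hip/hip_runtime.h>

__global__ void suboperator(const double *in, double *out, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = 2.0 * in[i];
}

void race_sketch(double *d_vec, const double *d_user, double *d_out, size_t n) {
  hipStream_t sub_stream;
  hipStreamCreateWithFlags(&sub_stream, hipStreamNonBlocking);

  // Steps 1-2: copy the user's data into our storage on the NULL stream;
  // the call can return before the copy has executed on the device
  hipMemcpy(d_vec, d_user, n * sizeof(double), hipMemcpyDeviceToDevice);

  // Steps 3-5: without hipStreamSynchronize(NULL) here, this kernel on a
  // non-blocking stream may read d_vec before the copy has landed
  hipLaunchKernelGGL(suboperator, dim3((n + 255) / 256), dim3(256), 0,
                     sub_stream, d_vec, d_out, n);

  hipStreamSynchronize(sub_stream);
  hipStreamDestroy(sub_stream);
}
```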

@nbeams
Contributor

nbeams commented Jul 3, 2025

What about the same scenario, but where we replace steps 1 and 2 with "user launches another operation on the NULL stream that affects the input data of the operator"? We'd have the same problem of not waiting for it to be done before starting to apply the suboperators in their own streams, but the sync in the device-to-device copy wouldn't help us (because we're not calling it).

I guess we could've always had this problem if the user was doing something else in a non-NULL stream to input data for a libCEED function. (But then I would also say it's the user's responsibility to add the sync, because they're the ones using other streams when libCEED was consistent about always using NULL.) Now, by using other streams for the suboperators, have we added the possibility that, even if they're using the NULL stream only, there could be ordering problems when applying a libCEED Operator? I guess the CeedVector state checking would cause an error if we were limiting ourselves solely to libCEED calls (e.g., the output of one operator becomes the input to another). But what if a libCEED Vector is using a data array that other code/libraries also have access to and could be modifying prior to the libCEED Operator apply?

@jeremylt
Member

jeremylt commented Jul 3, 2025

If a user is bypassing our access model, there's nothing we can do about that. It will always be a bug.

have we added the possibility that even if they're using the NULL stream only, there could be ordering problems when applying a libCEED Operator?

No, we synchronize before returning

@nbeams
Contributor

nbeams commented Jul 4, 2025

No, we synchronize before returning

I know that; the question was just a rephrasing of my point about things not finishing before the Operator applies (i.e., that's where we introduce the ordering problem). In fact, I take back what I said about the Vector state monitoring, because CeedVectorRestoreArray also doesn't wait on any kernels to finish; it's just a host operation, right? So a previous CeedOperator could have launched its last/only kernel, writing output into a Vector, and then the composite operator starts with the same Vector as input, but its data isn't finished being updated on the device yet. When everyone uses the same stream, it doesn't matter; if we introduce other streams, it (theoretically) could?
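The pure-libCEED sequence being described might look like this (operator and vector names are placeholders; CeedOperatorApply and CEED_REQUEST_IMMEDIATE are the real API):

```c
#include <ceed.h>

void ordering_sketch(CeedOperator op_first, CeedOperator op_composite,
                     CeedVector x, CeedVector u, CeedVector y) {
  // op_first launches a kernel that writes into u; the apply returns to the
  // host without waiting for that kernel to finish
  CeedOperatorApply(op_first, x, u, CEED_REQUEST_IMMEDIATE);
  // The composite operator then reads u from its suboperators' streams; with
  // a single shared queue this is safe, but with separate non-blocking
  // streams it could in principle start before u is fully written
  CeedOperatorApply(op_composite, u, y, CEED_REQUEST_IMMEDIATE);
}
```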

@zatkins-dev
Collaborator Author

I'm going to say that this is probably fine; if we later observe any strange behavior, we can reopen.
