Update allocation and execution plan handling for CPU EP MemcpyToHost/MemcpyFromHost #27272
Open
Description
The CPU EP currently provides the MemcpyToHost/MemcpyFromHost kernel implementations and registrations for plugin EPs. However, allocation and execution planning does not account for the case where the producer or consumer nodes of these MemcpyToHost/MemcpyFromHost nodes (which run on the CPU) are on non-CPU devices.
MemcpyToHost and MemcpyFromHost nodes provided by the CPU EP require special handling in planning:
Determining the output device for MemcpyFromHost:
The MemcpyFromHost kernel registration uses the default memory type for its output. Because the node runs on the CPU EP, the planner would assign CPU memory to that output, even though the node may actually produce its output on the device of its consumer node's EP.
So the planner needs to check the consumer node's EP and set the output device accordingly.
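The output-device rule above can be illustrated with a minimal sketch. It does not use the real ONNX Runtime planner types: `Node`, `DeviceType`, `PlanOutputDevice`, and the EP name `HypotheticalGpuEP` used in the usage note are hypothetical stand-ins invented for illustration, and the actual change in this PR may be structured differently.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; these are NOT ONNX Runtime types.
enum class DeviceType { CPU, GPU };

struct Node {
  std::string op_type;                 // e.g. "MemcpyFromHost"
  std::string ep;                      // execution provider the node is assigned to
  DeviceType ep_device;                // default device of that execution provider
  std::vector<const Node*> consumers;  // nodes that read this node's output
};

// Returns the device on which the node's output would be planned.
// Assumption: a MemcpyFromHost node on the CPU EP takes the device of the
// first consumer that lives on a non-CPU device; every other node keeps the
// default device of its own EP.
inline DeviceType PlanOutputDevice(const Node& node) {
  const bool is_cpu_memcpy_from_host =
      node.op_type == "MemcpyFromHost" && node.ep == "CPUExecutionProvider";
  if (is_cpu_memcpy_from_host) {
    for (const Node* consumer : node.consumers) {
      if (consumer->ep_device != DeviceType::CPU) {
        return consumer->ep_device;
      }
    }
  }
  return node.ep_device;
}
```

In this sketch, a MemcpyFromHost node on the CPU EP whose consumer is assigned to `HypotheticalGpuEP` is planned with a GPU output, while any other CPU node keeps a CPU output even if its consumer is on a GPU.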
Making sure MemcpyToHost uses WaitNotificationOnHost:
The Notification class for plugin EPs and other provider-bridge EPs (e.g., CUDA EP and TRT EP) typically defines two types of wait functions:
(1) WaitNotificationOnDevice and (2) WaitNotificationOnHost
If a MemcpyToHost node (running on the host) consumes a tensor produced by a device node, it must use WaitNotificationOnHost, because WaitNotificationOnDevice requires a stream and the CPU device does not have one.
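The choice between the two wait functions can be sketched as follows. This is only an illustration under stated assumptions: `PlannedNode`, `WaitKind`, `kNoStream`, and `SelectWait` are hypothetical names, streams are modeled as plain integers, and the real planner logic may differ.

```cpp
#include <string>

// Hypothetical model: a node either runs on a numbered stream or has no
// stream at all (as assumed here for nodes on the CPU device).
constexpr int kNoStream = -1;

enum class WaitKind { NoWait, OnDevice, OnHost };

struct PlannedNode {
  std::string op_type;
  int stream_id;  // kNoStream when the node's device has no stream
};

// Picks the kind of wait the consumer needs before reading the producer's output.
inline WaitKind SelectWait(const PlannedNode& producer, const PlannedNode& consumer) {
  // Assumption: a producer without a stream has no notification to wait on.
  if (producer.stream_id == kNoStream) return WaitKind::NoWait;
  // Assumption: work queued on the same stream is already ordered.
  if (producer.stream_id == consumer.stream_id) return WaitKind::NoWait;
  // The consumer has no stream (e.g. MemcpyToHost on the CPU EP), so a
  // device-side wait is impossible and the host must block instead.
  if (consumer.stream_id == kNoStream) return WaitKind::OnHost;
  // Both sides have streams: the consumer's stream waits on the device.
  return WaitKind::OnDevice;
}
```

Here a MemcpyToHost node with `stream_id == kNoStream` that consumes the output of a node on stream 0 gets `WaitKind::OnHost`, which corresponds to the WaitNotificationOnHost case described above.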
Motivation and Context
#26088