Avoid unnecessary copy in TensorSource #8849

Open

lsy323 wants to merge 6 commits into master from lsiyuan/avoid-blocking-copy-tensorsource

Conversation

lsy323 (Collaborator) commented Mar 18, 2025

Avoid the at::Tensor copy in TensorSource when it is not necessary.

The copy operation is needed in two cases:

  1. On the XLA:GPU path, if the tensor is on a CUDA device, it needs to be copied to CPU and then passed to the PJRT runtime, which transfers it to the GPU. @ysiraichi found that passing a CUDA tensor directly to the PJRT runtime doesn't work, so the CPU roundtrip is used as a workaround.
  2. On the XLA:TPU path, if the tensor is not contiguous, the copy is needed to make the memory contiguous, because PJRT takes a raw data pointer and expects the data to be contiguous.

When a copy is made, it needs to be blocking, since the transfer operation depends on the copied tensor. A sketch of this gating logic is given below.
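
A minimal sketch (editor's illustration, not the PR's actual code) of the gating described above, assuming an at::Tensor input and a target at::ScalarType; the function name is hypothetical:

    #include <ATen/ATen.h>

    // Illustrative only: copy exactly when one of the cases above applies
    // (CUDA source tensor, non-contiguous memory, or a dtype mismatch).
    at::Tensor PrepareForPjrtTransfer(at::Tensor tensor,
                                      at::ScalarType target_type) {
      const bool needs_copy = tensor.device().is_cuda() ||   // case 1
                              !tensor.is_contiguous() ||     // case 2
                              tensor.dtype() != target_type; // dtype cast
      if (!needs_copy) {
        return tensor;  // zero-copy fast path
      }
      // Blocking copy: the PJRT transfer reads from this buffer immediately.
      return tensor.to(
          at::TensorOptions().device(at::kCPU).dtype(target_type),
          /*non_blocking=*/false,
          /*copy=*/true, at::MemoryFormat::Contiguous);
    }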

lsy323 changed the title from "avoid unnecessary copy in tensorsource" to "Avoid unnecessary copy in TensorSource" on Mar 19, 2025
lsy323 force-pushed the lsiyuan/avoid-blocking-copy-tensorsource branch 2 times, most recently from 2b3b31f to 636a787, on March 19, 2025 at 17:24
lsy323 force-pushed the lsiyuan/avoid-blocking-copy-tensorsource branch from 636a787 to e483f51 on March 19, 2025 at 17:25
lsy323 marked this pull request as ready for review on March 19, 2025 at 17:26
lsy323 requested review from ysiraichi and yaochengji on March 19, 2025 at 17:26
lsy323 (Collaborator, Author) commented Mar 19, 2025

Hi @ysiraichi, just following up on our offline discussion about the copy operation. PTAL at the PR, thanks!

ysiraichi (Collaborator) left a comment

As a side note, we can use the DLPack machinery for doing the CUDA to XLA:CUDA transfer (that wasn't implemented at the time I worked on this). I will open an issue for this.

Comment on lines 57 to 73

    // The purposes of the copy are:
    // 1. Ensure the memory is contiguous, which is expected by PJRT.
    // 2. Move CUDA tensor to CPU since we cannot pass CUDA memory to PJRT now.
    // 3. Cast data type.
    // We can avoid the copy if it is not needed.
    if (tensor.device() == at::kCPU && tensor.is_contiguous() &&
        tensor.dtype() == target_torch_type) {
      tensor_ = std::move(tensor);
    } else {
      // TODO(ysiraichi): check, first, if tensor lives in a device that the
      // current PjRt client has access. If so, we don't need to go through the
      // CPU.
      tensor_ = std::move(tensor.to(
          at::TensorOptions().device(at::kCPU).dtype(target_torch_type),
          /*non_blocking=*/false,
          /*copy=*/true, at::MemoryFormat::Contiguous));
    }
ysiraichi (Collaborator) commented:

As far as I understand it, tensor.to(...) (without the copy argument) already checks whether it should actually copy or not. So, what do you think of reverting to the old tensor.to(...) usage, but removing the copy argument, instead?
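
For reference, a minimal editor's sketch of this suggestion, assuming the ATen C++ overload Tensor::to(TensorOptions, bool non_blocking = false, bool copy = false, optional<MemoryFormat>); the helper name is hypothetical:

    #include <ATen/ATen.h>

    // Illustrative only: with copy left at its default of false, .to() returns
    // the input tensor unchanged when the device, dtype, and memory format
    // already match, and performs the copy otherwise.
    at::Tensor ToCpuContiguous(const at::Tensor& tensor,
                               at::ScalarType target_type) {
      return tensor.to(
          at::TensorOptions().device(at::kCPU).dtype(target_type),
          /*non_blocking=*/false,
          /*copy=*/false, at::MemoryFormat::Contiguous);
    }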

lsy323 (Collaborator, Author) replied:

Hi @ysiraichi, I couldn't find a tensor.to(...) call without the copy arg in C++. Is it only in Python?
