
[distributed][perf] ensure that all decoding ops are happening on gpu with no cpu sync #1147

Open
@lessw2020

Description


🐛 Describe the bug

Per @kwen2501, regarding the decoding step:

next_token = torch.tensor([decode_results[0][0]], device=device)

"nit: I am not sure if the use of torch.tensor here would cause a sync from GPU to CPU (to get the scalar) then move to the GPU again (to create the tensor).
If there is no use of next_token in CPU domain, better to just use index op here.

Or, is decode_results already on CPU? Hmm, then we'd need to think about how to arrange these CPU ops and GPU ops. Ideally, you would like to fire the send right after step()."
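A minimal sketch of the index-based alternative, assuming decode_results is (or can be kept as) a token-ID tensor already resident on the GPU; the names and shapes below are illustrative, not the repo's actual decode output layout:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in for the decode output: a (batch, seq) tensor of
# token IDs already on the device. The real decode_results layout may differ.
decode_results = torch.randint(0, 32000, (1, 1), device=device)

# Sync-prone pattern: extracting a Python scalar forces a GPU -> CPU copy,
# and torch.tensor() then copies the value back to the GPU.
# next_token = torch.tensor([decode_results[0][0]], device=device)

# Index/slice instead: the result is a 1-element tensor that stays on the
# device, so no host round trip blocks the subsequent send.
next_token = decode_results[0, 0:1]

assert next_token.device.type == device.type and next_token.shape == (1,)

If decode_results stays on the GPU like this, the pipeline send can be issued right after step() without waiting on a host synchronization.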

Versions

n/a


