Use multiple logical devices to handle ClientGenerateBatchProcess #1403


Draft: wants to merge 21 commits into base: main

Conversation

@dezhiAmd dezhiAmd commented May 7, 2025

  • Why
    The decode process does not use the full capacity of the GPU. The goal of this change is to add multiple logical devices so that multiple requests (ClientGenerateBatchProcess instances) can be handled by the logical devices in round-robin fashion.

  • How

  1. Set the environment variable SHORTFIN_AMDGPU_LOGICAL_DEVICES_PER_PHYSICAL_DEVICE to the number of workers specified in the input arguments (for example, --workers 2), so the system creates one logical device per worker.

  2. Add a generate_count field to ClientGenerateBatchProcess to cache the total number of requests and use it to select which device handles each request.
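The two steps above can be sketched roughly as follows. The class name and the generate_count counter mirror the PR description; everything else (the device list, pick_device, the string device names) is an illustrative placeholder, not the actual shortfin API:

```python
import itertools
import os

# Step 1 (from the PR description): the env var below is the one named in
# the description; num_workers stands in for the --workers argument.
num_workers = 2
os.environ["SHORTFIN_AMDGPU_LOGICAL_DEVICES_PER_PHYSICAL_DEVICE"] = str(num_workers)


class ClientGenerateBatchProcess:
    # Step 2: a class-level counter caching the total number of requests
    # seen so far across all instances.
    generate_count = itertools.count()

    def __init__(self, devices):
        # `devices` is a placeholder for the system's logical device handles.
        self.devices = devices

    def pick_device(self):
        # Round-robin: the N-th request goes to device N mod len(devices).
        n = next(type(self).generate_count)
        return self.devices[n % len(self.devices)]


# With two logical devices, consecutive requests alternate between them.
proc = ClientGenerateBatchProcess(devices=["gpu0", "gpu1"])
print([proc.pick_device() for _ in range(4)])  # → ['gpu0', 'gpu1', 'gpu0', 'gpu1']
```

A class-level itertools.count is used here so the counter tracks requests across all process instances, matching the "total requests" wording in step 2.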

dezhliao and others added 15 commits April 24, 2025 09:35
Signed-off-by: dezhliao <[email protected]>
@dezhiAmd dezhiAmd force-pushed the ping_pong branch 2 times, most recently from 9cfade7 to 9b096a3 on May 7, 2025 00:30
Contributor

@vinayakdsci vinayakdsci left a comment


cc @stellaraccident @daveliddell.

@dezhiAmd there is an assumption in the code that we should always queue kernel invocations on both streams, effectively replicating the work. I don't think that is what we want to do. Round-robin assignment has the potential to be expensive, especially when the server starts receiving many requests at the same time.

There are many ways multiple streams can be used on the same physical device that make execution much faster than the conventional replicate-and-invoke-identical-work approach.
IMO, the underlying idea behind the patch is correct, but I would rethink the implementation. Ideally it should not make assumptions, and we could build a framework that allows smarter queuing onto the streams.

We also need to have a safety mechanism in place that ensures that we do not cross address boundaries in case a user runs with multiple physical devices visible to the System.
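One way to read the safety concern above is to verify, before queuing work, that every logical device in play maps back to the same physical device. A minimal generic sketch, assuming a plain dict mapping (the mapping structure and function name are illustrative, not shortfin's API):

```python
def validate_logical_devices(logical_to_physical):
    """Raise if the given logical devices span more than one physical device.

    `logical_to_physical` maps a logical device id to the physical device
    it was created on (illustrative data structure, not shortfin's).
    """
    physical = set(logical_to_physical.values())
    if len(physical) > 1:
        raise RuntimeError(
            f"logical devices span multiple physical devices: {sorted(physical)}"
        )


# A valid configuration: two logical devices, both on physical device 0.
validate_logical_devices({"logical0": 0, "logical1": 0})
```

Such a guard would run once at system startup, so it adds no per-request cost.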

Signed-off-by: dezhliao <[email protected]>