Use multiple logical devices to handle ClientGenerateBatchProcess #1403
base: main
Signed-off-by: dezhliao <[email protected]>
This reverts commit b9b74f7.
…n logical device for now Signed-off-by: dezhliao <[email protected]>
…evices Signed-off-by: dezhliao <[email protected]>
…est coming in Signed-off-by: dezhliao <[email protected]>
…n select logical device Signed-off-by: dezhliao <[email protected]>
9cfade7 to 9b096a3
cc @stellaraccident @daveliddell.
@dezhiAmd the code assumes that we should always queue kernel invocations on both streams, akin to replicating the work. I don't think that is what we want to do. Round-robin assignment has the potential to be expensive, especially when the server starts receiving many requests at the same time.
There are many ways to use multiple streams on the same physical device that can make execution much faster than the conventional replicate-and-invoke-identical-work idea.
IMO, the underlying idea behind the patch is correct, but I would rethink the implementation. Ideally it should not make assumptions, and we could build a framework that allows us to do smarter queuing onto the streams.
We also need a safety mechanism that ensures we do not cross address boundaries when a user runs with multiple physical devices visible to the system.
Why
The decode process does not use the full capacity of the GPU. The goal of this change is to add multiple logical devices so that multiple requests (ClientGenerateBatchProcess instances) can be handled by the logical devices in a round-robin fashion.
How
The environment variable "SHORTFIN_AMDGPU_LOGICAL_DEVICES_PER_PHYSICAL_DEVICE" is set to the number of workers given in the input arguments (for example, --workers 2), so that the system creates one logical device for each worker.
Add generate_count to class ClientGenerateBatchProcess to cache the total number of requests, and use it to control which logical device is selected.
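
For readers skimming the description, the two steps above could look roughly like the sketch below. This is not the code in this PR: `configure_logical_devices` and `RoundRobinSelector` are hypothetical names used only for illustration, and the sketch assumes the environment variable has to be set before the system is created. Only the environment variable name, the `--workers` flag, and the `generate_count` counter come from the description.

```python
import argparse
import os

# Environment variable named in the PR description.
ENV_VAR = "SHORTFIN_AMDGPU_LOGICAL_DEVICES_PER_PHYSICAL_DEVICE"


def configure_logical_devices(argv):
    """Parse --workers and export it as the logical-device count.

    Hypothetical helper: assumes the variable must be set before the
    system is created, so it should be called early during startup.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--workers", type=int, default=1)
    args = parser.parse_args(argv)
    os.environ[ENV_VAR] = str(args.workers)
    return args.workers


class RoundRobinSelector:
    """Hypothetical stand-in for the generate_count logic in the PR."""

    def __init__(self, num_devices):
        if num_devices < 1:
            raise ValueError("num_devices must be >= 1")
        self.num_devices = num_devices
        self.generate_count = 0  # total number of requests seen so far

    def next_device(self):
        """Return the index of the logical device for the next request."""
        index = self.generate_count % self.num_devices
        self.generate_count += 1
        return index


if __name__ == "__main__":
    workers = configure_logical_devices(["--workers", "2"])
    selector = RoundRobinSelector(workers)
    for request_id in range(4):
        print(f"request {request_id} -> logical device {selector.next_device()}")
```

The counter is a plain integer with no locking, so the sketch assumes requests are dispatched from a single thread or event loop. A modulo counter like this ignores how busy each logical device is, which is the limitation the review comment above raises.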