Basic implementation of dynamic LoRA adapters placement, based on shuffle sharding algorithm #720
dmitripikus wants to merge 6 commits into llm-d:main
Conversation
Signed-off-by: Dmitri Pikus <DPIKUS@il.ibm.com>
What kind of validation and benchmarking are we doing to validate this algorithm? What is the baseline to compare against?

Hi @ahg-g, thanks for your comment!
hexfusion left a comment
I understand this PR is experimental for benchmarking; I added a few things to consider as you iterate.
pkg/plugins/scorer/lora_aware.go
Outdated
shardCacheMu sync.RWMutex // Protects shardCache
cachedShardSize int // Cached calculated shard size
cachedEndpointCount int // Number of endpoints for cached shard size
shardSizeMu sync.RWMutex // Protects shard size cache
Nit: these two caches are coupled; shard assignments depend on shard size. A single mutex would make invalidation simpler when you add rebalancing.

Fixed. Thank you!
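The suggested consolidation could look roughly like the sketch below. This is illustrative only, not the PR's actual struct: the field names beyond those quoted in the diff (and the `map[string][]string` shape of `shardCache`) are assumptions. The point is that one mutex guards both coupled caches, so invalidation clears them atomically with no window where one is stale.

```go
package main

import "sync"

// shardState groups the two coupled caches behind a single mutex, so a
// change to the endpoint set invalidates the cached shard size and the
// shard assignments together. Field names and the shardCache value type
// are illustrative, not taken from the PR.
type shardState struct {
	mu                  sync.RWMutex
	shardCache          map[string][]string // adapter name -> assigned endpoints
	cachedShardSize     int                 // cached calculated shard size
	cachedEndpointCount int                 // endpoint count the shard size was computed for
}

// invalidate atomically clears both caches when the endpoint set changes.
func (s *shardState) invalidate() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.shardCache = make(map[string][]string)
	s.cachedShardSize = 0
	s.cachedEndpointCount = 0
}
```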
pkg/plugins/scorer/lora_aware.go
Outdated
// Take the top shardSize endpoints from the shuffled list
result := make([]scheduling.Endpoint, shardSize)
copy(result, shuffled[:shardSize])
Nit: logically we are making 3 allocations and 2 full copies of the endpoint list. How big could this list get?
Fixed. The fix reduces getShardForAdapter from 3 allocations + 2 full copies down to 1 allocation + 1 copy by sorting and shuffling in-place on a single copy, then returning a sub-slice.
The list could reach hundreds of vLLM pods in a production cluster, so redundant allocations are wasteful.
Thank you!
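The single-copy approach described above could be sketched as follows. This is a hedged reconstruction, not the PR's code: `Endpoint` is a stand-in for the real `scheduling.Endpoint`, and the FNV-hash seeding of the per-adapter shuffle is an assumption about how the deterministic shuffle might work. What it does show is the stated shape of the fix: one allocation, one copy, in-place sort and shuffle, and a returned sub-slice.

```go
package main

import (
	"hash/fnv"
	"math/rand"
	"sort"
)

// Endpoint is a stand-in for scheduling.Endpoint.
type Endpoint struct{ Name string }

// getShardForAdapter sketches the optimized version: one allocation and
// one copy (so the caller's slice is never mutated), then an in-place
// sort and a deterministic per-adapter shuffle, returning a sub-slice.
func getShardForAdapter(adapter string, endpoints []Endpoint, shardSize int) []Endpoint {
	if shardSize > len(endpoints) {
		shardSize = len(endpoints)
	}
	// Single allocation + full copy.
	working := make([]Endpoint, len(endpoints))
	copy(working, endpoints)

	// Sort in place so the result is stable regardless of input order.
	sort.Slice(working, func(i, j int) bool { return working[i].Name < working[j].Name })

	// Seed the shuffle from the adapter name so the same adapter always
	// maps to the same shard (the shuffle-sharding property).
	h := fnv.New64a()
	h.Write([]byte(adapter))
	rng := rand.New(rand.NewSource(int64(h.Sum64())))
	rng.Shuffle(len(working), func(i, j int) { working[i], working[j] = working[j], working[i] })

	// Return a sub-slice: no further allocation or copy.
	return working[:shardSize]
}
```

With hundreds of vLLM pods per cluster, as noted above, dropping the two extra allocations per scoring call is the motivation for this shape.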
Hyper nit: can we move this PR to draft or add a WIP tag? I think it's great to have it open so we can discuss the algorithm, but while it's still in development it would be good to signal to others that this is WIP.
…Count Signed-off-by: Dmitri Pikus <DPIKUS@il.ibm.com>
This PR is marked as stale after 21d of inactivity. After an additional 14d of inactivity (7d to become rotten, then 7d more), it will be closed. To prevent this PR from being closed, add a comment or remove the
Here is a basic implementation of dynamic LoRA adapter placement, based on the shuffle sharding algorithm.
This implementation is for experimentation and feedback.
At this point: