
Basic implementation of dynamic LoRA adapters placement, based on shuffle sharding algorithm #720

Draft

dmitripikus wants to merge 6 commits into llm-d:main from dmitripikus:loras-shuffle-sharding

Conversation

@dmitripikus
Contributor

Here is a basic implementation of dynamic LoRA adapter placement, based on the shuffle sharding algorithm.
This implementation is for experimentation and feedback.

At this point:

  • The solution is implemented as a scorer that causes LoRA adapters to be loaded (and to serve requests) at specific vLLM endpoints
  • Re-balancing of LoRA adapters in case of endpoint restarts or scaling up/down will be added in a follow-up PR
  • Currently the base model is passed to the scorer as a parameter; base model discovery will be added in a follow-up PR

Signed-off-by: Dmitri Pikus <DPIKUS@il.ibm.com>
@ahg-g

ahg-g commented Mar 15, 2026

What kind of validation and benchmarking are we doing to validate this algorithm? What is the baseline to compare against?

@dmitripikus
Contributor Author

Hi @ahg-g, thanks for your comment!
I responded in #709 (comment)

@hexfusion
Contributor

I understand this PR is experimental for benchmarking; I added a few things to consider as you iterate.

shardCacheMu sync.RWMutex // Protects shardCache
cachedShardSize int // Cached calculated shard size
cachedEndpointCount int // Number of endpoints for cached shard size
shardSizeMu sync.RWMutex // Protects shard size cache
Contributor

Nit: these two caches are coupled, since shard assignments depend on shard size. A single mutex would make invalidation simpler when you add rebalancing.

Contributor Author

Fixed. Thank you!


// Take the top shardSize endpoints from the shuffled list
result := make([]scheduling.Endpoint, shardSize)
copy(result, shuffled[:shardSize])
Contributor

Nit: logically we are making 3 allocations and 2 full copies of the endpoint list. How big could this list get?

Contributor Author

Fixed. The fix reduces getShardForAdapter from 3 allocations + 2 full copies down to 1 allocation + 1 copy by sorting and shuffling in-place on a single copy, then returning a sub-slice.
The list could reach hundreds of vLLM pods in a production cluster, so redundant allocations are wasteful.
Thank you!

@kfswain
Collaborator

kfswain commented Mar 16, 2026

Hyper nit: can we move this PR to draft or add a WIP tag?

I think it's great to have it open so we can discuss the algo, but while we are still in development it would be good to signal to others that this is WIP.

@dmitripikus dmitripikus marked this pull request as draft March 17, 2026 11:16
…Count

Signed-off-by: Dmitri Pikus <DPIKUS@il.ibm.com>
@elevran elevran added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 17, 2026
@github-actions

github-actions bot commented Apr 8, 2026

This PR is marked as stale after 21d of inactivity. After an additional 14d of inactivity (7d to become rotten, then 7d more), it will be closed. To prevent this PR from being closed, add a comment or remove the lifecycle/stale label.
