Basic implementation of dynamic LoRA adapters placement, based on shuffle sharding algorithm #720
dmitripikus wants to merge 6 commits into llm-d:main
Conversation
Signed-off-by: Dmitri Pikus <DPIKUS@il.ibm.com>
What kind of validation and benchmarking are we doing to validate this algorithm? What is the baseline to compare against?

Hi @ahg-g, thanks for your comment!
hexfusion left a comment
I understand this PR is experimental for benchmarking; I added a few things to consider as you iterate.
pkg/plugins/scorer/lora_aware.go
Outdated
shardCacheMu sync.RWMutex // Protects shardCache
cachedShardSize int // Cached calculated shard size
cachedEndpointCount int // Number of endpoints for cached shard size
shardSizeMu sync.RWMutex // Protects shard size cache
Nit: these two caches are coupled; shard assignments depend on shard size. A single mutex would make invalidation simpler when you add rebalancing.

Fixed. Thank you!
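The suggested consolidation could look roughly like the sketch below. This is illustrative only, not the PR's actual struct: the field names beyond those quoted in the diff (and the `map[string][]string` shape of `shardCache`) are assumptions. The point is that one mutex guards both coupled caches, so invalidation clears them atomically with no window where one is stale.

```go
package main

import "sync"

// shardState groups the two coupled caches behind a single mutex, so a
// change to the endpoint set invalidates the cached shard size and the
// shard assignments together. Field names and the shardCache value type
// are illustrative, not taken from the PR.
type shardState struct {
	mu                  sync.RWMutex
	shardCache          map[string][]string // adapter name -> assigned endpoints
	cachedShardSize     int                 // cached calculated shard size
	cachedEndpointCount int                 // endpoint count the shard size was computed for
}

// invalidate atomically clears both caches when the endpoint set changes.
func (s *shardState) invalidate() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.shardCache = make(map[string][]string)
	s.cachedShardSize = 0
	s.cachedEndpointCount = 0
}
```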
pkg/plugins/scorer/lora_aware.go
Outdated
// Take the top shardSize endpoints from the shuffled list
result := make([]scheduling.Endpoint, shardSize)
copy(result, shuffled[:shardSize])
Nit: logically we are making 3 allocations and 2 full copies of the endpoint list. How big could this list get?
Fixed. The fix reduces getShardForAdapter from 3 allocations + 2 full copies down to 1 allocation + 1 copy by sorting and shuffling in-place on a single copy, then returning a sub-slice.
The list could reach hundreds of vLLM pods in a production cluster, so redundant allocations are wasteful.
Thank you!
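The single-copy approach described above could be sketched as follows. This is a hedged reconstruction, not the PR's code: `Endpoint` is a stand-in for the real `scheduling.Endpoint`, and the FNV-hash seeding of the per-adapter shuffle is an assumption about how the deterministic shuffle might work. What it does show is the stated shape of the fix: one allocation, one copy, in-place sort and shuffle, and a returned sub-slice.

```go
package main

import (
	"hash/fnv"
	"math/rand"
	"sort"
)

// Endpoint is a stand-in for scheduling.Endpoint.
type Endpoint struct{ Name string }

// getShardForAdapter sketches the optimized version: one allocation and
// one copy (so the caller's slice is never mutated), then an in-place
// sort and a deterministic per-adapter shuffle, returning a sub-slice.
func getShardForAdapter(adapter string, endpoints []Endpoint, shardSize int) []Endpoint {
	if shardSize > len(endpoints) {
		shardSize = len(endpoints)
	}
	// Single allocation + full copy.
	working := make([]Endpoint, len(endpoints))
	copy(working, endpoints)

	// Sort in place so the result is stable regardless of input order.
	sort.Slice(working, func(i, j int) bool { return working[i].Name < working[j].Name })

	// Seed the shuffle from the adapter name so the same adapter always
	// maps to the same shard (the shuffle-sharding property).
	h := fnv.New64a()
	h.Write([]byte(adapter))
	rng := rand.New(rand.NewSource(int64(h.Sum64())))
	rng.Shuffle(len(working), func(i, j int) { working[i], working[j] = working[j], working[i] })

	// Return a sub-slice: no further allocation or copy.
	return working[:shardSize]
}
```

With hundreds of vLLM pods per cluster, as noted above, dropping the two extra allocations per scoring call is the motivation for this shape.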
Hyper nit: can we move this PR to draft or add a WIP tag? I think it's great to have it open so we can discuss the algorithm, but while it's still in development it would be good to signal to others that this is WIP.
…Count Signed-off-by: Dmitri Pikus <DPIKUS@il.ibm.com>
This PR is marked as stale after 21d of inactivity. After an additional 14d of inactivity (7d to become rotten, then 7d more), it will be closed. To prevent this PR from being closed, add a comment or remove the
Here is a basic implementation of dynamic LoRA adapter placement, based on the shuffle sharding algorithm.
This implementation is for experimentation and feedback.
At this point: