feat: Add a scoring plugin to distribute new groups evenly #357
vMaroon merged 18 commits into llm-d:main
Conversation
bc08bee to 5ee44e1
Pull Request Overview
This PR adds a new scoring plugin called "no-hit-lru-scorer" that helps distribute new groups evenly across pods by tracking which pods have least recently handled cache-miss requests. The plugin uses an LRU cache to give higher scores to pods that haven't recently served new groups, helping to balance cache growth.
Key changes:
- Implements NoHitLRU scorer that tracks pod usage for cache-miss requests
- Adds comprehensive test coverage for the new plugin functionality
- Integrates the plugin into the registration system with sample configuration
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pkg/plugins/scorer/no_hit_lru.go | Core implementation of the NoHitLRU scorer plugin |
| pkg/plugins/scorer/no_hit_lru_test.go | Comprehensive test suite covering functionality and edge cases |
| pkg/plugins/register.go | Registers the new plugin in the plugin system |
| deploy/config/sim-epp-no-hit-lru.yaml | Sample configuration demonstrating plugin usage |
1a56bda to 1e04f9d
vMaroon left a comment
Overall looks good - a few comments.
pkg/plugins/scorer/no_hit_lru.go (Outdated)
// - LRU ordering is with respect to when a pod last received a cold request.
// - Least recently used (or never used) pods get highest score (1.0)
// - Most recently used pods get lowest score (approaching 0.0)
func (s *NoHitLRU) Score(ctx context.Context, cycleState *types.CycleState, request *types.LLMRequest, pods []types.Pod) map[types.Pod]float64 {
I think it might help readability if the main logical blocks in this call were separated into different functions.
// Read prefix cache state to determine if this is a cold request
// This is treated as an optimization - if the state isn't available, we assume cold request
prefixState, err := types.ReadCycleStateKey[*prefix.SchedulingContextState](cycleState, plugins.StateKey(s.prefixPluginName))
Not a blocker but would be great: do you think you can extend and support the same optimization for the precise-prefix-cache-scorer too? I personally am eager to use this scorer with the latter.
It is also possible that this optimization can be achieved with unequal weighting (e.g., ratio of 2:1 between prefix-cache scorers and this one) - but could also be fragile in some cases.
Great idea. Maybe I can add this in a follow up PR? I'm a bit worried about mixing concerns.
pkg/plugins/scorer/no_hit_lru.go (Outdated)
type NoHitLRU struct {
	typedName plugins.TypedName
	lruCache  *lru.Cache[string, struct{}] // pod name -> dummy value (we only care about order)
	mutex     *sync.RWMutex
The lru.Cache package is thread-safe; its Add and Keys calls are protected by an internal lock. The use of this mutex is therefore purely for snapshot semantics: if Keys() is called, the mutex makes sure no other request progresses before the scoring. It might be good to test the effect of this locking and profile the overhead as well.
Only with unit tests, but I did do a benchmark and added a test with multiple threads. The results support removing the mutex, imo. Being stale by 1ms on rare occasions shouldn't significantly break the intent behind this (cache balancing). Here are the benchmark results:
https://gist.github.com/usize/b936f67655029c05ba701218c1bfe5da
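For reference, a minimal sketch of what such a lock-vs-no-lock benchmark could look like (not the code behind the gist), assuming the `github.com/hashicorp/golang-lru/v2` cache used in the struct above; the `scorer` package placement and key patterns are assumptions:

```go
package scorer

import (
	"strconv"
	"sync"
	"testing"

	lru "github.com/hashicorp/golang-lru/v2"
)

// Measures the Add+Keys hot path when guarded by an extra RWMutex
// (strict snapshot semantics) on top of the cache's internal lock.
func BenchmarkNoHitLRUWithMutex(b *testing.B) {
	cache, err := lru.New[string, struct{}](1024)
	if err != nil {
		b.Fatal(err)
	}
	var mu sync.RWMutex
	b.RunParallel(func(pb *testing.PB) {
		i := 0
		for pb.Next() {
			mu.Lock()
			cache.Add("pod-"+strconv.Itoa(i%2048), struct{}{})
			_ = cache.Keys()
			mu.Unlock()
			i++
		}
	})
}

// Measures the same path relying only on lru.Cache's internal locking;
// Keys() may observe a slightly stale ordering relative to concurrent Adds.
func BenchmarkNoHitLRUWithoutMutex(b *testing.B) {
	cache, err := lru.New[string, struct{}](1024)
	if err != nil {
		b.Fatal(err)
	}
	b.RunParallel(func(pb *testing.PB) {
		i := 0
		for pb.Next() {
			cache.Add("pod-"+strconv.Itoa(i%2048), struct{}{})
			_ = cache.Keys()
			i++
		}
	})
}
```

Comparing the two with something like `go test -bench=NoHitLRU -cpu=1,8` should show whether the extra RWMutex adds measurable contention.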
@usize looks good overall, thank you for the contribution! I think we're just missing updates to the architecture doc, which contains info about plugins and their configuration - then good to go.
TYSM, I'll follow up with the "phase 2" that the task described, and also file an issue and make a patch for the improvement mentioned here: #357 (comment) 🙏🏻
- `lruSize` (optional): The maximum number of pods to track in the LRU cache. Defaults to 1024.
**Note:** This scorer is designed to work alongside a prefix cache scorer (such as `prefix-cache-scorer` or
`precise-prefix-cache-scorer`). If no prefix cache state is available, all requests are treated as cold.
I think following with a yaml example of what a configuration looks like is helpful here. Additionally, should state that when integrating with a prefix-cache scorer, the latter should be defined first in the scheduling profile.
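For instance, a sketch of what such an example might look like (the `apiVersion`, `kind`, and parameter layout here are assumptions; the sample added in this PR at deploy/config/sim-epp-no-hit-lru.yaml is the authoritative reference), with the prefix-cache scorer defined before `no-hit-lru-scorer` in the scheduling profile:

```yaml
# Hypothetical config sketch: field names follow the EPP plugin-config style,
# exact schema may differ from the real sample in deploy/config/.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
- type: prefix-cache-scorer
- type: no-hit-lru-scorer
  parameters:
    lruSize: 1024
schedulingProfiles:
- name: default
  plugins:
  - pluginRef: prefix-cache-scorer   # the prefix-cache scorer comes first
  - pluginRef: no-hit-lru-scorer
```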
Great work @usize, thank you! /lgtm
@usize can you verify the commits? See this for more info https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification |
When a request results in no cache hit, it opens a new group. New groups are likely to disproportionately grow the cache where they land. To avoid uneven cache growth, a new scorer prefers to send new groups to pods that have least recently handled a request from a new group. Signed-off-by: usize <mofoster@redhat.com>
Signed-off-by: usize <mofoster@redhat.com>
Signed-off-by: usize <mofoster@redhat.com>
Signed-off-by: usize <mofoster@redhat.com>
If no prefix cache state is available, treat all requests as cold. This will still allow for some benefit in terms of balancing. Signed-off-by: usize <mofoster@redhat.com>
Instead of inspecting the state of the LRU cache directly, it's possible to verify behavior via the usual scoring API. Signed-off-by: usize <mofoster@redhat.com>
The LRU Cache's size should be included in its error message. Signed-off-by: usize <mofoster@redhat.com>
Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com> Signed-off-by: usize <mofoster@redhat.com>
…llm-d#358) Signed-off-by: Kay Yan <kay.yan@daocloud.io> Co-authored-by: Nir Rozenbaum <nirro@il.ibm.com> Signed-off-by: usize <mofoster@redhat.com>
Signed-off-by: usize <mofoster@redhat.com>
Signed-off-by: Kellen Swain <kfswain@google.com> Signed-off-by: usize <mofoster@redhat.com>
Signed-off-by: usize <mofoster@redhat.com>
Signed-off-by: usize <mofoster@redhat.com>
Signed-off-by: usize <mofoster@redhat.com>
* bump llm-d-kv-cache-manager version (v0.3.2-rc1) Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com> * bump kvc mgr version Signed-off-by: Maroon Ayoub <Maroonay@gmail.com> --------- Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com> Signed-off-by: Maroon Ayoub <Maroonay@gmail.com> Signed-off-by: usize <mofoster@redhat.com>
Signed-off-by: usize <mofoster@redhat.com>
…llm-d#368) Bumps the go-dependencies group with 1 update: [github.com/onsi/ginkgo/v2](https://github.com/onsi/ginkgo). Updates `github.com/onsi/ginkgo/v2` from 2.25.3 to 2.26.0 - [Release notes](https://github.com/onsi/ginkgo/releases) - [Changelog](https://github.com/onsi/ginkgo/blob/master/CHANGELOG.md) - [Commits](onsi/ginkgo@v2.25.3...v2.26.0) --- updated-dependencies: - dependency-name: github.com/onsi/ginkgo/v2 dependency-version: 2.26.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
The plugin uses an LRU cache to assign higher scores to pods that have less recently handled a request matching no existing prefixes (thus, a new group). This should help even out cache growth, as new groups are likely to add more new blocks to the cache than existing groups.
For simplicity, this implementation relies on state produced by the prefix-cache-scorer plugin to track whether a request hits or misses the cache. Because of that, prefix-cache-scorer is required for this to work properly in production.
Fixes: #346
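For illustration only (not the PR's actual code): a self-contained sketch of the idea, assuming the `github.com/hashicorp/golang-lru/v2` cache shown in the review snippets above. The real plugin decides cold vs. hit by reading the prefix-cache-scorer state from the scheduling cycle state; this sketch reduces that to a boolean, and the exact score formula and hit-path behavior are assumptions.

```go
package main

import (
	"fmt"

	lru "github.com/hashicorp/golang-lru/v2"
)

// noHitLRU is a simplified stand-in for the plugin: an LRU of pod names,
// ordered by when each pod last received a cache-miss ("cold") request.
type noHitLRU struct {
	cache *lru.Cache[string, struct{}]
}

// score prefers pods that least recently opened a new group for cold
// requests; for cache hits it stays neutral (uniform scores - an assumption,
// the real plugin's hit-path behavior may differ).
func (s *noHitLRU) score(cold bool, pods []string) map[string]float64 {
	scores := make(map[string]float64, len(pods))
	if !cold {
		for _, p := range pods {
			scores[p] = 0
		}
		return scores
	}

	keys := s.cache.Keys() // golang-lru/v2 returns keys ordered oldest -> newest
	rank := make(map[string]int, len(keys))
	for i, k := range keys {
		rank[k] = i
	}
	for _, p := range pods {
		if i, ok := rank[p]; ok {
			scores[p] = 1.0 - float64(i+1)/float64(len(keys)+1) // more recent -> closer to 0
		} else {
			scores[p] = 1.0 // never handled a cold request yet
		}
	}
	return scores
}

// markCold records that a pod just received a cold request, moving it to the
// most-recently-used end of the LRU.
func (s *noHitLRU) markCold(pod string) { s.cache.Add(pod, struct{}{}) }

func main() {
	c, _ := lru.New[string, struct{}](1024)
	s := &noHitLRU{cache: c}
	s.markCold("pod-a") // pod-a opened a group first
	s.markCold("pod-b") // then pod-b
	fmt.Println(s.score(true, []string{"pod-a", "pod-b", "pod-c"}))
	// pod-c (never cold) scores 1.0; pod-a scores higher than pod-b
}
```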