[serve.llm] Prefix-aware scheduler [1/N] Adding Prefix-aware tree data structure #52747

Merged: kouroshHakha merged 15 commits into ray-project:master on May 12, 2025

Conversation

@jujipotle jujipotle commented May 2, 2025

Why are these changes needed?

The default replica scheduler in ray.serve uses a Power-of-Two Choices strategy, which does not take advantage of LLMs' KV cache behavior during scheduling. To improve inference speed in LLM workloads, we introduce a prefix-aware scheduler specifically for ray.serve.llm.
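For context, Power-of-Two Choices samples two candidate replicas at random and routes to the less loaded one. The following is a minimal sketch of that idea, not Ray Serve's actual scheduler; `pick_replica_pow2` and the `queue_len` mapping (replica id to in-flight request count) are hypothetical names used only for illustration.

```python
import random
from typing import Dict, Optional


def pick_replica_pow2(
    queue_len: Dict[str, int], rng: Optional[random.Random] = None
) -> str:
    """Sample up to two replicas at random and return the less loaded one."""
    if not queue_len:
        raise ValueError("no replicas available")
    rng = rng or random.Random()
    # Hypothetical input: replica id -> number of in-flight requests.
    candidates = rng.sample(sorted(queue_len), k=min(2, len(queue_len)))
    return min(candidates, key=lambda replica: queue_len[replica])
```

The choice depends only on load, so two requests that share a long prompt prefix can land on different replicas and miss each other's KV cache.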

This scheduler requires a data structure that can efficiently track the prefixes of text each replica (tenant) has processed — including when they were accessed — in order to route new requests to the most appropriate replica based on prefix match hit rate. We introduce a custom PrefixTree (approximate radix tree) data structure to serve this purpose, with efficient insertion, prefix matching, and LRU eviction.
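To make the idea concrete, here is a simplified sketch of per-tenant prefix tracking. It uses a plain character-level trie rather than the approximate radix tree in this PR, and the class names, signatures, and return shapes are assumptions for illustration, not the PR's API.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        # tenant -> last access timestamp for the prefix ending at this node
        self.tenant_last_access: Dict[str, float] = {}


class SimplePrefixTrie:
    """Character-level trie; a simplified stand-in, not the PR's PrefixTree."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, text: str, tenant: str, timestamp: float) -> None:
        # Unknown tenants are registered implicitly on their first insert.
        node = self.root
        for ch in text:
            node = node.children.setdefault(ch, TrieNode())
            node.tenant_last_access[tenant] = timestamp

    def prefix_match(
        self, text: str, tenants: Optional[List[str]] = None
    ) -> Tuple[str, List[str]]:
        """Return the longest prefix of `text` held by an allowed tenant, and its owners."""
        node, matched, depth = self.root, [], 0
        for ch in text:
            child = node.children.get(ch)
            if child is None:
                break
            owners = [
                t for t in child.tenant_last_access
                if tenants is None or t in tenants
            ]
            if not owners:
                break
            node, matched, depth = child, owners, depth + 1
        return text[:depth], sorted(matched)

    def remove_tenant(self, tenant: str) -> None:
        """Drop all state for `tenant` and prune nodes that no tenant owns."""
        def prune(node: TrieNode) -> None:
            for ch in list(node.children):
                child = node.children[ch]
                child.tenant_last_access.pop(tenant, None)
                prune(child)
                if not child.tenant_last_access:
                    del node.children[ch]

        prune(self.root)
```

A scheduler built on such a structure could route a request to one of the tenants returned by `prefix_match`. The recursive pruning is fine for a sketch but would need an iterative form for very long prompts.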

Features:

  • Automatic support for autoscaling
    When a new replica comes online, PrefixTree.insert() will automatically add it to the tree. When a replica is removed, PrefixTree.remove_tenant() deletes all associated state. This makes up-scaling and down-scaling seamless for the scheduler.
  • Multi-replica tracking
    The tree supports simultaneous tracking of all active replicas in a single shared structure. This allows the scheduler to make prefix-matching or eviction decisions across any subset of replicas while maintaining a single remote PrefixTree object.
  • Smallest-tenant lookup
    The tree exposes a method to identify the tenant (replica) with the smallest KV cache footprint, allowing the scheduler to implement fallback heuristics based on cache size.
  • Efficient LRU eviction
    To simulate eviction behavior inside a replica’s KV cache, the tree supports per-tenant LRU eviction using a doubly linked list. Insertions and evictions are O(1) per node.
  • Comprehensive unit tests
    The accompanying test_prefix_tree.py file provides a detailed and thorough test suite for the data structure, covering edge cases, eviction behavior, prefix matching, and structural invariants. Developers are encouraged to read the test cases to understand usage and guarantees.
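The per-tenant LRU eviction and smallest-tenant lookup described above can be sketched as follows. This sketch swaps in `collections.OrderedDict` for the hand-written doubly linked list, and tracks whole text chunks instead of tree nodes; the `TenantLRU` class and its `touch` method are hypothetical and exist only for illustration.

```python
from collections import OrderedDict
from typing import Dict, Optional


class TenantLRU:
    """Tracks cached text chunks per tenant in least-recently-used order."""

    def __init__(self) -> None:
        # tenant -> OrderedDict(chunk -> size in chars), oldest entry first
        self._chunks: Dict[str, "OrderedDict[str, int]"] = {}
        self._char_count: Dict[str, int] = {}

    def touch(self, tenant: str, chunk: str) -> None:
        """Record an access; new tenants are registered on first use."""
        entries = self._chunks.setdefault(tenant, OrderedDict())
        if chunk not in entries:
            entries[chunk] = len(chunk)
            self._char_count[tenant] = self._char_count.get(tenant, 0) + len(chunk)
        entries.move_to_end(chunk)  # O(1): mark as most recently used

    def evict_tenant_by_lru(self, tenant: str, min_remove_size: int) -> int:
        """Evict oldest chunks until at least `min_remove_size` chars are freed."""
        entries = self._chunks.get(tenant, OrderedDict())
        removed = 0
        while entries and removed < min_remove_size:
            _, size = entries.popitem(last=False)  # O(1): drop the oldest entry
            removed += size
        if tenant in self._char_count:
            self._char_count[tenant] -= removed
        return removed

    def smallest_tenant(self) -> Optional[str]:
        """Return the tenant with the fewest tracked characters, if any."""
        if not self._char_count:
            return None
        return min(self._char_count, key=self._char_count.get)
```

Eviction may free more than `min_remove_size` characters, because whole entries are removed at a time; the return value reports how many characters were actually dropped.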

This PR introduces a standalone PrefixTree module that will be integrated into serve.llm as part of the PrefixAwareReplicaScheduler in subsequent PRs.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

jujipotle added 5 commits May 2, 2025 10:35
Signed-off-by: Justin Ji <[email protected]>
Signed-off-by: Justin Ji <[email protected]>
Signed-off-by: Justin Ji <[email protected]>
Signed-off-by: Justin Ji <[email protected]>
@kouroshHakha kouroshHakha added the go add ONLY when ready to merge, run all tests label May 4, 2025

@kouroshHakha kouroshHakha left a comment

Cool. Please add a comprehensive description to your PR (see the linked example). I left a few comments.

I'll let @GeneDer and @eicherseiji review the tests.

@kouroshHakha kouroshHakha changed the title from "[serve.llm] Add prefix tree class as precursor for prefix-aware scheduler" to "[serve.llm] Prefix-aware scheduler [1/N] Adding Prefix-aware tree data structure" on May 4, 2025
@jujipotle jujipotle marked this pull request as ready for review May 5, 2025 17:07
@jujipotle jujipotle requested a review from a team as a code owner May 5, 2025 17:07

@GeneDer GeneDer left a comment

Great work! I just feel we shouldn't take the lock on the read path; locking should only be needed on the write path. Also, I remember Cody proposed using two trees for this during brainstorming. Was there a later discussion that concluded we should use one tree with locks? Is performance not impacted by this?


return removed_chars_len

def evict_tenant_by_LRU(self, tenant: str, min_remove_size: int) -> int:

Let's keep the method name all lowercase.

Suggested change
- def evict_tenant_by_LRU(self, tenant: str, min_remove_size: int) -> int:
+ def evict_tenant_by_lru(self, tenant: str, min_remove_size: int) -> int:

Also, who's the caller of this method? Should there be a background task running it?

jujipotle added 4 commits May 5, 2025 18:10
Signed-off-by: Justin Ji <[email protected]>
Signed-off-by: Justin Ji <[email protected]>
Signed-off-by: Justin Ji <[email protected]>
@hainesmichaelc hainesmichaelc added the community-contribution Contributed by the community label May 7, 2025
jujipotle added 4 commits May 7, 2025 13:04
Signed-off-by: Justin Ji <[email protected]>
Signed-off-by: Justin Ji <[email protected]>
Signed-off-by: Justin Ji <[email protected]>

@GeneDer GeneDer left a comment

The code looks so much nicer, great job! I'd recommend breaking the test cases up into individual tests instead of jamming them all into one. You can further organize them into test classes if you see fit: for example, a TestPrefixMatch class containing test_no_match, test_non_existing_prefix, and so on, so the tests are more manageable.
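As a rough illustration of the layout suggested here, the sketch below groups pytest-style tests into a class with one behavior per test. `TinyPrefixStore` is a hypothetical stand-in so the example is self-contained; it is not the PrefixTree from this PR, and the assertions do not reflect that class's actual return values.

```python
import os
from typing import Dict, List, Tuple


class TinyPrefixStore:
    """Hypothetical stand-in for the tree under test, for illustration only."""

    def __init__(self) -> None:
        self.prompts: Dict[str, List[str]] = {}

    def insert(self, text: str, tenant: str) -> None:
        self.prompts.setdefault(tenant, []).append(text)

    def prefix_match(self, text: str) -> Tuple[str, List[str]]:
        best, owners = 0, []
        for tenant, prompts in self.prompts.items():
            for prompt in prompts:
                # commonprefix compares character by character
                n = len(os.path.commonprefix([prompt, text]))
                if n > best:
                    best, owners = n, [tenant]
                elif n == best and n > 0 and tenant not in owners:
                    owners.append(tenant)
        return text[:best], sorted(owners)


class TestPrefixMatch:
    def setup_method(self) -> None:
        # pytest calls this before every test, so each starts from a fresh store
        self.store = TinyPrefixStore()

    def test_no_match(self) -> None:
        self.store.insert("hello", "tenant_1")
        assert self.store.prefix_match("world") == ("", [])

    def test_non_existing_prefix(self) -> None:
        assert self.store.prefix_match("anything") == ("", [])

    def test_shared_prefix_with_branches(self) -> None:
        self.store.insert("hello world", "tenant_1")
        self.store.insert("hello there", "tenant_2")
        assert self.store.prefix_match("hello") == (
            "hello",
            ["tenant_1", "tenant_2"],
        )
```

With this layout a failure points at a single named behavior, and no test depends on state left behind by another.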

assert h_node.edge_label_to_child.get("w").text == "world"
assert h_node.edge_label_to_child.get("t").text == "there"

# 4. Test that inserting a longer prompt with shared prefix doesn't create empty text nodes

I'd suggest breaking all of those out into multiple tests. That makes it easier to debug when something fails, and there's no need to reset state every time. Just make sure to give them meaningful names, like test_insert_long_prompt...

matched_text, matched_tenants = tree.prefix_match("application", ["tenant_3"])
assert matched_text == "" and matched_tenants is None

# 7. Test shared prefix matching with branches

Similarly, break those out into their own tests.

@kouroshHakha

@GeneDer please review this one more time.

@kouroshHakha kouroshHakha requested a review from GeneDer May 11, 2025 23:25

GeneDer commented May 11, 2025

> @GeneDer please review this one more time.

Yeah, still waiting on @jujipotle to address the comments.

@GeneDer GeneDer left a comment

Great work! LGTM!

@masoudcharkhabi masoudcharkhabi added the serve Ray Serve Related Issue label May 12, 2025
@kouroshHakha kouroshHakha removed the community-contribution Contributed by the community label May 12, 2025

@kouroshHakha kouroshHakha left a comment

STAMPING in the last round.

@kouroshHakha kouroshHakha merged commit ba1a8df into ray-project:master May 12, 2025
5 checks passed
ran1995data pushed a commit to ran1995data/ray that referenced this pull request May 13, 2025
zhaoch23 pushed a commit to Bye-legumes/ray that referenced this pull request May 14, 2025
iamjustinhsu pushed a commit to iamjustinhsu/ray that referenced this pull request May 15, 2025
lk-chen pushed a commit to lk-chen/ray that referenced this pull request May 17, 2025
vickytsang pushed a commit to ROCm/ray that referenced this pull request Jun 3, 2025
rebel-scottlee pushed a commit to rebellions-sw/ray that referenced this pull request Jun 21, 2025
Labels
  • community-backlog
  • go (add ONLY when ready to merge, run all tests)
  • performance
  • serve (Ray Serve Related Issue)
5 participants