[serve.llm] Prefix-aware scheduler [1/N] Adding Prefix-aware tree data structure #52747
Signed-off-by: Justin Ji <[email protected]>
Cool. Please add a comprehensive description to your PR (example). Left a few comments.
I'll let @GeneDer and @eicherseiji review the tests.
Great work! I just feel we shouldn't take the lock on the read path; it should only be needed on the write path. Also, I remember Cody proposed using two trees for this during brainstorming. Was there some other discussion that concluded we should use one tree with locks? Is performance not impacted by this?
        return removed_chars_len

    def evict_tenant_by_LRU(self, tenant: str, min_remove_size: int) -> int:
Let's keep the method name all lowercase.
- def evict_tenant_by_LRU(self, tenant: str, min_remove_size: int) -> int:
+ def evict_tenant_by_lru(self, tenant: str, min_remove_size: int) -> int:
Also who's the caller for this method? Should there be a background task running this?
The code looks so much nicer, great job! I'd recommend breaking the test cases up into individual tests instead of jamming them all into one. You can further organize them into test classes if you see fit. For example, a TestPrefixMatch class could hold multiple tests, such as test_no_match and test_non_existing_prefix, so the tests are more manageable.
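The layout suggested above might look roughly like the sketch below. It is illustrative only: `longest_stored_prefix` is a hypothetical stand-in for the code under test, not part of the PR, and the class simply groups one behavior per test method under the names mentioned in the comment.

```python
import os
from typing import List


def longest_stored_prefix(stored: List[str], query: str) -> str:
    """Hypothetical stand-in for the code under test (not the PR's PrefixTree).

    Returns the longest prefix of `query` shared with any stored string.
    """
    best = ""
    for text in stored:
        # os.path.commonprefix compares strings character by character.
        common = os.path.commonprefix([text, query])
        if len(common) > len(best):
            best = common
    return best


class TestPrefixMatch:
    """Each test sets up its own data, so a failure points at one behavior."""

    def test_no_match(self) -> None:
        assert longest_stored_prefix(["hello"], "world") == ""

    def test_non_existing_prefix(self) -> None:
        # Nothing has been stored yet, so nothing can match.
        assert longest_stored_prefix([], "hello") == ""

    def test_partial_match(self) -> None:
        stored = ["hello world", "hello there"]
        assert longest_stored_prefix(stored, "hello wor") == "hello wor"
```

pytest collects classes whose names start with `Test` and methods whose names start with `test_`, so each case runs and reports on its own without any shared state to reset.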
assert h_node.edge_label_to_child.get("w").text == "world"
assert h_node.edge_label_to_child.get("t").text == "there"

# 4. Test that inserting a longer prompt with shared prefix doesn't create empty text nodes
I'd suggest breaking all of these out into multiple tests. That makes it easier to debug if anything fails, and you don't need to reset the tree every time. Just make sure to give them meaningful names, like test_insert_long_prompt...
matched_text, matched_tenants = tree.prefix_match("application", ["tenant_3"])
assert matched_text == "" and matched_tenants is None

# 7. Test shared prefix matching with branches
Similarly, break those out into their own tests.
… because of pickle error Signed-off-by: Justin Ji <[email protected]>
@GeneDer please review this one more time.
Yea, still pending @jujipotle to address the comments.
Great work! LGTM!
Stamping in the last round.
…a structure (ray-project#52747) Signed-off-by: Justin Ji <[email protected]> Signed-off-by: weiran11 <[email protected]>
…a structure (ray-project#52747) Signed-off-by: Justin Ji <[email protected]> Signed-off-by: zhaoch23 <[email protected]>
…a structure (ray-project#52747) Signed-off-by: Justin Ji <[email protected]> Signed-off-by: iamjustinhsu <[email protected]>
…a structure (ray-project#52747) Signed-off-by: Justin Ji <[email protected]>
…a structure (ray-project#52747) Signed-off-by: Justin Ji <[email protected]> Signed-off-by: Vicky Tsang <[email protected]>
…a structure (ray-project#52747) Signed-off-by: Justin Ji <[email protected]> Signed-off-by: Scott Lee <[email protected]>
Why are these changes needed?
The default replica scheduler in ray.serve uses a Power-of-Two Choices strategy, which does not take advantage of LLMs' KV cache behavior during scheduling. To improve inference speed in LLM workloads, we introduce a prefix-aware scheduler specifically for ray.serve.llm.
This scheduler requires a data structure that can efficiently track the prefixes of text each replica (tenant) has processed, including when they were accessed, in order to route new requests to the most appropriate replica based on prefix match hit rate. We introduce a custom PrefixTree (approximate radix tree) data structure to serve this purpose, with efficient insertion, prefix matching, and LRU eviction.
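To make the routing idea concrete, here is a minimal sketch of prefix matching across tenants. It is not the PR's implementation: it uses a plain character trie rather than an approximate radix tree, it has no access timestamps or eviction, and `SimplePrefixTrie` and `TrieNode` are hypothetical names. The `prefix_match` return shape (matched text plus tenants, or `""` and `None` when no requested tenant is known) follows the test snippet quoted in the review.

```python
from typing import Dict, List, Optional, Set, Tuple


class TrieNode:
    """One character of stored text plus the tenants that have processed it."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.tenants: Set[str] = set()


class SimplePrefixTrie:
    """Hypothetical stand-in for PrefixTree: one node per character."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, text: str, tenant: str) -> None:
        # Mark every node along the path as seen by this tenant.
        node = self.root
        node.tenants.add(tenant)
        for ch in text:
            node = node.children.setdefault(ch, TrieNode())
            node.tenants.add(tenant)

    def prefix_match(
        self, text: str, available: Optional[List[str]] = None
    ) -> Tuple[str, Optional[List[str]]]:
        # Start from all known tenants, optionally restricted to a subset.
        candidates = set(self.root.tenants)
        if available is not None:
            candidates &= set(available)
        if not candidates:
            return "", None
        node = self.root
        matched = 0
        for ch in text:
            child = node.children.get(ch)
            if child is None:
                break
            narrowed = candidates & child.tenants
            if not narrowed:
                break  # no remaining candidate has seen this longer prefix
            candidates = narrowed
            node = child
            matched += 1
        return text[:matched], sorted(candidates)
```

A scheduler could then send a request to one of the returned tenants, since those replicas share the longest known prefix with the incoming prompt and are the most likely to have it in their KV cache.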
Features:
When a new replica comes online, PrefixTree.insert() will automatically add it to the tree. When a replica is removed, PrefixTree.remove_tenant() deletes all associated state. This makes up-scaling and down-scaling seamless for the scheduler.
The tree supports simultaneous tracking of all active replicas in a single shared structure. This allows the scheduler to make prefix-matching or eviction decisions across any subset of replicas while maintaining a single remote PrefixTree object.
The tree exposes a method to identify the tenant (replica) with the smallest KV cache footprint, allowing the scheduler to implement fallback heuristics based on cache size.
To simulate eviction behavior inside a replica's KV cache, the tree supports per-tenant LRU eviction using a doubly linked list. Insertions and evictions are O(1) per node.
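The per-tenant LRU bookkeeping described above can be sketched as follows. This is an assumption-laden illustration, not the PR's code: `TenantLRU` and `LRUNode` are hypothetical names, entries are keyed by their text instead of living inside tree nodes, and `evict` only mirrors the `min_remove_size` argument and the removed-character-count return value visible in the reviewed signature.

```python
from typing import Dict, Optional


class LRUNode:
    """A doubly linked list entry holding one cached text segment."""

    def __init__(self, text: str = "") -> None:
        self.text = text
        self.prev: Optional["LRUNode"] = None
        self.next: Optional["LRUNode"] = None


class TenantLRU:
    """Hypothetical per-tenant LRU list; the head side is most recently used."""

    def __init__(self) -> None:
        # Sentinel head and tail nodes remove the empty-list edge cases.
        self.head = LRUNode()
        self.tail = LRUNode()
        self.head.next = self.tail
        self.tail.prev = self.head
        self.nodes: Dict[str, LRUNode] = {}

    def _unlink(self, node: LRUNode) -> None:
        node.prev.next = node.next
        node.next.prev = node.prev

    def _push_front(self, node: LRUNode) -> None:
        node.prev = self.head
        node.next = self.head.next
        self.head.next.prev = node
        self.head.next = node

    def touch(self, text: str) -> None:
        """Insert a segment or mark it as most recently used, in O(1)."""
        node = self.nodes.get(text)
        if node is None:
            node = LRUNode(text)
            self.nodes[text] = node
        else:
            self._unlink(node)
        self._push_front(node)

    def evict(self, min_remove_size: int) -> int:
        """Drop least recently used segments until enough characters are freed."""
        removed = 0
        while removed < min_remove_size and self.tail.prev is not self.head:
            victim = self.tail.prev
            self._unlink(victim)  # O(1) per evicted node
            del self.nodes[victim.text]
            removed += len(victim.text)
        return removed
```

Because every entry carries its own `prev` and `next` pointers, moving it to the front or dropping it from the back never requires scanning the list, which is where the O(1) per-node cost comes from.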
The accompanying test_prefix_tree.py file provides a detailed and thorough test suite for the data structure, covering edge cases, eviction behavior, prefix matching, and structural invariants. Developers are encouraged to read the test cases to understand usage and guarantees.
This PR introduces a standalone PrefixTree module that will be integrated into serve.llm as part of the PrefixAwareReplicaScheduler in subsequent PRs.
Related issue number
Checks
… git commit -s) in this PR.
… scripts/format.sh to lint the changes in this PR.
… method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.