[serve.llm] Prefix-aware scheduler [1/N] Adding Prefix-aware tree data structure #52747

jujipotle · 2025-05-02T17:36:24Z

Why are these changes needed?

The default replica scheduler in ray.serve uses a Power-of-Two Choices strategy, which does not take advantage of LLMs' KV cache behavior during scheduling. To improve inference speed in LLM workloads, we introduce a prefix-aware scheduler specifically for ray.serve.llm.

This scheduler requires a data structure that can efficiently track the prefixes of text each replica (tenant) has processed — including when they were accessed — in order to route new requests to the most appropriate replica based on prefix match hit rate. We introduce a custom PrefixTree (approximate radix tree) data structure to serve this purpose, with efficient insertion, prefix matching, and LRU eviction.
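To make the idea concrete, here is a minimal sketch of prefix tracking per tenant. It is not the PR's implementation: it uses a plain character-level trie rather than a compressed (radix-style) tree, and `TrieNode`, `SimplePrefixTrie`, and the logical `clock` are hypothetical names chosen for illustration. Only the method names `insert`, `prefix_match`, and `remove_tenant`, and the `("", None)` no-match return shape, are taken from this PR's description and quoted tests.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        # tenant -> logical timestamp of that tenant's most recent access
        self.tenant_last_access: Dict[str, int] = {}


class SimplePrefixTrie:
    """Hypothetical, uncompressed stand-in for the PR's PrefixTree."""

    def __init__(self) -> None:
        self.root = TrieNode()
        self.clock = 0  # logical clock (assumption) instead of wall time

    def insert(self, text: str, tenant: str) -> None:
        """Record that `tenant` has processed `text`, marking every prefix."""
        self.clock += 1
        node = self.root
        for ch in text:
            node = node.children.setdefault(ch, TrieNode())
            node.tenant_last_access[tenant] = self.clock

    def prefix_match(
        self, text: str, tenants: Optional[List[str]] = None
    ) -> Tuple[str, Optional[List[str]]]:
        """Return the longest stored prefix of `text` and the tenants holding it."""
        allowed = set(tenants) if tenants is not None else None
        node = self.root
        matched_len = 0
        matched: Optional[List[str]] = None
        for ch in text:
            child = node.children.get(ch)
            if child is None:
                break
            owners = [
                t for t in child.tenant_last_access
                if allowed is None or t in allowed
            ]
            if not owners:
                break
            node, matched = child, owners
            matched_len += 1
        return text[:matched_len], matched

    def remove_tenant(self, tenant: str) -> None:
        """Drop all state for `tenant` and prune nodes no tenant uses."""
        stack = [self.root]
        while stack:
            node = stack.pop()
            for ch in list(node.children):
                child = node.children[ch]
                if tenant not in child.tenant_last_access:
                    continue  # insert marks whole paths, so the subtree is clean
                del child.tenant_last_access[tenant]
                if child.tenant_last_access:
                    stack.append(child)
                else:
                    del node.children[ch]  # unused by anyone: prune the subtree
```

With `"hello world"` inserted for one replica and `"hello there"` for another, `prefix_match("hello wonder")` in this sketch returns the matched text `"hello wo"` together with the first replica only, which is the signal a scheduler could use to prefer that replica.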

Features:

- Automatic support for autoscaling: When a new replica comes online, PrefixTree.insert() will automatically add it to the tree. When a replica is removed, PrefixTree.remove_tenant() deletes all associated state. This makes up-scaling and down-scaling seamless for the scheduler.
- Multi-replica tracking: The tree supports simultaneous tracking of all active replicas in a single shared structure. This allows the scheduler to make prefix-matching or eviction decisions across any subset of replicas while maintaining a single remote PrefixTree object.
- Smallest-tenant lookup: The tree exposes a method to identify the tenant (replica) with the smallest KV cache footprint, allowing the scheduler to implement fallback heuristics based on cache size.
- Efficient LRU eviction: To simulate eviction behavior inside a replica’s KV cache, the tree supports per-tenant LRU eviction using a doubly linked list. Insertions and evictions are O(1) per node.
- Comprehensive unit tests: The accompanying test_prefix_tree.py file provides a detailed and thorough test suite for the data structure, covering edge cases, eviction behavior, prefix matching, and structural invariants. Developers are encouraged to read the test cases to understand usage and guarantees.
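The LRU eviction and smallest-tenant features listed above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `LRUNode`, `TenantLRU`, `touch`, and `evict` are hypothetical names, and each tracked entry is reduced to a key plus a character count. Only the `min_remove_size` parameter name and the "return the number of characters removed" behavior echo the `evict_tenant_by_lru` signature quoted later in the review.

```python
from typing import Dict, Optional


class LRUNode:
    def __init__(self, key: str, size: int) -> None:
        self.key = key
        self.size = size  # e.g. number of characters this entry accounts for
        self.prev: Optional["LRUNode"] = None
        self.next: Optional["LRUNode"] = None


class TenantLRU:
    """Hypothetical per-tenant recency list: head side is most recent."""

    def __init__(self) -> None:
        # Sentinel nodes remove the empty-list special cases.
        self.head = LRUNode("<head>", 0)
        self.tail = LRUNode("<tail>", 0)
        self.head.next = self.tail
        self.tail.prev = self.head
        self.nodes: Dict[str, LRUNode] = {}
        self.total_size = 0

    def _unlink(self, node: LRUNode) -> None:
        node.prev.next = node.next
        node.next.prev = node.prev

    def _push_front(self, node: LRUNode) -> None:
        node.prev = self.head
        node.next = self.head.next
        self.head.next.prev = node
        self.head.next = node

    def touch(self, key: str, size: int) -> None:
        """Insert `key` or mark it most recently used, in O(1)."""
        node = self.nodes.get(key)
        if node is None:
            node = LRUNode(key, size)
            self.nodes[key] = node
            self.total_size += size
        else:
            self._unlink(node)
        self._push_front(node)

    def evict(self, min_remove_size: int) -> int:
        """Evict least recently used entries until at least
        `min_remove_size` is freed or the list is empty; return the amount freed."""
        removed = 0
        while removed < min_remove_size and self.tail.prev is not self.head:
            victim = self.tail.prev
            self._unlink(victim)
            del self.nodes[victim.key]
            self.total_size -= victim.size
            removed += victim.size
        return removed


def smallest_tenant(lrus: Dict[str, TenantLRU]) -> str:
    """Pick the tenant with the smallest tracked footprint (linear scan)."""
    return min(lrus, key=lambda tenant: lrus[tenant].total_size)
```

Because eviction removes whole entries, the amount freed can exceed `min_remove_size`; a caller that needs an exact budget would have to account for that overshoot.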

This PR introduces a standalone PrefixTree module that will be integrated into serve.llm as part of the PrefixAwareReplicaScheduler in subsequent PRs.

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Justin Ji <[email protected]>

kouroshHakha

Cool. Please add comprehensive descriptions to your PR (example). Left a few comments.

python/ray/llm/_internal/serve/deployments/routers/prefix_tree.py

kouroshHakha · 2025-05-04T19:50:52Z

python/ray/llm/tests/serve/cpu/deployments/test_prefix_tree.py

I let @GeneDer and @eicherseiji review the tests.

python/ray/llm/tests/serve/cpu/deployments/test_prefix_tree.py

GeneDer

Great work! I just kinda feel we shouldn't take the lock in the read path; it should only be in the write path. Also, I remember Cody proposed using 2 trees for this during brainstorming. Was there some other discussion that concluded we should just use one tree with locks? Is performance not impacted because of this?
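One generic way to keep locks out of the read path is copy-on-write publication: writers serialize on a lock, build an updated copy, and swap a single reference, while readers only ever look at an already-published snapshot. The sketch below illustrates that pattern on a flat prefix-to-tenants map. It is not what this PR does, and `WriteLockedIndex`, `insert`, and `tenants_for` are hypothetical names.

```python
import threading
from typing import Dict, FrozenSet


class WriteLockedIndex:
    """Hypothetical index: lock on writes only, lock-free snapshot reads."""

    def __init__(self) -> None:
        self._write_lock = threading.Lock()
        self._snapshot: Dict[str, FrozenSet[str]] = {}

    def insert(self, prefix: str, tenant: str) -> None:
        with self._write_lock:
            # Copy first; a published dict is never mutated in place.
            updated = dict(self._snapshot)
            updated[prefix] = updated.get(prefix, frozenset()) | {tenant}
            # Rebinding one attribute publishes the whole update at once.
            self._snapshot = updated

    def tenants_for(self, prefix: str) -> FrozenSet[str]:
        # No lock: grab the current snapshot once and read only from it.
        snapshot = self._snapshot
        return snapshot.get(prefix, frozenset())
```

The trade-off is that every write copies the top-level mapping, so this favors read-heavy workloads; readers may also observe a slightly stale snapshot.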

python/ray/llm/_internal/serve/deployments/routers/prefix_tree.py

GeneDer · 2025-05-06T00:39:17Z

python/ray/llm/_internal/serve/deployments/routers/prefix_tree.py

+
+            return removed_chars_len
+
+    def evict_tenant_by_LRU(self, tenant: str, min_remove_size: int) -> int:


Let's keep the method name all lowercase.

Suggested change

def evict_tenant_by_LRU(self, tenant: str, min_remove_size: int) -> int:

def evict_tenant_by_lru(self, tenant: str, min_remove_size: int) -> int:

Also who's the caller for this method? Should there be a background task running this?

python/ray/llm/_internal/serve/deployments/routers/prefix_tree.py

Signed-off-by: Justin Ji <[email protected]>

GeneDer

Code looks so much nicer, great job! I would recommend breaking up the test cases into individual tests instead of jamming them all into one test. You can further organize them into test classes if you see fit. For example, a test class TestPrefixMatch with multiple tests: one for test_no_match, another for test_non_existing_prefix, etc., so the tests are more manageable.
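A sketch of the suggested layout, in the style pytest collects by default (classes named `Test*`, methods named `test_*`). So that the example runs on its own, it tests a hypothetical stand-in, `TinyPrefixStore`, rather than the PR's PrefixTree; the class names and the two extra test names are assumptions, while `TestPrefixMatch`, `test_no_match`, and `test_non_existing_prefix` come from the comment above.

```python
import os
from typing import Dict, List, Optional, Tuple


class TinyPrefixStore:
    """Hypothetical stand-in with a PrefixTree-like surface, for illustration."""

    def __init__(self) -> None:
        self.texts: Dict[str, List[str]] = {}

    def insert(self, text: str, tenant: str) -> None:
        self.texts.setdefault(tenant, []).append(text)

    def prefix_match(self, text: str) -> Tuple[str, Optional[List[str]]]:
        best: str = ""
        owners: Optional[List[str]] = None
        for tenant, stored_texts in self.texts.items():
            for stored in stored_texts:
                # commonprefix compares character by character.
                common = os.path.commonprefix([stored, text])
                if len(common) > len(best):
                    best, owners = common, [tenant]
                elif common and len(common) == len(best) and tenant not in owners:
                    owners.append(tenant)
        return best, owners


class TestPrefixMatch:
    def test_no_match(self) -> None:
        # Empty store: nothing can match.
        tree = TinyPrefixStore()
        assert tree.prefix_match("hello") == ("", None)

    def test_non_existing_prefix(self) -> None:
        # Store has data, but the query shares no prefix with it.
        tree = TinyPrefixStore()
        tree.insert("hello", "tenant_1")
        assert tree.prefix_match("application") == ("", None)

    def test_shared_prefix_with_branches(self) -> None:
        tree = TinyPrefixStore()
        tree.insert("hello world", "tenant_1")
        tree.insert("hello there", "tenant_2")
        matched_text, matched_tenants = tree.prefix_match("hello friend")
        assert matched_text == "hello "
        assert matched_tenants == ["tenant_1", "tenant_2"]


class TestInsert:
    def test_insert_tracks_each_tenant_separately(self) -> None:
        tree = TinyPrefixStore()
        tree.insert("hello", "tenant_1")
        tree.insert("help", "tenant_2")
        assert tree.texts == {"tenant_1": ["hello"], "tenant_2": ["help"]}
```

Each test builds its own fresh store, so a failure points at one behavior and no shared state needs resetting between cases.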

python/ray/llm/_internal/serve/replica_scheduler/prefix_aware/prefix_tree.py

python/ray/llm/tests/serve/cpu/deployments/test_prefix_tree.py

GeneDer · 2025-05-08T03:21:13Z

python/ray/llm/tests/serve/cpu/deployments/test_prefix_tree.py

+    assert h_node.edge_label_to_child.get("w").text == "world"
+    assert h_node.edge_label_to_child.get("t").text == "there"
+
+    # 4. Test that inserting a longer prompt with shared prefix doesn't create empty text nodes


I would suggest breaking all of those out into multiple tests. That makes it easier to debug if anything fails, and there is no need to reset every time. Just make sure to give them meaningful names, like test_insert_long_prompt...

GeneDer · 2025-05-08T03:23:02Z

python/ray/llm/tests/serve/cpu/deployments/test_prefix_tree.py

+    matched_text, matched_tenants = tree.prefix_match("application", ["tenant_3"])
+    assert matched_text == "" and matched_tenants is None
+
+    # 7. Test shared prefix matching with branches


Similarly, break those out into their own tests.

python/ray/llm/_internal/serve/replica_scheduler/prefix_aware/prefix_tree.py

kouroshHakha · 2025-05-11T23:25:19Z

@GeneDer please review this one more time.

GeneDer · 2025-05-11T23:27:33Z

@GeneDer please review this one more time.

Yea, still pending @jujipotle to address the comments.

Signed-off-by: Justin Ji <[email protected]>

GeneDer

Great work! LGTM!

kouroshHakha

Stamping in the last round.

jujipotle added 5 commits May 2, 2025 10:35

Add prefix tree class

c691e8b

Signed-off-by: Justin Ji <[email protected]>

Linting

939aa7e

Signed-off-by: Justin Ji <[email protected]>

Add test cases

8eb8a9c

Signed-off-by: Justin Ji <[email protected]>

Implement eviction, tests passing

3e55d54

Signed-off-by: Justin Ji <[email protected]>

Linting

f1f6e95

Signed-off-by: Justin Ji <[email protected]>

kouroshHakha added the "go" label (add ONLY when ready to merge, run all tests) May 4, 2025

kouroshHakha reviewed May 4, 2025

kouroshHakha changed the title ~~[serve.llm] Add prefix tree class as precursor for prefix-aware scheduler~~ [serve.llm] Prefix-aware scheduler [1/N] Adding Prefix-aware tree data structure May 4, 2025

kouroshHakha reviewed May 4, 2025

python/ray/llm/tests/serve/cpu/deployments/test_prefix_tree.py (outdated)

jujipotle marked this pull request as ready for review May 5, 2025 17:07

jujipotle requested a review from a team as a code owner May 5, 2025 17:07

GeneDer reviewed May 6, 2025

jujipotle added 4 commits May 5, 2025 18:10

Address comments

ee9568d

Signed-off-by: Justin Ji <[email protected]>

Address comments

9c110bb

Signed-off-by: Justin Ji <[email protected]>

Clean up code, separate base class from serve deployment

8878844

Signed-off-by: Justin Ji <[email protected]>

linting

3e2e393

Signed-off-by: Justin Ji <[email protected]>

hainesmichaelc added the "community-contribution" label (Contributed by the community) May 7, 2025

jujipotle added 4 commits May 7, 2025 13:04

remove unnecessary instance variables

7374369

Signed-off-by: Justin Ji <[email protected]>

Update tests

a0db565

Signed-off-by: Justin Ji <[email protected]>

Add PrefixTreeActor

9fa20bb

Signed-off-by: Justin Ji <[email protected]>

Edit comments

c0bb33b

Signed-off-by: Justin Ji <[email protected]>

GeneDer reviewed May 8, 2025

Doubly linked list instead of min-heap, don't have insert return Node because of pickle error

42d6938

Signed-off-by: Justin Ji <[email protected]>

kouroshHakha reviewed May 9, 2025

kouroshHakha requested a review from GeneDer May 11, 2025 23:25

Fix LRU linked list implementation and clean up tests

0e0bb82

Signed-off-by: Justin Ji <[email protected]>

GeneDer approved these changes May 12, 2025

masoudcharkhabi added the "serve" label (Ray Serve Related Issue) May 12, 2025

masoudcharkhabi added the performance label May 12, 2025

kouroshHakha removed the "community-contribution" label (Contributed by the community) May 12, 2025

kouroshHakha approved these changes May 12, 2025

kouroshHakha merged commit ba1a8df into ray-project:master May 12, 2025
5 checks passed

lk-chen pushed a commit to lk-chen/ray that referenced this pull request May 17, 2025

[serve.llm] Prefix-aware scheduler [1/N] Adding Prefix-aware tree data structure (ray-project#52747)

f53b6fa

Signed-off-by: Justin Ji <[email protected]>

hainesmichaelc added the community-backlog label May 22, 2025

vickytsang pushed a commit to ROCm/ray that referenced this pull request Jun 3, 2025

[serve.llm] Prefix-aware scheduler [1/N] Adding Prefix-aware tree data structure (ray-project#52747)

e780c47

Signed-off-by: Justin Ji <[email protected]> Signed-off-by: Vicky Tsang <[email protected]>

