fix: Tokenizer memory leak on initial high QPS #368
Open
albertoperdomo2 wants to merge 2 commits into llm-d:release/v0.4.0 from
Conversation
…zers Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>

Summary
This PR fixes a tokenizer memory leak at startup under high concurrency and makes cache eviction safe for in-flight requests. Profiling during recent v0.4.0 testing narrowed the issue down to two main bugs:
1. singleflight load results were not reliably retained in the tokenizer LRU cache, allowing repeated same-model loads during startup bursts. Cache population now always happens after a successful singleflight load, so one model load is reused by subsequent requests.
2. Tokenizers were never Closed properly. The cache is now created with lru.NewWithEvict(...) so cleanup logic runs on eviction.

Eviction cleanup alone could still close a tokenizer while active requests were using it, so a refcount now manages each tokenizer's lifecycle and guards its Close; a sketch of the combined mechanism follows.
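To make the interaction of these fixes concrete, here is a minimal sketch, not the PR's actual code. It assumes the hashicorp/golang-lru/v2 and golang.org/x/sync/singleflight APIs (the lru.NewWithEvict(...) call above suggests the former) and a hypothetical Tokenizer interface; the Cache, entry, and tryAcquire names are illustrative only.

```go
package tokencache

import (
	"sync/atomic"

	lru "github.com/hashicorp/golang-lru/v2"
	"golang.org/x/sync/singleflight"
)

// Tokenizer stands in for the real tokenizer type; only Close matters here.
type Tokenizer interface {
	Close()
}

// entry pairs a tokenizer with a reference count. The count starts at 1
// (the cache's own reference) and gains one per in-flight user.
type entry struct {
	tok  Tokenizer
	refs atomic.Int64
}

// tryAcquire takes a reference unless the count already hit zero, which
// means the entry was evicted and closed between lookup and acquire.
func (e *entry) tryAcquire() bool {
	for {
		n := e.refs.Load()
		if n == 0 {
			return false
		}
		if e.refs.CompareAndSwap(n, n+1) {
			return true
		}
	}
}

// release drops a reference and closes the tokenizer once nobody,
// including the cache itself, still holds one.
func (e *entry) release() {
	if e.refs.Add(-1) == 0 {
		e.tok.Close()
	}
}

type Cache struct {
	group  singleflight.Group
	lru    *lru.Cache[string, *entry]
	loadFn func(model string) (Tokenizer, error)
}

func New(size int, load func(model string) (Tokenizer, error)) (*Cache, error) {
	c := &Cache{loadFn: load}
	// Eviction only drops the cache's reference; the last active request
	// performs the actual Close via release.
	l, err := lru.NewWithEvict[string, *entry](size, func(_ string, e *entry) {
		e.release()
	})
	if err != nil {
		return nil, err
	}
	c.lru = l
	return c, nil
}

// Get returns a tokenizer and a release func the caller must invoke when
// done. Concurrent first requests for the same model share one load via
// singleflight, and the result is always cached on success.
func (c *Cache) Get(model string) (Tokenizer, func(), error) {
	for {
		// Fast path: cached and still alive.
		if e, ok := c.lru.Get(model); ok && e.tryAcquire() {
			return e.tok, e.release, nil
		}
		v, err, _ := c.group.Do(model, func() (any, error) {
			// Re-check: another goroutine may have populated the cache.
			if e, ok := c.lru.Get(model); ok {
				return e, nil
			}
			tok, err := c.loadFn(model)
			if err != nil {
				return nil, err
			}
			e := &entry{tok: tok}
			e.refs.Store(1) // the cache's reference, dropped on eviction
			c.lru.Add(model, e)
			return e, nil
		})
		if err != nil {
			return nil, nil, err
		}
		if e := v.(*entry); e.tryAcquire() {
			return e.tok, e.release, nil
		}
		// Evicted before we could take a reference; retry.
	}
}
```

Under these assumptions, callers would do `tok, release, err := cache.Get(model)` followed by `defer release()`: eviction then only drops the cache's reference, and the last in-flight request performs the real Close.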
Test plan
Build a dev image of the llm-d-inference-scheduler at rh-ee-aperdomo/llm-d-inference-scheduler:v0.4.0 with the fix, and rerun the benchmarks that exposed the leak in the first place.
Related issues