
fix: Tokenizer memory leak on initial high QPS#368

Open
albertoperdomo2 wants to merge 2 commits into llm-d:release/v0.4.0 from albertoperdomo2:v0.4.0-fix

Conversation

@albertoperdomo2
Contributor

Summary

This PR fixes a tokenizer memory leak at startup under high concurrency and makes cache eviction safe for in-flight requests. Profiling during recent v0.4.0 testing narrowed the issue down to three bugs:

  1. singleflight load results were not reliably retained in the tokenizer LRU cache, allowing repeated loads of the same model during startup bursts.
    The cache is now always populated after a successful singleflight load, so a single model load is reused by all subsequent requests.

  2. Tokenizers were never closed, so their memory was never released.
    The cache is now created with lru.NewWithEvict(...) so cleanup logic runs on eviction.

  3. Eviction cleanup could close a tokenizer while it was still in use by active requests.
    Tokenizer lifecycles are now managed with a reference count, so a tokenizer is closed only after the last in-flight request releases it.
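Bug 1 is essentially a missing cache-population step after the deduplicated load. Below is a minimal stdlib-only sketch of the intended behavior; the actual PR uses golang.org/x/sync/singleflight plus an LRU cache, and the names here (loadModel, cache, etc.) are illustrative, not the real identifiers:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// loadCount tracks how many real loads happen; with the fix, a startup
// burst of N requests for one model performs exactly one load.
var loadCount atomic.Int64

// loadModel stands in for the expensive tokenizer load.
func loadModel(model string) string {
	loadCount.Add(1)
	return "tokenizer:" + model
}

// call represents one in-flight load that other goroutines wait on,
// mimicking what singleflight.Group.Do provides.
type call struct {
	wg  sync.WaitGroup
	val string
}

type cache struct {
	mu       sync.Mutex
	entries  map[string]string // stands in for the LRU cache
	inflight map[string]*call
}

func newCache() *cache {
	return &cache{entries: map[string]string{}, inflight: map[string]*call{}}
}

func (c *cache) get(model string) string {
	c.mu.Lock()
	if v, ok := c.entries[model]; ok { // fast path: already cached
		c.mu.Unlock()
		return v
	}
	if cl, ok := c.inflight[model]; ok { // someone is loading it: wait
		c.mu.Unlock()
		cl.wg.Wait()
		return cl.val
	}
	cl := &call{}
	cl.wg.Add(1)
	c.inflight[model] = cl
	c.mu.Unlock()

	cl.val = loadModel(model)

	c.mu.Lock()
	// The crucial step: retain the successful result in the cache
	// *before* dropping the in-flight marker, so no request can slip
	// through the gap and trigger a duplicate load.
	c.entries[model] = cl.val
	delete(c.inflight, model)
	c.mu.Unlock()
	cl.wg.Done()
	return cl.val
}

func main() {
	c := newCache()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ { // simulate a startup burst
		wg.Add(1)
		go func() { defer wg.Done(); c.get("llama") }()
	}
	wg.Wait()
	fmt.Println("loads:", loadCount.Load()) // prints "loads: 1"
}
```

The ordering under the second lock acquisition is what was missing: if the in-flight marker is dropped before the cache entry is written, a concurrent request can observe neither and start a second load.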
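For bug 3, the cleanup callback registered via lru.NewWithEvict must not close a tokenizer that active requests still hold. A hedged sketch of the reference-counting scheme, with illustrative type and method names rather than the PR's actual identifiers:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// tokenizer stands in for the real tokenizer; Close frees its memory.
type tokenizer struct {
	model  string
	closed atomic.Bool
}

func (t *tokenizer) Close() { t.closed.Store(true) }

// refTokenizer couples a tokenizer with a reference count. The cache
// itself holds one reference; each in-flight request holds another.
type refTokenizer struct {
	tok     *tokenizer
	refs    atomic.Int64
	evicted atomic.Bool
}

func newRefTokenizer(t *tokenizer) *refTokenizer {
	r := &refTokenizer{tok: t}
	r.refs.Store(1) // the cache's own reference
	return r
}

// acquire is called when a request checks the tokenizer out of the cache.
func (r *refTokenizer) acquire() *tokenizer {
	r.refs.Add(1)
	return r.tok
}

// release drops one reference; the last holder closes the tokenizer.
func (r *refTokenizer) release() {
	if r.refs.Add(-1) == 0 {
		r.tok.Close()
	}
}

// onEvict is the eviction hook: instead of closing unconditionally, it
// drops only the cache's reference, so a tokenizer still held by
// requests survives until the last of them releases it.
func (r *refTokenizer) onEvict() {
	if r.evicted.CompareAndSwap(false, true) {
		r.release()
	}
}

func main() {
	r := newRefTokenizer(&tokenizer{model: "llama"})
	tok := r.acquire() // a request starts using the tokenizer
	r.onEvict()        // cache evicts it while the request is in flight
	fmt.Println("closed after evict:", tok.closed.Load())   // false
	r.release()        // the request finishes
	fmt.Println("closed after release:", tok.closed.Load()) // true
}
```

The CompareAndSwap guard makes eviction idempotent, so a repeated eviction cannot drop the cache's reference twice.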

Test plan

Create a dev image of the llm-d-inference-scheduler with the fix at rh-ee-aperdomo/llm-d-inference-scheduler:v0.4.0, and re-run the benchmarks that surfaced the leak in the first place.


Related issues

…zers

Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
@albertoperdomo2
Contributor Author

cc: @vMaroon @Gregory-Pereira

@albertoperdomo2
Contributor Author

EPP container memory usage


