fix: Tokenizer initialization race condition (multiple parallel downloads) #291
albertoperdomo2 wants to merge 3 commits into llm-d:main
Conversation
Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
| acquired_locks.append(lock) |
| _tokenizer_cache.clear() |
| return "Tokenizer caches cleared" |
does this mean you will never have more than 1 item in acquired_locks given you return immediately after acquiring one?
This should be outside of the loop, my bad.
| _tokenizer_cache[key] = tokenizer |
| return key |
| lock = _cache_locks[key] |
don't we need a lock to synchronize access to the _cache_locks dict?
You are right, that would be necessary. This came from my approach of per-model locking, but if we do not need that, we can simplify it to a single global lock.
| _tokenizer_cache.clear() |
| return "Tokenizer caches cleared" |
| # Sorted locks to avoid deadlock |
| keys_to_lock = sorted(_cache_locks.keys()) |
It looks like we never insert any item into the _cache_locks map (relying on defaultdict to get the lock). So this will always be empty?
It is a defaultdict(), so if you access a key that does not exist, it is created automatically on first access.
Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
Does this still occur with the new approach of loading the tokenizer at module initialization rather than after requests come in? I might have missed closing that issue. cc @sagearc

I could certainly reproduce the issue (3 days ago).

Unsigned commits detected! Please sign your commits. For instructions on how to set up GPG/SSH signing and verify your commits, please see the GitHub documentation.

@vMaroon Initially, the tokenizer initialization at startup was intended for
Summary
Under high QPS during cold start, multiple concurrent requests triggered duplicate tokenizer downloads due to a check-then-act race condition. This is mentioned in #191, and it caused:
The steps I followed to verify the condition with meta-llama/Llama-3.2-1B are:
Ensure the socket directory exists: `mkdir -p /tmp/tokenizer`
Start the gRPC service.
In a new terminal window, I ran a script to verify the race condition existed.
To solve it, I implemented double-checked locking with per-model locks: a fast path (cache hit) with zero locking overhead, and a slow path (cache miss) that uses `threading.Lock` to ensure only one thread downloads each model.
Test plan
The initial runs of the script (which launches 50 requests at the same time) reported:
with several errors in the service logs like:
And after the proposed fix:
This confirms that the initial memory leak is gone, which also directly impacts performance.
Related issues