
fix: Tokenizer initialization race condition (multiple parallel downloads) #291

Open

albertoperdomo2 wants to merge 3 commits into llm-d:main from albertoperdomo2:fix/tokenizer-initialization-leak

Conversation

albertoperdomo2 (Contributor) commented Feb 4, 2026

Summary

Under high QPS during cold start, multiple concurrent requests triggered duplicate tokenizer downloads due to a check-then-act race condition. This was mentioned in #191 and caused:

  • Parallel downloads for a single model tokenizer
  • High request failure rate from file corruption
  • High memory usage (for parallel downloads)

The steps I followed to verify the condition with meta-llama/Llama-3.2-1B are:

1. Create a venv and install requirements:

   ```shell
   cd services/uds_tokenizer
   python3 -m venv venv
   source venv/bin/activate
   pip install -r requirements.txt
   pip install psutil  # for memory monitoring
   ```

2. Clear the tokenizer cache:

   ```shell
   rm -rf ~/.cache/huggingface/hub/models--*
   rm -rf models/meta-llama
   ```

3. Ensure the socket directory exists:

   ```shell
   mkdir -p /tmp/tokenizer
   ```

4. Start the gRPC service:

   ```shell
   python run_grpc_server.py
   ```

In a new terminal window, I ran a script to verify the race condition existed.
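The PR does not include the script itself, so here is a minimal sketch of what such a load generator could look like: it releases N requests simultaneously via a barrier to maximize the chance of hitting the check-then-act window. The `send_request` callable is a hypothetical stand-in for the real gRPC call over the Unix socket.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def run_load_test(send_request, n=50):
    # All n workers block on the barrier, then fire their requests at the
    # same instant, stressing the tokenizer-initialization path.
    barrier = threading.Barrier(n)

    def worker():
        barrier.wait()  # synchronize the start of every request
        return send_request("meta-llama/Llama-3.2-1B")

    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(worker) for _ in range(n)]
        return [f.result() for f in futures]
```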

To solve it, I implemented double-checked locking with per-model locks: a fast path (cache hit) with zero locking overhead, and a slow path (cache miss) using threading.Lock to ensure only one thread downloads each model.

Test plan

The initial runs of the script (which launches 50 requests at the same time) reported:

```
============================================================
Memory Summary
============================================================
Initial:  75.7 MB
Peak:     494.5 MB
Final:    298.6 MB
Increase: 223.0 MB

Total test duration: 31.33s
```

with several errors in the service logs like:

```
2026-02-04 10:30:15,891 [ERROR] [root] Failed to load tokenizer from /<path>/<to>/llm-d-kv-cache/services/uds_tokenizer/models/meta-llama/Llama-3.2-1B: No such file or directory (os error 2)
2026-02-04 10:30:15,891 [ERROR] [root] Failed to initialize tokenizer for model meta-llama/Llama-3.2-1B: Failed to load tokenizer: No such file or directory (os error 2)
2026-02-04 10:30:20,153 [ERROR] [root] Failed to initialize tokenizer for model meta-llama/Llama-3.2-1B: Failed to load tokenizer: ...
```

And after the proposed fix:

```
============================================================
Memory Summary
============================================================
Initial:  75.3 MB
Peak:     173.7 MB
Final:    173.7 MB
Increase: 98.4 MB

Total test duration: 2.16s
```

This confirms the initial memory leak is gone, which directly improves performance as well (31.33s down to 2.16s).

Related issues

Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
```python
acquired_locks.append(lock)

_tokenizer_cache.clear()
return "Tokenizer caches cleared"
```
Collaborator:

does this mean you will never have more than 1 item in acquired_locks given you return immediately after acquiring one?

albertoperdomo2 (author):

This should be outside of the loop, my bad.

```python
_tokenizer_cache[key] = tokenizer
return key

lock = _cache_locks[key]
```
Collaborator:

don't we need a lock to synchronize access to the _cache_locks dict?

albertoperdomo2 (author):

You are right, that would be necessary. It follows from my choice of per-model locking; if we do not need that granularity, we can simplify to a single global lock.
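One way to do what the review suggests is to guard only the lock-map access with a short global lock, keeping per-model downloads parallel. This is a sketch under that assumption, with illustrative names:

```python
import threading
from collections import defaultdict

_cache_locks = defaultdict(threading.Lock)
_locks_guard = threading.Lock()  # serializes mutation of the lock map itself

def get_model_lock(model):
    # Hold the global lock only long enough to fetch or create the entry,
    # so threads working on different models do not block each other.
    with _locks_guard:
        return _cache_locks[model]
```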

```python
_tokenizer_cache.clear()
return "Tokenizer caches cleared"
# Sorted locks to avoid deadlock
keys_to_lock = sorted(_cache_locks.keys())
```
Collaborator:

It looks like we never insert any item to _cache_locks map (relying on defaultdict to get the lock). So this will always be empty?

albertoperdomo2 (author) commented Feb 6, 2026:

It is a defaultdict() so if you call it and the key does not exist, it will automatically create it on first access.

Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
@vMaroon vMaroon requested a review from liu-cong February 6, 2026 22:15
vMaroon (Member) commented Feb 7, 2026

Does this still occur with the new approach of loading tokenizer on module initialization and not after requests come in? Might have missed closing that issue. cc @sagearc

albertoperdomo2 (author) commented Feb 7, 2026

I could certainly reproduce the issue (3 days ago).

@vMaroon vMaroon requested a review from liu-cong February 13, 2026 21:52
github-actions:

Unsigned commits detected! Please sign your commits.

For instructions on how to set up GPG/SSH signing and verify your commits, please see GitHub Documentation.

sagearc (Collaborator) commented Feb 17, 2026

@vMaroon Initially, the tokenizer initialization at startup was intended for daulet/tokenizers. I'm unsure of the status for tokenizers initialized within the chat template python file. If I remember correctly, even the python interpreter itself is not safely initialized (it should only happen once) and is currently prone to race conditions within the CGO bindings file.
Soon it won't be an issue.

@albertoperdomo2 albertoperdomo2 changed the base branch from main to release/v0.4.0 February 26, 2026 21:39
@albertoperdomo2 albertoperdomo2 changed the base branch from release/v0.4.0 to main February 27, 2026 09:04
