run tokenizer warmup in async mode #3471
Conversation
Pull request overview
This PR updates the tokenizer initialization warmup to run inference asynchronously, aiming to reduce startup blocking and improve TTFT by warming tokenizer caches without waiting for completion.
Changes:
- Replaced the synchronous warmup (`encode("non empty string")`) with an async `InferRequest::start_async()` warmup during `TokenizerImpl::setup_tokenizer()`.
- Added a scoped warmup block that prepares an input tensor and starts async inference.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
```cpp
// TODO CVS-150630: Empty strings sporadically can fail, therefore use nonempty string for warmup.
// shared_ptr to keep input data alive until async request is finished
auto warmup_text = std::make_shared<std::string>("non empty string");
auto warmup_tensor = ov::Tensor(ov::element::string, ov::Shape{1}, warmup_text.get());

req.set_input_tensor(0, warmup_tensor);
if (is_paired_input) {
    // Set to an empty tensor to avoid errors.
    // The subgraph within the ov::Model will handle this scenario, ensuring the output remains correct.
    req.set_input_tensor(1, ov::Tensor{ov::element::string, {0}});
}

req.set_callback([queue = m_ireq_queue_tokenizer.get(), idx, warmup_text](std::exception_ptr) {
    // this empty placeholder keeps input data alive until request is finished
    (void) warmup_text;
    queue->return_to(idx);
});
req.start_async();
```
idx is reserved from m_ireq_queue_tokenizer and only returned in the async callback. If any of set_input_tensor, set_callback, or start_async() throws, the slot will never be returned to the pool, permanently shrinking the queue and potentially deadlocking future get_idle() calls. Add exception-safety (e.g., a scope guard/try-catch that calls return_to(idx) on failure before rethrowing).
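The scope-guard variant of this fix can be sketched with the standard library alone. The `SlotPool` below is a hypothetical stand-in for the real request queue (not the actual `CircularBufferQueue` type); the point is only that the guard returns the slot on any exception thrown before `start_async()`, and is dismissed once the async callback takes over ownership:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <vector>

// Minimal scope guard: runs the stored callable on scope exit unless dismissed.
struct ScopeGuard {
    std::function<void()> on_exit;
    bool active = true;
    ~ScopeGuard() { if (active && on_exit) on_exit(); }
    void dismiss() { active = false; }
};

// Toy stand-in for the infer-request slot pool (hypothetical name and shape).
struct SlotPool {
    std::vector<int> idle{0, 1, 2};
    int take() { int idx = idle.back(); idle.pop_back(); return idx; }
    void return_to(int idx) { idle.push_back(idx); }
};

// Reserves a slot and simulates the warmup setup; if `setup_throws`, the
// guard returns the slot before the exception propagates, so the pool
// never shrinks. Returns the number of idle slots at the end.
std::size_t warmup_with_guard(SlotPool& pool, bool setup_throws) {
    int idx = pool.take();
    ScopeGuard guard{[&pool, idx] { pool.return_to(idx); }};
    if (setup_throws)
        throw std::runtime_error("set_input_tensor failed");
    // start_async() succeeded: from here the async callback owns the slot.
    guard.dismiss();
    pool.return_to(idx);  // in the real code this happens in the callback
    return pool.idle.size();
}
```

Either the guard or the explicit try/catch works; the guard avoids duplicating the `return_to` call for each early-exit path.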
Suggested change:

```cpp
bool return_slot_on_failure = true;
try {
    // TODO CVS-150630: Empty strings sporadically can fail, therefore use nonempty string for warmup.
    // shared_ptr to keep input data alive until async request is finished
    auto warmup_text = std::make_shared<std::string>("non empty string");
    auto warmup_tensor = ov::Tensor(ov::element::string, ov::Shape{1}, warmup_text.get());
    req.set_input_tensor(0, warmup_tensor);
    if (is_paired_input) {
        // Set to an empty tensor to avoid errors.
        // The subgraph within the ov::Model will handle this scenario, ensuring the output remains correct.
        req.set_input_tensor(1, ov::Tensor{ov::element::string, {0}});
    }
    req.set_callback([queue = m_ireq_queue_tokenizer.get(), idx, warmup_text](std::exception_ptr) {
        // this empty placeholder keeps input data alive until request is finished
        (void) warmup_text;
        queue->return_to(idx);
    });
    req.start_async();
    // After successful start_async, the callback is responsible for returning the slot.
    return_slot_on_failure = false;
} catch (...) {
    if (return_slot_on_failure) {
        m_ireq_queue_tokenizer->return_to(idx);
    }
    throw;
}
```
```cpp
req.set_callback([queue = m_ireq_queue_tokenizer.get(), idx, warmup_text](std::exception_ptr) {
    // this empty placeholder keeps input data alive until request is finished
    (void) warmup_text;
    queue->return_to(idx);
});
```
The async warmup callback captures m_ireq_queue_tokenizer.get() as a raw pointer and calls queue->return_to(idx). If TokenizerImpl (and its m_ireq_queue_tokenizer) is destroyed before the warmup completes, the callback will dereference a dangling pointer (UAF). Consider ensuring warmup completion before destroying the queue (e.g., track a future/promise and wait in teardown) or avoid referencing the queue from the callback by using a dedicated warmup request with well-defined lifetime management.
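The "wait in teardown" option can be sketched with `std::promise`/`std::future` and no OpenVINO types. `WarmupOwner` below is a hypothetical stand-in for `TokenizerImpl`: its destructor blocks until the warmup callback has run, so the callback can never touch destroyed members:

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <future>
#include <memory>
#include <thread>

// Toy owner modelling TokenizerImpl: launches an async warmup and, in its
// destructor, waits for the warmup callback before any member is destroyed.
// (Sketch only; the real class would wrap ov::InferRequest and the queue.)
class WarmupOwner {
public:
    explicit WarmupOwner(std::shared_ptr<std::atomic<bool>> callback_ran)
        : callback_ran_(std::move(callback_ran)) {
        warmup_done_ = warmup_promise_.get_future();
        // Simulates start_async(): the "callback" fulfils the promise last.
        worker_ = std::thread([this] {
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
            callback_ran_->store(true);   // stands in for queue->return_to(idx)
            warmup_promise_.set_value();
        });
    }
    ~WarmupOwner() {
        warmup_done_.wait();  // ensure the callback finished before teardown
        worker_.join();
    }
private:
    std::shared_ptr<std::atomic<bool>> callback_ran_;
    std::promise<void> warmup_promise_;
    std::future<void> warmup_done_;
    std::thread worker_;
};

// Returns true iff the callback completed before the owner finished tearing down.
bool teardown_waited_for_warmup() {
    auto flag = std::make_shared<std::atomic<bool>>(false);
    { WarmupOwner owner(flag); }  // destructor blocks until warmup completes
    return flag->load();
}
```

`set_value()` synchronizes with `wait()`, so the store to `callback_ran_` is guaranteed visible before the destructor proceeds.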
```cpp
m_bos_token = decode(std::vector{m_bos_token_id}, {ov::genai::skip_special_tokens(false)});
if (m_eos_token_id != -1 && m_eos_token.empty())
    m_eos_token = decode(std::vector{m_eos_token_id}, {ov::genai::skip_special_tokens(false)});
```
Force-pushed from `41f686e` to `9117c1e`.
```cpp
int idx = m_ireq_queue_tokenizer->get_idle().get();
auto& req = m_ireq_queue_tokenizer->get(idx);

// TODO CVS-150630: Empty strings sporadically can fail, therefore use nonempty string for warmup.
// shared_ptr to keep input data alive until async request is finished
auto warmup_text = std::make_shared<std::string>("non empty string");
auto warmup_tensor = ov::Tensor(ov::element::string, ov::Shape{1}, warmup_text.get());

req.set_input_tensor(0, warmup_tensor);
if (is_paired_input) {
    // Set to an empty tensor to avoid errors.
    // The subgraph within the ov::Model will handle this scenario, ensuring the output remains correct.
    req.set_input_tensor(1, ov::Tensor{ov::element::string, {0}});
}

req.set_callback([queue = m_ireq_queue_tokenizer.get(), idx, warmup_text, &req](std::exception_ptr) {
    // this empty placeholder keeps input data alive until request is finished
    (void) warmup_text;
    queue->return_to(idx);
    req.set_callback({});
});
req.start_async();
```

Note: `&req` must be in the capture list for the `req.set_callback({})` call inside the lambda to compile, matching the detokenizer variant below.
```cpp
int idx = m_ireq_queue_detokenizer->get_idle().get();
auto& req = m_ireq_queue_detokenizer->get(idx);

// shared_ptr to keep input data alive until async request is finished
auto warmup_tokens = std::make_shared<std::vector<int64_t>>(
    std::initializer_list<int64_t>{1, 33, 199, 42, 42}
);

auto warmup_tensor = ov::Tensor(ov::element::i64, ov::Shape{1, warmup_tokens->size()}, warmup_tokens->data());
req.set_input_tensor(0, warmup_tensor);

req.set_callback([queue = m_ireq_queue_detokenizer.get(), idx, warmup_tokens, &req](std::exception_ptr) {
    // this empty placeholder keeps input data alive until request is finished
    (void) warmup_tokens;
    queue->return_to(idx);
    req.set_callback({});
});
req.start_async();
```
Force-pushed from `9117c1e` to `a0c0937`.

Commit `4e0670f`
Description
Run the warmup inference in async mode, so that tokenizer setup is not blocked until the inference finishes. This is done to improve TTFT.

Tickets: CVS-180365, CVS-180801
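The TTFT gain comes purely from not blocking setup on the warmup. A stdlib-only sketch of the before/after behaviour (the `warmup_inference` stand-in and the 100 ms cost are assumptions, not measurements from this PR):

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

using Clock = std::chrono::steady_clock;

// Stand-in for one warmup inference (e.g. encoding "non empty string").
void warmup_inference() {
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
}

// Old behaviour: setup blocks until the warmup finishes.
std::future<void> setup_sync() {
    warmup_inference();
    std::promise<void> p;
    p.set_value();
    return p.get_future();  // already completed
}

// New behaviour: setup fires the warmup and returns immediately;
// the returned future completes when the warmup does.
std::future<void> setup_async() {
    return std::async(std::launch::async, warmup_inference);
}
```

In the real PR the completion signal is the request callback rather than a `std::future`, but the startup-latency trade is the same.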
Checklist: