
run tokenizer warmup in async mode#3471

Merged
pavel-esir merged 16 commits into openvinotoolkit:master from pavel-esir:asycn_warmup
Mar 13, 2026

Conversation

@pavel-esir
Contributor

@pavel-esir pavel-esir commented Mar 10, 2026

Description

Run the warmup inference in async mode, so that initialization is not blocked until the inference ends. This is done to improve TTFT (time to first token).

CVS-180365 CVS-180801
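As a rough illustration of why this helps TTFT, here is a minimal sketch (hypothetical, not the PR's code: `std::async` stands in for OpenVINO's `ov::InferRequest::start_async()`) contrasting blocking and non-blocking warmup:

```cpp
// Hypothetical sketch: std::async stands in for ov::InferRequest::start_async().
// A blocking warmup delays readiness by the full inference time, while an async
// warmup lets setup return immediately and the work finish in the background.
#include <atomic>
#include <chrono>
#include <future>
#include <thread>

std::atomic<int> warmups_done{0};

// Simulated warmup inference taking ~50 ms.
void warmup_inference() {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    warmups_done.fetch_add(1);
}

// Blocking variant: the caller waits for the warmup to finish.
void setup_blocking() {
    warmup_inference();
}

// Async variant: start the warmup and return at once; the returned future
// keeps the background task alive until it completes.
std::future<void> setup_async() {
    return std::async(std::launch::async, warmup_inference);
}
```

With the async variant, the first user request can be submitted while the warmup is still in flight, which is the effect the PR is after.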

Checklist:

  • This PR follows GenAI Contributing guidelines.
  • Tests have been updated or added to cover the new code. Almost every existing test already exercises this code path.
  • This PR fully addresses the ticket.
  • I have made corresponding changes to the documentation. No update is needed; this is an internal optimization, not a public API change.

@pavel-esir pavel-esir added this to the 2026.1 milestone Mar 10, 2026
@pavel-esir pavel-esir requested a review from apaniukov as a code owner March 10, 2026 14:46
Copilot AI review requested due to automatic review settings March 10, 2026 14:46
@github-actions github-actions bot added the "category: tokenizers" (Tokenizer class or submodule update) label Mar 10, 2026

Copilot AI left a comment


Pull request overview

This PR updates the tokenizer initialization warmup to run inference asynchronously, aiming to reduce startup blocking and improve TTFT by warming tokenizer caches without waiting for completion.

Changes:

  • Replaced synchronous warmup (encode("non empty string")) with an async InferRequest::start_async() warmup during TokenizerImpl::setup_tokenizer().
  • Added a scoped warmup block that prepares an input tensor and starts async inference.


Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.


Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.

Copilot AI review requested due to automatic review settings March 10, 2026 19:16
@pavel-esir pavel-esir requested a review from apaniukov March 10, 2026 19:18

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.

Copilot AI review requested due to automatic review settings March 11, 2026 08:55

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Copilot AI review requested due to automatic review settings March 11, 2026 09:22
Copilot AI review requested due to automatic review settings March 11, 2026 11:28

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Copilot AI review requested due to automatic review settings March 11, 2026 14:32

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.


Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.

Comment on lines +430 to +449

// TODO CVS-150630: Empty strings sporadically can fail, therefore use nonempty string for warmup.
// shared_ptr to keep input data alive until async request is finished
auto warmup_text = std::make_shared<std::string>("non empty string");
auto warmup_tensor = ov::Tensor(ov::element::string, ov::Shape{1}, warmup_text.get());

req.set_input_tensor(0, warmup_tensor);
if (is_paired_input) {
    // Set to an empty tensor to avoid errors.
    // The subgraph within the ov::Model will handle this scenario, ensuring the output remains correct.
    req.set_input_tensor(1, ov::Tensor{ov::element::string, {0}});
}

req.set_callback([queue = m_ireq_queue_tokenizer.get(), idx, warmup_text](std::exception_ptr) {
    // this empty placeholder keeps input data alive until request is finished
    (void) warmup_text;
    queue->return_to(idx);
});
req.start_async();

Copilot AI Mar 12, 2026


idx is reserved from m_ireq_queue_tokenizer and only returned in the async callback. If any of set_input_tensor, set_callback, or start_async() throws, the slot will never be returned to the pool, permanently shrinking the queue and potentially deadlocking future get_idle() calls. Add exception-safety (e.g., a scope guard/try-catch that calls return_to(idx) on failure before rethrowing).

Suggested change

Current:

    // TODO CVS-150630: Empty strings sporadically can fail, therefore use nonempty string for warmup.
    // shared_ptr to keep input data alive until async request is finished
    auto warmup_text = std::make_shared<std::string>("non empty string");
    auto warmup_tensor = ov::Tensor(ov::element::string, ov::Shape{1}, warmup_text.get());
    req.set_input_tensor(0, warmup_tensor);
    if (is_paired_input) {
        // Set to an empty tensor to avoid errors.
        // The subgraph within the ov::Model will handle this scenario, ensuring the output remains correct.
        req.set_input_tensor(1, ov::Tensor{ov::element::string, {0}});
    }
    req.set_callback([queue = m_ireq_queue_tokenizer.get(), idx, warmup_text](std::exception_ptr) {
        // this empty placeholder keeps input data alive until request is finished
        (void) warmup_text;
        queue->return_to(idx);
    });
    req.start_async();

Proposed:

    bool return_slot_on_failure = true;
    try {
        // TODO CVS-150630: Empty strings sporadically can fail, therefore use nonempty string for warmup.
        // shared_ptr to keep input data alive until async request is finished
        auto warmup_text = std::make_shared<std::string>("non empty string");
        auto warmup_tensor = ov::Tensor(ov::element::string, ov::Shape{1}, warmup_text.get());
        req.set_input_tensor(0, warmup_tensor);
        if (is_paired_input) {
            // Set to an empty tensor to avoid errors.
            // The subgraph within the ov::Model will handle this scenario, ensuring the output remains correct.
            req.set_input_tensor(1, ov::Tensor{ov::element::string, {0}});
        }
        req.set_callback([queue = m_ireq_queue_tokenizer.get(), idx, warmup_text](std::exception_ptr) {
            // this empty placeholder keeps input data alive until request is finished
            (void) warmup_text;
            queue->return_to(idx);
        });
        req.start_async();
        // After successful start_async, the callback is responsible for returning the slot.
        return_slot_on_failure = false;
    } catch (...) {
        if (return_slot_on_failure) {
            m_ireq_queue_tokenizer->return_to(idx);
        }
        throw;
    }
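The same fix can also be expressed as the RAII scope guard the comment mentions. A minimal sketch (`ScopeGuard` is an illustrative name, not an existing GenAI type):

```cpp
// Minimal RAII scope guard: runs the cleanup on every exit path (including
// exceptions) unless dismiss() is called after the hand-off succeeds.
#include <functional>
#include <utility>

class ScopeGuard {
public:
    explicit ScopeGuard(std::function<void()> on_exit)
        : m_on_exit(std::move(on_exit)) {}
    ~ScopeGuard() {
        if (m_armed && m_on_exit) m_on_exit();
    }
    // Call after start_async() succeeds: from then on the async callback
    // owns the slot and the guard must not return it again.
    void dismiss() { m_armed = false; }
    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;
private:
    std::function<void()> m_on_exit;
    bool m_armed = true;
};
```

In the warmup path this would read roughly: construct `ScopeGuard guard([&] { m_ireq_queue_tokenizer->return_to(idx); });` before the `set_input_tensor` calls, then call `guard.dismiss();` right after `req.start_async();` returns.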

Comment on lines +443 to +448
req.set_callback([queue = m_ireq_queue_tokenizer.get(), idx, warmup_text](std::exception_ptr) {
    // this empty placeholder keeps input data alive until request is finished
    (void) warmup_text;
    queue->return_to(idx);
});

Copilot AI Mar 12, 2026


The async warmup callback captures m_ireq_queue_tokenizer.get() as a raw pointer and calls queue->return_to(idx). If TokenizerImpl (and its m_ireq_queue_tokenizer) is destroyed before the warmup completes, the callback will dereference a dangling pointer (UAF). Consider ensuring warmup completion before destroying the queue (e.g., track a future/promise and wait in teardown) or avoid referencing the queue from the callback by using a dedicated warmup request with well-defined lifetime management.
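One way to rule out the dangling pointer, sketched with toy types (`SlotQueue` and `make_callback` are illustrative, not the actual `CircularBufferQueue` API): have the callback co-own the queue through a `shared_ptr` instead of holding a raw pointer.

```cpp
// Hypothetical sketch: sharing ownership of the queue with the completion
// callback so it stays valid even if the original owner is destroyed first.
#include <functional>
#include <memory>

struct SlotQueue {
    int available = 0;
    void return_to(int /*idx*/) { ++available; }
};

// Build a completion callback that co-owns the queue instead of capturing a
// raw pointer to it; the queue is freed only after the last holder drops it.
std::function<void()> make_callback(std::shared_ptr<SlotQueue> queue, int idx) {
    return [queue, idx] { queue->return_to(idx); };
}
```

This trades a small ownership change for eliminating the teardown race; the alternative the comment mentions (waiting for warmup completion in the destructor) keeps the raw-pointer capture but serializes teardown on the in-flight request.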

Copilot AI review requested due to automatic review settings March 13, 2026 10:07

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Copilot AI review requested due to automatic review settings March 13, 2026 13:20

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.

Comment on lines +428 to +450
int idx = m_ireq_queue_tokenizer->get_idle().get();
auto& req = m_ireq_queue_tokenizer->get(idx);

// TODO CVS-150630: Empty strings sporadically can fail, therefore use nonempty string for warmup.
// shared_ptr to keep input data alive until async request is finished
auto warmup_text = std::make_shared<std::string>("non empty string");
auto warmup_tensor = ov::Tensor(ov::element::string, ov::Shape{1}, warmup_text.get());

req.set_input_tensor(0, warmup_tensor);
if (is_paired_input) {
    // Set to an empty tensor to avoid errors.
    // The subgraph within the ov::Model will handle this scenario, ensuring the output remains correct.
    req.set_input_tensor(1, ov::Tensor{ov::element::string, {0}});
}

req.set_callback([queue = m_ireq_queue_tokenizer.get(), idx, warmup_text, &req](std::exception_ptr) {
    // this empty placeholder keeps input data alive until request is finished
    (void) warmup_text;
    queue->return_to(idx);
    req.set_callback({});
});
req.start_async();
Comment on lines +477 to +494
int idx = m_ireq_queue_detokenizer->get_idle().get();
auto& req = m_ireq_queue_detokenizer->get(idx);

// shared_ptr to keep input data alive until async request is finished
auto warmup_tokens = std::make_shared<std::vector<int64_t>>(
    std::initializer_list<int64_t>{1, 33, 199, 42, 42}
);

auto warmup_tensor = ov::Tensor(ov::element::i64, ov::Shape{1, warmup_tokens->size()}, warmup_tokens->data());
req.set_input_tensor(0, warmup_tensor);

req.set_callback([queue = m_ireq_queue_detokenizer.get(), idx, warmup_tokens, &req](std::exception_ptr) {
    // this empty placeholder keeps input data alive until request is finished
    (void) warmup_tokens;
    queue->return_to(idx);
    req.set_callback({});
});
req.start_async();
@pavel-esir pavel-esir added this pull request to the merge queue Mar 13, 2026
Merged via the queue into openvinotoolkit:master with commit 4e0670f Mar 13, 2026
177 of 183 checks passed

Labels

category: tokenizers (Tokenizer class or submodule update), Code Freeze


5 participants