run tokenizer warmup in async mode #3471
Conversation
Pull request overview
This PR updates the tokenizer initialization warmup to run inference asynchronously, aiming to reduce startup blocking and improve TTFT by warming tokenizer caches without waiting for completion.
Changes:
- Replaced the synchronous warmup (`encode("non empty string")`) with an async `InferRequest::start_async()` warmup during `TokenizerImpl::setup_tokenizer()`.
- Added a scoped warmup block that prepares an input tensor and starts async inference.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
```cpp
// TODO CVS-150630: Empty strings sporadically can fail, therefore use nonempty string for warmup.
// shared_ptr to keep input data alive until async request is finished
auto warmup_text = std::make_shared<std::string>("non empty string");
auto warmup_tensor = ov::Tensor(ov::element::string, ov::Shape{1}, warmup_text.get());

req.set_input_tensor(0, warmup_tensor);
if (is_paired_input) {
    // Set to an empty tensor to avoid errors.
    // The subgraph within the ov::Model will handle this scenario, ensuring the output remains correct.
    req.set_input_tensor(1, ov::Tensor{ov::element::string, {0}});
}

req.set_callback([queue = m_ireq_queue_tokenizer.get(), idx, warmup_text](std::exception_ptr) {
    // this empty placeholder keeps input data alive until request is finished
    (void) warmup_text;
    queue->return_to(idx);
});
req.start_async();
```
idx is reserved from m_ireq_queue_tokenizer and only returned in the async callback. If any of set_input_tensor, set_callback, or start_async() throws, the slot will never be returned to the pool, permanently shrinking the queue and potentially deadlocking future get_idle() calls. Add exception-safety (e.g., a scope guard/try-catch that calls return_to(idx) on failure before rethrowing).
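The scope-guard variant of this fix can be sketched with the standard library alone. The `SlotPool` below is a hypothetical stand-in for the real request queue (not the actual `CircularBufferQueue` type); the point is only that the guard returns the slot on any exception thrown before `start_async()`, and is dismissed once the async callback takes over ownership:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <vector>

// Minimal scope guard: runs the stored callable on scope exit unless dismissed.
struct ScopeGuard {
    std::function<void()> on_exit;
    bool active = true;
    ~ScopeGuard() { if (active && on_exit) on_exit(); }
    void dismiss() { active = false; }
};

// Toy stand-in for the infer-request slot pool (hypothetical name and shape).
struct SlotPool {
    std::vector<int> idle{0, 1, 2};
    int take() { int idx = idle.back(); idle.pop_back(); return idx; }
    void return_to(int idx) { idle.push_back(idx); }
};

// Reserves a slot and simulates the warmup setup; if `setup_throws`, the
// guard returns the slot before the exception propagates, so the pool
// never shrinks. Returns the number of idle slots at the end.
std::size_t warmup_with_guard(SlotPool& pool, bool setup_throws) {
    int idx = pool.take();
    ScopeGuard guard{[&pool, idx] { pool.return_to(idx); }};
    if (setup_throws)
        throw std::runtime_error("set_input_tensor failed");
    // start_async() succeeded: from here the async callback owns the slot.
    guard.dismiss();
    pool.return_to(idx);  // in the real code this happens in the callback
    return pool.idle.size();
}
```

Either the guard or the explicit try/catch works; the guard avoids duplicating the `return_to` call for each early-exit path.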
Suggested change:

```cpp
bool return_slot_on_failure = true;
try {
    // TODO CVS-150630: Empty strings sporadically can fail, therefore use nonempty string for warmup.
    // shared_ptr to keep input data alive until async request is finished
    auto warmup_text = std::make_shared<std::string>("non empty string");
    auto warmup_tensor = ov::Tensor(ov::element::string, ov::Shape{1}, warmup_text.get());
    req.set_input_tensor(0, warmup_tensor);
    if (is_paired_input) {
        // Set to an empty tensor to avoid errors.
        // The subgraph within the ov::Model will handle this scenario, ensuring the output remains correct.
        req.set_input_tensor(1, ov::Tensor{ov::element::string, {0}});
    }
    req.set_callback([queue = m_ireq_queue_tokenizer.get(), idx, warmup_text](std::exception_ptr) {
        // this empty placeholder keeps input data alive until request is finished
        (void) warmup_text;
        queue->return_to(idx);
    });
    req.start_async();
    // After successful start_async, the callback is responsible for returning the slot.
    return_slot_on_failure = false;
} catch (...) {
    if (return_slot_on_failure) {
        m_ireq_queue_tokenizer->return_to(idx);
    }
    throw;
}
```
```cpp
req.set_callback([queue = m_ireq_queue_tokenizer.get(), idx, warmup_text](std::exception_ptr) {
    // this empty placeholder keeps input data alive until request is finished
    (void) warmup_text;
    queue->return_to(idx);
});
```
The async warmup callback captures m_ireq_queue_tokenizer.get() as a raw pointer and calls queue->return_to(idx). If TokenizerImpl (and its m_ireq_queue_tokenizer) is destroyed before the warmup completes, the callback will dereference a dangling pointer (UAF). Consider ensuring warmup completion before destroying the queue (e.g., track a future/promise and wait in teardown) or avoid referencing the queue from the callback by using a dedicated warmup request with well-defined lifetime management.
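The "wait in teardown" option can be sketched with `std::promise`/`std::future` and no OpenVINO types. `WarmupOwner` below is a hypothetical stand-in for `TokenizerImpl`: its destructor blocks until the warmup callback has run, so the callback can never touch destroyed members:

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <future>
#include <memory>
#include <thread>

// Toy owner modelling TokenizerImpl: launches an async warmup and, in its
// destructor, waits for the warmup callback before any member is destroyed.
// (Sketch only; the real class would wrap ov::InferRequest and the queue.)
class WarmupOwner {
public:
    explicit WarmupOwner(std::shared_ptr<std::atomic<bool>> callback_ran)
        : callback_ran_(std::move(callback_ran)) {
        warmup_done_ = warmup_promise_.get_future();
        // Simulates start_async(): the "callback" fulfils the promise last.
        worker_ = std::thread([this] {
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
            callback_ran_->store(true);   // stands in for queue->return_to(idx)
            warmup_promise_.set_value();
        });
    }
    ~WarmupOwner() {
        warmup_done_.wait();  // ensure the callback finished before teardown
        worker_.join();
    }
private:
    std::shared_ptr<std::atomic<bool>> callback_ran_;
    std::promise<void> warmup_promise_;
    std::future<void> warmup_done_;
    std::thread worker_;
};

// Returns true iff the callback completed before the owner finished tearing down.
bool teardown_waited_for_warmup() {
    auto flag = std::make_shared<std::atomic<bool>>(false);
    { WarmupOwner owner(flag); }  // destructor blocks until warmup completes
    return flag->load();
}
```

`set_value()` synchronizes with `wait()`, so the store to `callback_ran_` is guaranteed visible before the destructor proceeds.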
```cpp
m_bos_token = decode(std::vector{m_bos_token_id}, {ov::genai::skip_special_tokens(false)});
if (m_eos_token_id != -1 && m_eos_token.empty())
    m_eos_token = decode(std::vector{m_eos_token_id}, {ov::genai::skip_special_tokens(false)});
```
Force-pushed from `41f686e` to `9117c1e`.
```cpp
int idx = m_ireq_queue_tokenizer->get_idle().get();
auto& req = m_ireq_queue_tokenizer->get(idx);

// TODO CVS-150630: Empty strings sporadically can fail, therefore use nonempty string for warmup.
// shared_ptr to keep input data alive until async request is finished
auto warmup_text = std::make_shared<std::string>("non empty string");
auto warmup_tensor = ov::Tensor(ov::element::string, ov::Shape{1}, warmup_text.get());

req.set_input_tensor(0, warmup_tensor);
if (is_paired_input) {
    // Set to an empty tensor to avoid errors.
    // The subgraph within the ov::Model will handle this scenario, ensuring the output remains correct.
    req.set_input_tensor(1, ov::Tensor{ov::element::string, {0}});
}

req.set_callback([queue = m_ireq_queue_tokenizer.get(), idx, warmup_text, &req](std::exception_ptr) {
    // this empty placeholder keeps input data alive until request is finished
    (void) warmup_text;
    queue->return_to(idx);
    req.set_callback({});
});
req.start_async();
```

Note: `&req` must be in the capture list for the `req.set_callback({})` call inside the lambda to compile, matching the detokenizer variant below.
```cpp
int idx = m_ireq_queue_detokenizer->get_idle().get();
auto& req = m_ireq_queue_detokenizer->get(idx);

// shared_ptr to keep input data alive until async request is finished
auto warmup_tokens = std::make_shared<std::vector<int64_t>>(
    std::initializer_list<int64_t>{1, 33, 199, 42, 42}
);

auto warmup_tensor = ov::Tensor(ov::element::i64, ov::Shape{1, warmup_tokens->size()}, warmup_tokens->data());
req.set_input_tensor(0, warmup_tensor);

req.set_callback([queue = m_ireq_queue_detokenizer.get(), idx, warmup_tokens, &req](std::exception_ptr) {
    // this empty placeholder keeps input data alive until request is finished
    (void) warmup_tokens;
    queue->return_to(idx);
    req.set_callback({});
});
req.start_async();
```
Force-pushed from `9117c1e` to `a0c0937`.

Commit `4e0670f`
Description
Run the warmup inference in async mode, so that tokenizer setup is not blocked until the inference finishes. This is done to improve TTFT.

Tickets: CVS-180365, CVS-180801
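The TTFT gain comes purely from not blocking setup on the warmup. A stdlib-only sketch of the before/after behaviour (the `warmup_inference` stand-in and the 100 ms cost are assumptions, not measurements from this PR):

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

using Clock = std::chrono::steady_clock;

// Stand-in for one warmup inference (e.g. encoding "non empty string").
void warmup_inference() {
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
}

// Old behaviour: setup blocks until the warmup finishes.
std::future<void> setup_sync() {
    warmup_inference();
    std::promise<void> p;
    p.set_value();
    return p.get_future();  // already completed
}

// New behaviour: setup fires the warmup and returns immediately;
// the returned future completes when the warmup does.
std::future<void> setup_async() {
    return std::async(std::launch::async, warmup_inference);
}
```

In the real PR the completion signal is the request callback rather than a `std::future`, but the startup-latency trade is the same.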
Checklist: