
Streaming time-to-first-token can be very large on cold start / GPU warmup; readiness/phase contract unclear (first-token timeouts) #452

@ishizuki-tech

Description


I’m seeing intermittent but sometimes very large time-to-first-token (TTFT) latency when using callback streaming. This looks like cold initialization and/or GPU warmup (and possibly internal queueing) happening before the first streamed token is delivered.

Without a readiness/phase signal, it’s hard to distinguish “warming up / queued” from “stalled,” and reasonable first-token watchdogs can trigger unnecessarily.

Summary

With callback streaming (Conversation.sendMessageAsync(text, MessageCallback)), the first onMessage(...) can be delayed far beyond what steady-state token-to-token latency would suggest.

This is most visible:

  • on cold start (fresh process / newly loaded model / first GPU use), and
  • in tight back-to-back runs (e.g., Step1 → Step2) where the next request starts immediately.
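To make the pre-first-token gap measurable on the client side, here is a minimal TTFT probe. The MessageCallback shape below is an assumption inferred from the onMessage(...) name in this report; the real LiteRT-LM interface may differ, so treat this as a sketch rather than working integration code.

```kotlin
import java.util.concurrent.CountDownLatch
import java.util.concurrent.TimeUnit

// Hypothetical callback shape, inferred from onMessage(...) above;
// the real LiteRT-LM MessageCallback interface may differ.
interface MessageCallback {
    fun onMessage(chunk: String)
    fun onDone()
}

// Measures time-to-first-token for a single streaming run.
class TtftProbe {
    @Volatile var ttftMs: Long = -1
        private set
    private val firstToken = CountDownLatch(1)
    private var startNs: Long = 0

    // Call immediately before starting the streaming request.
    fun start() { startNs = System.nanoTime() }

    val callback = object : MessageCallback {
        override fun onMessage(chunk: String) {
            // Record only the first chunk; later chunks are steady-state.
            if (firstToken.count > 0) {
                ttftMs = (System.nanoTime() - startNs) / 1_000_000
                firstToken.countDown()
            }
        }
        override fun onDone() {}
    }

    // Returns true if the first token arrived within timeoutMs.
    fun awaitFirstToken(timeoutMs: Long): Boolean =
        firstToken.await(timeoutMs, TimeUnit.MILLISECONDS)
}
```

In a real integration the probe's callback would wrap (or delegate to) the callback passed to conversation.sendMessageAsync(text, ...), with start() called just before the send.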

What I observe (intermittent)

  • The first onMessage(...) can arrive much later than expected.

  • Latency is inconsistent:

    • fast in warm/steady state,
    • sometimes very slow in cold state.
  • In back-to-back runs, the next request can hit first-token timeouts more often.

  • During the pre-first-token gap, there is no explicit signal indicating whether the engine is:

    • initializing/warming up,
    • queued behind other work, or
    • stalled.

Expected behavior / docs that would help

  • Document streaming phases (init/warmup vs active generation) and expected TTFT behavior.
  • Ideally provide a readiness/warmed-up signal or recommended probe.
  • Guidance on timeout strategy (init vs first token) and session reuse/pre-warming best practices.
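As an illustration of the readiness/phase contract requested above, a minimal observable phase enum plus change notification might look like the following. All names here are proposals, not existing LiteRT-LM API.

```kotlin
// Proposed phases a client could observe instead of inferring state
// from callback silence. Not an existing LiteRT-LM type.
enum class EnginePhase { INITIALIZING, WARMING_UP, QUEUED, GENERATING, IDLE }

// Sketch of a phase tracker the engine (or a client-side wrapper
// around it) could expose.
class PhaseTracker {
    @Volatile var phase: EnginePhase = EnginePhase.INITIALIZING
        private set
    private val listeners = mutableListOf<(EnginePhase) -> Unit>()

    @Synchronized fun onPhaseChange(listener: (EnginePhase) -> Unit) {
        listeners += listener
    }

    @Synchronized fun transition(next: EnginePhase) {
        phase = next
        listeners.forEach { it(next) }
    }
}
```

With such a signal, a watchdog could distinguish "WARMING_UP for 8 s" (keep waiting) from "GENERATING with no token for 8 s" (likely stalled), and IDLE after warmup would mean "first token should arrive promptly".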

Questions

  1. Is there a defined separation between initialization/warmup and generation phases for streaming?
  2. Is there any readiness/warmed-up signal (or recommended probe) that indicates “first token should arrive promptly”?
  3. What are the typical contributors to TTFT (CPU vs GPU, model loading, cache building, internal queueing)?
  4. What are the best practices for session reuse / pre-warming to avoid repeated warmup cost?

Temporary client-side mitigation

  • Separate watchdogs:

    • INIT_TIMEOUT_MS for model/session creation + warmup
    • FIRST_TOKEN_TIMEOUT_MS for TTFT after starting a streaming run
  • On init timeout/failure, prefer recovery (reset/recreate) rather than hanging UI/state
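The two-watchdog scheme above can be sketched as follows. The timeout values and recovery strings are illustrative assumptions, not LiteRT-LM API; only the split between an init deadline and a first-token deadline comes from the mitigation described here.

```kotlin
import java.util.concurrent.CountDownLatch
import java.util.concurrent.TimeUnit

// Illustrative values; tune per device/model.
const val INIT_TIMEOUT_MS = 30_000L        // model/session creation + warmup
const val FIRST_TOKEN_TIMEOUT_MS = 15_000L // TTFT after starting a streaming run

// Separate watchdogs for the init phase and the first-token phase,
// so a slow warmup is not misreported as a stalled generation.
class StreamingWatchdog {
    private val initDone = CountDownLatch(1)
    private val firstToken = CountDownLatch(1)

    fun markInitDone() = initDone.countDown()
    fun markFirstToken() = firstToken.countDown()

    // Returns null on success, or a description of which phase timed
    // out, so the caller can reset/recreate instead of hanging.
    fun await(): String? {
        if (!initDone.await(INIT_TIMEOUT_MS, TimeUnit.MILLISECONDS))
            return "init timeout: recreate engine/session"
        if (!firstToken.await(FIRST_TOKEN_TIMEOUT_MS, TimeUnit.MILLISECONDS))
            return "first-token timeout: cancel run and retry"
        return null
    }
}
```

In a real integration, markInitDone() would be called once engine/session creation returns and markFirstToken() from the first onMessage(...); on a non-null result the caller resets or recreates the session rather than leaving the UI hanging.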

Environment

  • LiteRT‑LM: com.google.ai.edge.litertlm:litertlm-android:0.9.0-alpha04

    • Version correlation: the issue was observed more frequently with com.google.ai.edge.litertlm:litertlm-android:0.8.0.
  • Android / device: Google Pixel 9a (API 36)

  • ABI: `arm64-v8a`

  • Model / backend / runtime config:

    • Model: google/gemma-3n-E4B-it-int4.litertlm

    • Source: https://huggingface.co/google/gemma-3n-E4B-it-litert-lm/resolve/main/gemma-3n-E4B-it-int4.litertlm

    • Backend: GPU (slm.accelerator=GPU)

    • Runtime: max_tokens=4096, top_k=64, top_p=0.95, temperature=1.0

    • Turn formatting:

      • user_turn_prefix: "<start_of_turn>user"
      • model_turn_prefix: "<start_of_turn>model"
      • turn_end: "<end_of_turn>"
  • Build environment: AGP 9.0.0, Kotlin 2.3.10 (Compose BOM 2026.02.00)
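For reference, a prompt built from the turn tokens listed above would look roughly like this. The exact newline placement is an assumption; only the three token strings come from the configuration in this report.

```kotlin
// Composes one user turn plus the model-turn opener from the
// configured Gemma turn tokens. Whitespace layout is assumed.
fun formatTurn(userText: String): String {
    val userPrefix = "<start_of_turn>user"
    val modelPrefix = "<start_of_turn>model"
    val turnEnd = "<end_of_turn>"
    return "$userPrefix\n$userText$turnEnd\n$modelPrefix\n"
}
```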

Metadata

Labels

  • area:hw-accel (Leveraging specific hardware for performance)
  • area:perf-issues (Bugs resulting in slow performance or high resource usage)
  • type:feature (Request for new functionality or enhancement)
