Description
I’m seeing intermittent but sometimes very large time-to-first-token (TTFT) latency when using callback streaming. This looks like cold initialization and/or GPU warmup (and possibly internal queueing) happening before the first streamed token is delivered.
Without a readiness/phase signal, it’s hard to distinguish “warming up / queued” from “stalled,” and reasonable first-token watchdogs can trigger unnecessarily.
Summary
With callback streaming (Conversation.sendMessageAsync(text, MessageCallback)), the first onMessage(...) can be delayed far longer than steady-state token streaming.
This is most visible:
- on cold start (fresh process / newly loaded model / first GPU use), and
- in tight back-to-back runs (e.g., Step1 → Step2) where the next request starts immediately.
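
For concreteness, TTFT here is measured as the gap between calling `sendMessageAsync(...)` and the first `onMessage(...)`. A minimal sketch (the `MessageCallback` shape beyond `onMessage(...)` is simplified/assumed; only the names `Conversation`, `sendMessageAsync`, and `onMessage` are taken from this report):

```kotlin
// Sketch: measure time-to-first-token around callback streaming.
// NOTE: the exact MessageCallback interface is an assumption here;
// only onMessage(...) is referenced in this issue.
class TtftProbe(private val conversation: Conversation) {
    fun send(text: String, onTtft: (ttftMs: Long) -> Unit) {
        val startNs = System.nanoTime()
        var firstTokenSeen = false
        conversation.sendMessageAsync(text, object : MessageCallback {
            override fun onMessage(chunk: String) {
                if (!firstTokenSeen) {
                    firstTokenSeen = true
                    // Elapsed time until the very first streamed chunk.
                    onTtft((System.nanoTime() - startNs) / 1_000_000)
                }
            }
        })
    }
}
```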
What I observe (intermittent)
- The first `onMessage(...)` can arrive much later than expected.
- Latency is inconsistent:
  - fast in warm/steady state,
  - sometimes very slow in cold state.
- In back-to-back runs, the next request can hit first-token timeouts more often.
- During the pre-first-token gap, there is no explicit signal indicating whether the engine is:
  - initializing/warming up,
  - queued behind other work, or
  - stalled.
Expected behavior / docs that would help
- Document streaming phases (init/warmup vs active generation) and expected TTFT behavior.
- Ideally provide a readiness/warmed-up signal or recommended probe.
- Guidance on timeout strategy (init vs first token) and session reuse/pre-warming best practices.
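
Absent an official readiness signal, one client-side approximation is a throwaway warm-up request after session creation. A sketch only, under the assumption that the first streamed token implies model load and GPU warmup have completed (the probe prompt and callback shape are illustrative, not a documented API):

```kotlin
// Hypothetical pre-warm: send a tiny prompt once after session creation so
// model loading / GPU warmup happens before the user's first real request.
// Caveat: this consumes a turn of conversation context, so it may be better
// done on a dedicated warm-up session if the API allows one.
fun prewarm(conversation: Conversation, onReady: () -> Unit) {
    var notified = false
    conversation.sendMessageAsync("Hi", object : MessageCallback {
        override fun onMessage(chunk: String) {
            if (!notified) {
                notified = true
                onReady() // first token of the probe: engine is presumably warm
            }
        }
    })
}
```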
Questions
- Is there a defined separation between initialization/warmup and generation phases for streaming?
- Is there any readiness/warmed-up signal (or recommended probe) that indicates “first token should arrive promptly”?
- Typical contributors to TTFT (CPU vs GPU, model loading, cache building, internal queueing)?
- Best practices for session reuse / pre-warming to avoid repeated warmup cost?
Temporary client-side mitigation
- Separate watchdogs:
  - `INIT_TIMEOUT_MS` for model/session creation + warmup
  - `FIRST_TOKEN_TIMEOUT_MS` for TTFT after starting a streaming run
- On init timeout/failure, prefer recovery (reset/recreate) rather than hanging UI/state
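
The two-watchdog idea can be sketched with kotlinx coroutines as follows. The timeout values and the `firstToken` signal are illustrative assumptions, not part of the LiteRT-LM API:

```kotlin
import kotlinx.coroutines.*

// Example values only; tune per device/model.
const val INIT_TIMEOUT_MS = 30_000L        // model/session creation + warmup
const val FIRST_TOKEN_TIMEOUT_MS = 15_000L // TTFT after a streaming run starts

// Watchdog 1: bound initialization; on timeout, surface an error so the
// caller can reset/recreate instead of hanging UI/state.
suspend fun <T> withInitWatchdog(init: suspend () -> T): T =
    withTimeoutOrNull(INIT_TIMEOUT_MS) { init() }
        ?: throw IllegalStateException("init timed out; recreate the session")

// Watchdog 2: complete `firstToken` from the first onMessage(...); if it
// does not complete in time, run the recovery path.
fun startFirstTokenWatchdog(
    scope: CoroutineScope,
    firstToken: CompletableDeferred<Unit>,
    onTimeout: () -> Unit,
): Job = scope.launch {
    val arrived = withTimeoutOrNull(FIRST_TOKEN_TIMEOUT_MS) { firstToken.await() }
    if (arrived == null) onTimeout() // e.g., cancel the run and recreate
}
```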
Environment
- LiteRT‑LM: `com.google.ai.edge.litertlm:litertlm-android:0.9.0-alpha04`
  - Version correlation: observed more frequently with `com.google.ai.edge.litertlm:litertlm-android:0.8.0`.
- Android / device: Google Pixel 9a (API 36)
- ABI: `arm64-v8a`
- Model / backend / runtime config:
  - Model: `google/gemma-3n-E4B-it-int4.litertlm`
  - Source: https://huggingface.co/google/gemma-3n-E4B-it-litert-lm/resolve/main/gemma-3n-E4B-it-int4.litertlm
  - Backend: GPU (`slm.accelerator=GPU`)
  - Runtime: `max_tokens=4096, top_k=64, top_p=0.95, temperature=1.0`
  - Turn formatting:
    - `user_turn_prefix: "<start_of_turn>user"`
    - `model_turn_prefix: "<start_of_turn>model"`
    - `turn_end: "<end_of_turn>"`
- Build environment: AGP 9.0.0, Kotlin 2.3.10 (Compose BOM 2026.02.00)