
Streaming time-to-first-token can be very large on cold start / GPU warmup; readiness/phase contract unclear (first-token timeouts) #452

@ishizuki-tech

Description


I’m seeing intermittent but sometimes very large time-to-first-token (TTFT) latency when using callback streaming. This looks like cold initialization and/or GPU warmup (and possibly internal queueing) happening before the first streamed token is delivered.

Without a readiness/phase signal, it’s hard to distinguish “warming up / queued” from “stalled,” and reasonable first-token watchdogs can trigger unnecessarily.

Summary

With callback streaming (Conversation.sendMessageAsync(text, MessageCallback)), the first onMessage(...) can be delayed far beyond what steady-state token-to-token latency would suggest.

This is most visible:

  • on cold start (fresh process / newly loaded model / first GPU use), and
  • in tight back-to-back runs (e.g., Step1 → Step2) where the next request starts immediately.
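To make the pre-first-token gap measurable on the client side, here is a minimal TTFT probe. The MessageCallback shape below is an assumption inferred from the onMessage(...) name in this report; the real LiteRT-LM interface may differ, so treat this as a sketch rather than working integration code.

```kotlin
import java.util.concurrent.CountDownLatch
import java.util.concurrent.TimeUnit

// Hypothetical callback shape, inferred from onMessage(...) above;
// the real LiteRT-LM MessageCallback interface may differ.
interface MessageCallback {
    fun onMessage(chunk: String)
    fun onDone()
}

// Measures time-to-first-token for a single streaming run.
class TtftProbe {
    @Volatile var ttftMs: Long = -1
        private set
    private val firstToken = CountDownLatch(1)
    private var startNs: Long = 0

    // Call immediately before starting the streaming request.
    fun start() { startNs = System.nanoTime() }

    val callback = object : MessageCallback {
        override fun onMessage(chunk: String) {
            // Record only the first chunk; later chunks are steady-state.
            if (firstToken.count > 0) {
                ttftMs = (System.nanoTime() - startNs) / 1_000_000
                firstToken.countDown()
            }
        }
        override fun onDone() {}
    }

    // Returns true if the first token arrived within timeoutMs.
    fun awaitFirstToken(timeoutMs: Long): Boolean =
        firstToken.await(timeoutMs, TimeUnit.MILLISECONDS)
}
```

In a real integration the probe's callback would wrap (or delegate to) the callback passed to conversation.sendMessageAsync(text, ...), with start() called just before the send.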

What I observe (intermittent)

  • The first onMessage(...) can arrive much later than expected.

  • Latency is inconsistent:

    • fast in warm/steady state,
    • sometimes very slow in cold state.
  • In back-to-back runs, the next request can hit first-token timeouts more often.

  • During the pre-first-token gap, there is no explicit signal indicating whether the engine is:

    • initializing/warming up,
    • queued behind other work, or
    • stalled.

Expected behavior / docs that would help

  • Document streaming phases (init/warmup vs active generation) and expected TTFT behavior.
  • Ideally provide a readiness/warmed-up signal or recommended probe.
  • Guidance on timeout strategy (init vs first token) and session reuse/pre-warming best practices.
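As an illustration of the readiness/phase contract requested above, a minimal observable phase enum plus change notification might look like the following. All names here are proposals, not existing LiteRT-LM API.

```kotlin
// Proposed phases a client could observe instead of inferring state
// from callback silence. Not an existing LiteRT-LM type.
enum class EnginePhase { INITIALIZING, WARMING_UP, QUEUED, GENERATING, IDLE }

// Sketch of a phase tracker the engine (or a client-side wrapper
// around it) could expose.
class PhaseTracker {
    @Volatile var phase: EnginePhase = EnginePhase.INITIALIZING
        private set
    private val listeners = mutableListOf<(EnginePhase) -> Unit>()

    @Synchronized fun onPhaseChange(listener: (EnginePhase) -> Unit) {
        listeners += listener
    }

    @Synchronized fun transition(next: EnginePhase) {
        phase = next
        listeners.forEach { it(next) }
    }
}
```

With such a signal, a watchdog could distinguish "WARMING_UP for 8 s" (keep waiting) from "GENERATING with no token for 8 s" (likely stalled), and IDLE after warmup would mean "first token should arrive promptly".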

Questions

  1. Is there a defined separation between initialization/warmup and generation phases for streaming?
  2. Is there any readiness/warmed-up signal (or recommended probe) that indicates “first token should arrive promptly”?
  3. What are the typical contributors to TTFT (CPU vs GPU, model loading, cache building, internal queueing)?
  4. What are the best practices for session reuse / pre-warming to avoid repeated warmup cost?

Temporary client-side mitigation

  • Separate watchdogs:

    • INIT_TIMEOUT_MS for model/session creation + warmup
    • FIRST_TOKEN_TIMEOUT_MS for TTFT after starting a streaming run
  • On init timeout/failure, prefer recovery (reset/recreate) rather than hanging UI/state
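The two-watchdog scheme above can be sketched as follows. The timeout values and recovery strings are illustrative assumptions, not LiteRT-LM API; only the split between an init deadline and a first-token deadline comes from the mitigation described here.

```kotlin
import java.util.concurrent.CountDownLatch
import java.util.concurrent.TimeUnit

// Illustrative values; tune per device/model.
const val INIT_TIMEOUT_MS = 30_000L        // model/session creation + warmup
const val FIRST_TOKEN_TIMEOUT_MS = 15_000L // TTFT after starting a streaming run

// Separate watchdogs for the init phase and the first-token phase,
// so a slow warmup is not misreported as a stalled generation.
class StreamingWatchdog {
    private val initDone = CountDownLatch(1)
    private val firstToken = CountDownLatch(1)

    fun markInitDone() = initDone.countDown()
    fun markFirstToken() = firstToken.countDown()

    // Returns null on success, or a description of which phase timed
    // out, so the caller can reset/recreate instead of hanging.
    fun await(): String? {
        if (!initDone.await(INIT_TIMEOUT_MS, TimeUnit.MILLISECONDS))
            return "init timeout: recreate engine/session"
        if (!firstToken.await(FIRST_TOKEN_TIMEOUT_MS, TimeUnit.MILLISECONDS))
            return "first-token timeout: cancel run and retry"
        return null
    }
}
```

In a real integration, markInitDone() would be called once engine/session creation returns and markFirstToken() from the first onMessage(...); on a non-null result the caller resets or recreates the session rather than leaving the UI hanging.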

Environment

  • LiteRT‑LM: com.google.ai.edge.litertlm:litertlm-android:0.9.0-alpha04

    • Version correlation: the issue was observed more frequently with com.google.ai.edge.litertlm:litertlm-android:0.8.0.
  • Android / device: Google Pixel 9a (API 36)

  • ABI: `arm64-v8a`

  • Model / backend / runtime config:

    • Model: google/gemma-3n-E4B-it-int4.litertlm

    • Source: https://huggingface.co/google/gemma-3n-E4B-it-litert-lm/resolve/main/gemma-3n-E4B-it-int4.litertlm

    • Backend: GPU (slm.accelerator=GPU)

    • Runtime: max_tokens=4096, top_k=64, top_p=0.95, temperature=1.0

    • Turn formatting:

      • user_turn_prefix: "<start_of_turn>user"
      • model_turn_prefix: "<start_of_turn>model"
      • turn_end: "<end_of_turn>"
  • Build environment: AGP 9.0.0, Kotlin 2.3.10 (Compose BOM 2026.02.00)
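For reference, a prompt built from the turn tokens listed above would look roughly like this. The exact newline placement is an assumption; only the three token strings come from the configuration in this report.

```kotlin
// Composes one user turn plus the model-turn opener from the
// configured Gemma turn tokens. Whitespace layout is assumed.
fun formatTurn(userText: String): String {
    val userPrefix = "<start_of_turn>user"
    val modelPrefix = "<start_of_turn>model"
    val turnEnd = "<end_of_turn>"
    return "$userPrefix\n$userText$turnEnd\n$modelPrefix\n"
}
```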

Metadata

Labels

  • area:hw-accel (Leveraging specific hardware for performance)
  • area:perf-issues (Bugs resulting in slow performance or high resource usage)
  • type:feature (Request for new functionality or enhancement)
