[BidiGenerateContent] Model ignores realtimeInput.text injection — receives text but produces no audio response #1226

@Hprg

Environment

  • Model: gemini-3.1-flash-live-preview
  • Voice: Leda
  • Transport: Raw WebSocket v1beta wss://generativelanguage.googleapis.com/ws/.../BidiGenerateContent
  • Audio pipeline: Twilio MediaStreams (mulaw 8kHz) → PCM16 16kHz → Gemini → PCM16 24kHz → mulaw 8kHz
  • Platform: Node.js (ws library, raw WebSocket, no SDK)
  • Scale: ~600 production phone calls over the past 14 days
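For readers unfamiliar with the inbound leg of the audio pipeline listed above, here is a minimal sketch of the mulaw 8kHz → PCM16 16kHz step. It assumes standard G.711 mu-law decoding and a naive 2x linear-interpolation upsample. The function names (`muLawDecodeSample`, `muLawToPcm16k`) are hypothetical and this is not our production resampler.

```javascript
// Decode one G.711 mu-law byte to a signed 16-bit PCM sample.
function muLawDecodeSample(byte) {
  const u = ~byte & 0xff;            // mu-law bytes are stored bit-inverted
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -magnitude : magnitude;
}

// Decode a mu-law buffer (8 kHz) and upsample to 16 kHz by inserting the
// midpoint between neighbouring samples. Assumption: a real pipeline would
// likely use a proper low-pass resampler instead.
function muLawToPcm16k(muLawBytes) {
  const out = new Int16Array(muLawBytes.length * 2);
  for (let i = 0; i < muLawBytes.length; i++) {
    const cur = muLawDecodeSample(muLawBytes[i]);
    const next = i + 1 < muLawBytes.length
      ? muLawDecodeSample(muLawBytes[i + 1])
      : cur;
    out[2 * i] = cur;
    out[2 * i + 1] = (cur + next) >> 1;
  }
  return out;
}
```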

Bug Description

When we send realtimeInput.text with a script for the model to speak (e.g., a voicemail message), the WebSocket send() succeeds without error, but the model sometimes never produces audio output. It goes completely silent: no serverContent with audio data, no turnComplete, and no error.

This is critical for our voicemail delivery workflow. When our system detects a voicemail greeting, we inject a text instruction via realtimeInput.text telling the model to speak a specific voicemail message. Approximately 5% of the time, the model receives this injection but never speaks.
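To make "no audio, no turnComplete" concrete, here is a minimal sketch of how an incoming server message could be classified. It assumes the JSON shape `serverContent.modelTurn.parts[].inlineData.data` for audio chunks and `serverContent.turnComplete` for end of turn. The helper name `classifyServerMessage` is hypothetical.

```javascript
// Classify one raw WebSocket message from the Live API session.
// Returns 'audio', 'turnComplete', or 'other'.
function classifyServerMessage(raw) {
  // ws delivers Buffers by default; toString() also works for plain strings.
  const msg = JSON.parse(raw.toString());
  const serverContent = msg.serverContent;
  if (!serverContent) return 'other';

  // Assumption: audio arrives as base64 in inlineData.data on model turn parts.
  const parts = (serverContent.modelTurn && serverContent.modelTurn.parts) || [];
  const hasAudio = parts.some((p) => p.inlineData && p.inlineData.data);
  if (hasAudio) return 'audio';

  if (serverContent.turnComplete) return 'turnComplete';
  return 'other';
}
```

In a failing call, neither 'audio' nor 'turnComplete' is ever observed after the injection.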

Reproduction Steps

  1. Establish a Live API session with audio response modality
  2. Stream audio input (caller audio flowing normally)
  3. Send realtimeInput.text with a multi-sentence script (100-200 words)
  4. Wait for audio output — none arrives
  5. WebSocket remains open and healthy
  6. No error messages, no close codes
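Step 3 can be sketched as follows. This is a minimal illustration that assumes the client message is a JSON object with a `realtimeInput.text` field. The helper name `buildTextInjection` is hypothetical, and `ws` in the usage comment stands for an already-open WebSocket from the ws library.

```javascript
// Build the JSON frame for a realtimeInput.text injection.
function buildTextInjection(script) {
  if (typeof script !== 'string' || script.trim() === '') {
    throw new TypeError('script must be a non-empty string');
  }
  return JSON.stringify({ realtimeInput: { text: script } });
}

// Usage (assumes an open session):
//   ws.send(buildTextInjection('Hi, this is a message for ...'));
```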

Impact

Over a 14-day production window:

  • 21 voicemail calls had no transcript or a transcript under 50 characters (failed VM delivery)
  • 401 voicemail calls were delivered successfully
  • ~5% VM delivery failure rate (21 of 422 voicemail calls) from this specific issue
  • Each failed delivery means the caller never hears our voicemail: a wasted call and a missed lead contact
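The ~5% figure uses voicemail calls only as the denominator, not the ~600 total calls. A minimal sketch of the arithmetic (the function name `failureRate` is hypothetical):

```javascript
// Failure rate over voicemail calls only: failed / (failed + delivered).
function failureRate(failed, delivered) {
  const total = failed + delivered;
  if (total === 0) return 0; // avoid dividing by zero when there were no calls
  return failed / total;
}

const rate = failureRate(21, 401);
console.log(`${(rate * 100).toFixed(1)}%`); // 21 of 422 voicemail calls
```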

Workarounds Attempted

  1. Nudge timer (4 seconds): After injecting the VM script, we start a 4-second timer. If no audio output has arrived when it fires, we re-inject the script with stronger language ("You MUST speak now."). This recovers ~50% of frozen deliveries.
  2. 35-second safety timeout: If Gemini still hasn't spoken after 35 seconds, we hang up and log the failure.
  3. Multiple nudge attempts: Up to 2 re-injection attempts before giving up.
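The three workarounds combine into a small state machine. The sketch below is a simplified illustration, not our production code. It assumes each nudge waits 4 seconds after the previous injection and that the 35-second limit is measured from the first injection. The class name `InjectionWatchdog` and its methods are hypothetical, and time is passed in explicitly so the logic can be tested without real timers.

```javascript
// Tracks one voicemail injection and decides when to nudge or hang up.
class InjectionWatchdog {
  constructor({ nudgeAfterMs = 4000, maxNudges = 2, giveUpAfterMs = 35000 } = {}) {
    this.nudgeAfterMs = nudgeAfterMs;
    this.maxNudges = maxNudges;
    this.giveUpAfterMs = giveUpAfterMs;
    this.startedAt = null;
    this.lastSentAt = null;
    this.nudges = 0;
    this.done = false;
  }

  // Call right after the first realtimeInput.text injection is sent.
  start(nowMs) {
    this.startedAt = nowMs;
    this.lastSentAt = nowMs;
    this.nudges = 0;
    this.done = false;
  }

  // Call when the first audio chunk arrives after the injection.
  audioReceived() {
    this.done = true;
  }

  // Call periodically. Returns 'wait', 'nudge' (re-inject now), or 'hangup'.
  poll(nowMs) {
    if (this.done || this.startedAt === null) return 'wait';
    if (nowMs - this.startedAt >= this.giveUpAfterMs) {
      this.done = true;
      return 'hangup';
    }
    if (this.nudges < this.maxNudges && nowMs - this.lastSentAt >= this.nudgeAfterMs) {
      this.nudges += 1;
      this.lastSentAt = nowMs;
      return 'nudge';
    }
    return 'wait';
  }
}
```

A caller would run `poll(Date.now())` on an interval, re-send the script on 'nudge', and end the call and log the failure on 'hangup'.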

Expected Behavior

When realtimeInput.text is sent with a script, the model should produce audio output speaking the provided text. If the model cannot process the text for any reason, it should return an error or status signal rather than failing silently.

Questions for the Team

  1. Is there a known issue with realtimeInput.text being silently dropped?
  2. Is there a maximum text length that realtimeInput.text reliably handles?
  3. Does realtimeInput.text conflict with ongoing audio input processing? (We continue streaming caller audio while injecting text.)
  4. Is there a signal we can monitor to confirm the model received and is processing the text injection?
