[BidiGenerateContent] Model ignores realtimeInput.text injection — receives text but produces no audio response

## Environment
- **Model:** `gemini-3.1-flash-live-preview`
- **Voice:** `Leda`
- **Transport:** Raw WebSocket v1beta `wss://generativelanguage.googleapis.com/ws/.../BidiGenerateContent`
- **Audio pipeline:** Twilio MediaStreams (mulaw 8kHz) → PCM16 16kHz → Gemini → PCM16 24kHz → mulaw 8kHz
- **Platform:** Node.js (`ws` library, raw WebSocket, no SDK)
- **Scale:** ~600 production phone calls over the past 14 days

## Bug Description

When we send `realtimeInput.text` containing a script for the model to speak (e.g., a voicemail message), the WebSocket `send()` succeeds without error, but the model sometimes never produces audio output. It goes completely silent: no `serverContent` with audio data, no `turnComplete`, and no error.

This is critical for our voicemail delivery workflow. When our system detects a voicemail greeting, we inject a text instruction via `realtimeInput.text` telling the model to speak a specific voicemail message. Approximately 5% of the time, the model receives this injection but never speaks.

## Reproduction Steps

1. Establish a Live API session with audio response modality
2. Stream audio input (caller audio flowing normally)
3. Send `realtimeInput.text` with a multi-sentence script (100-200 words)
4. Wait for audio output. In the failing case, none arrives
5. Observe that the WebSocket remains open and healthy
6. Observe that no error messages or close codes are received
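The steps above correspond roughly to the messages sketched below. This is a minimal illustration, not our production code: the helper names are hypothetical, and the `setup`, `generationConfig`, and `realtimeInput.audio` field names are assumed from the public v1beta Live API schema. Only `realtimeInput.text`, the model name, and the voice come from this report.

```javascript
// Sketch only: field names other than realtimeInput.text are assumptions
// based on the v1beta Live API JSON schema.

// Step 1: first message on the socket (model, audio output, voice).
function buildSetupMessage(model, voiceName) {
  return {
    setup: {
      model: `models/${model}`,
      generationConfig: {
        responseModalities: ["AUDIO"],
        speechConfig: {
          voiceConfig: { prebuiltVoiceConfig: { voiceName } },
        },
      },
    },
  };
}

// Step 2: one chunk of caller audio (PCM16, 16 kHz), base64-encoded.
function buildAudioMessage(pcm16Buffer) {
  return {
    realtimeInput: {
      audio: {
        data: pcm16Buffer.toString("base64"),
        mimeType: "audio/pcm;rate=16000",
      },
    },
  };
}

// Step 3: the text injection that intermittently yields no audio response.
function buildTextInjection(script) {
  return { realtimeInput: { text: script } };
}

// Usage with any object exposing send(string), e.g. a `ws` WebSocket:
// socket.send(JSON.stringify(buildSetupMessage("gemini-3.1-flash-live-preview", "Leda")));
// socket.send(JSON.stringify(buildAudioMessage(chunk)));
// socket.send(JSON.stringify(buildTextInjection(vmScript)));
```

Audio chunks keep flowing between steps 2 and 3; the text injection is sent on the same socket without pausing the audio stream.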

## Impact

Over a 14-day production window:
- **21 voicemail calls** had no transcript or a transcript under 50 characters (failed VM delivery)
- **401 voicemail calls** delivered successfully
- 21 of 422 voicemail calls failed, a ~5% VM delivery failure rate from this specific issue
- Each failed delivery means the caller never hears our voicemail: a wasted call and a missed lead contact
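The ~5% figure follows directly from the two counts above. A small sketch of the arithmetic, with a hypothetical helper name:

```javascript
// Failure rate = failed / (failed + delivered), as a percentage string.
function failureRatePercent(failed, delivered) {
  const total = failed + delivered;
  return ((failed / total) * 100).toFixed(1);
}

console.log(`${failureRatePercent(21, 401)}%`); // prints "5.0%"
```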

## Workarounds Attempted

1. **Nudge timer (4 seconds):** After injecting the VM script, we start a 4-second timer. If no audio output arrives, we re-inject the script with stronger language ("You MUST speak now."). This recovers ~50% of frozen deliveries.
2. **35-second safety timeout:** If Gemini still hasn't spoken after 35 seconds, we hang up and log the failure.
3. **Multiple nudge attempts:** Up to 2 re-injection attempts before giving up.
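The three workarounds combine into a single watchdog, sketched below. All names are hypothetical. `inject` stands for whatever sends `realtimeInput.text`, `onFailure` for the hang-up and logging path, and appending "You MUST speak now." to the original script is an assumption about how the stronger wording is applied. The timer functions are injectable so the logic can be exercised without waiting in real time.

```javascript
// Hypothetical watchdog: nudge after 4 s (up to 2 times), give up after 35 s.
function createDeliveryWatchdog(opts) {
  const {
    inject,               // (text) => void, sends realtimeInput.text
    onFailure,            // () => void, e.g. hang up and log the failure
    nudgeMs = 4000,
    maxNudges = 2,
    timeoutMs = 35000,
    setTimer = setTimeout,
    clearTimer = clearTimeout,
  } = opts;

  let nudgeTimer = null;
  let safetyTimer = null;
  let nudges = 0;
  let done = false;

  // Cancel all pending timers and mark the delivery as resolved.
  function stop() {
    done = true;
    if (nudgeTimer !== null) clearTimer(nudgeTimer);
    if (safetyTimer !== null) clearTimer(safetyTimer);
    nudgeTimer = null;
    safetyTimer = null;
  }

  // Re-inject the script with stronger wording if no audio has arrived.
  function armNudge(script) {
    nudgeTimer = setTimer(() => {
      if (done) return;
      nudges += 1;
      inject(`${script}\n\nYou MUST speak now.`);
      if (nudges < maxNudges) armNudge(script);
    }, nudgeMs);
  }

  return {
    // Send the script and start both the nudge and safety timers.
    start(script) {
      inject(script);
      armNudge(script);
      safetyTimer = setTimer(() => {
        if (done) return;
        stop();
        onFailure();
      }, timeoutMs);
    },
    // Call when the first audio chunk arrives from the model.
    onAudio() {
      if (!done) stop();
    },
    get nudges() {
      return nudges;
    },
  };
}
```

Calling `onAudio()` from the WebSocket message handler as soon as model audio is detected cancels any pending nudge and the safety timeout.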

## Expected Behavior

When `realtimeInput.text` is sent with a script, the model should produce audio output speaking the provided text. If the model cannot process the text for any reason, it should return an error or status signal rather than failing silently.

## Questions for the Team

1. Is there a known issue with `realtimeInput.text` being silently dropped?
2. Is there a maximum text length that `realtimeInput.text` reliably handles?
3. Does `realtimeInput.text` conflict with ongoing audio input processing? (We continue streaming caller audio while injecting text.)
4. Is there a signal we can monitor to confirm the model received and is processing the text injection?
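Regarding question 4: absent a dedicated acknowledgement, one client-side approximation is to classify each incoming server message and treat the first audio chunk as proof that the injection was picked up. The sketch below assumes audio arrives under `serverContent.modelTurn.parts[].inlineData` with an `audio/...` MIME type. That layout is our reading of the v1beta schema rather than something confirmed in this report, and `classifyServerMessage` is a hypothetical helper.

```javascript
// Classify a raw WebSocket frame (string or Buffer) from the Live API.
// Returns "audio", "turnComplete", "other", or "unparseable".
function classifyServerMessage(raw) {
  let msg;
  try {
    msg = JSON.parse(raw.toString());
  } catch {
    return "unparseable";
  }
  const content = msg && msg.serverContent;
  if (!content) return "other";

  // Assumed layout: audio chunks are inlineData parts of the model turn.
  const parts = (content.modelTurn && content.modelTurn.parts) || [];
  const hasAudio = parts.some(
    (p) =>
      p.inlineData &&
      typeof p.inlineData.mimeType === "string" &&
      p.inlineData.mimeType.startsWith("audio/")
  );
  if (hasAudio) return "audio";
  if (content.turnComplete) return "turnComplete";
  return "other";
}
```

In the failing case described above, neither `"audio"` nor `"turnComplete"` is ever observed after the injection, which is why a positive acknowledgement signal would help.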

## Related Issues
- google-gemini/cookbook#1225 (audio output freeze — same root cause may apply)
- google-gemini/cookbook#1197 (our previous report — 13 issues)
