[BidiGenerateContent] Model audio output freezes mid-conversation: stops producing audio with no error #1225

## Environment
- **Model:** `gemini-3.1-flash-live-preview`
- **Voice:** `Leda`
- **Transport:** Raw WebSocket v1beta `wss://generativelanguage.googleapis.com/ws/.../BidiGenerateContent`
- **Audio pipeline:** Twilio MediaStreams (mulaw 8kHz) → PCM16 16kHz → Gemini → PCM16 24kHz → mulaw 8kHz
- **VAD:** `automaticActivityDetection` — `START_SENSITIVITY_HIGH`, `END_SENSITIVITY_HIGH`, `prefixPaddingMs: 150`, `silenceDurationMs: 700`
- **System instruction:** ~17K characters (~4K tokens)
- **Platform:** Node.js (`ws` library, raw WebSocket, no SDK)
- **Scale:** ~600 production phone calls over the past 14 days
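The sketch below shows the inbound leg of the audio pipeline (mulaw 8kHz to PCM16 16kHz) so the expected wire format is explicit. It is a minimal sketch, not the production implementation. It assumes standard G.711 µ-law payloads from Twilio, uses naive 2× linear-interpolation upsampling, and assumes the `realtimeInput.audio` message shape with `mimeType: 'audio/pcm;rate=16000'`. The helper names are hypothetical.

```javascript
// Hypothetical helpers sketching the inbound leg:
// Twilio mulaw 8 kHz -> PCM16 16 kHz -> Gemini realtimeInput message.

// Decode one G.711 mu-law byte to a signed 16-bit linear sample.
function mulawToLinear(byte) {
  const u = ~byte & 0xff; // mu-law bytes are stored bit-inverted
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -magnitude : magnitude;
}

// Upsample 8 kHz -> 16 kHz by inserting the midpoint between neighbors.
// Linear interpolation is a simplification; a production resampler
// would normally low-pass filter as well.
function upsample2x(samples) {
  const out = new Int16Array(samples.length * 2);
  for (let i = 0; i < samples.length; i++) {
    const cur = samples[i];
    const next = i + 1 < samples.length ? samples[i + 1] : cur;
    out[2 * i] = cur;
    out[2 * i + 1] = Math.round((cur + next) / 2);
  }
  return out;
}

// Convert a base64 Twilio media payload into a realtimeInput message.
function twilioPayloadToRealtimeInput(base64Mulaw) {
  const mulaw = Buffer.from(base64Mulaw, 'base64');
  const pcm8k = new Int16Array(mulaw.length);
  for (let i = 0; i < mulaw.length; i++) pcm8k[i] = mulawToLinear(mulaw[i]);
  const pcm16k = upsample2x(pcm8k);
  // Write samples explicitly as little-endian 16-bit PCM.
  const bytes = Buffer.alloc(pcm16k.length * 2);
  pcm16k.forEach((s, i) => bytes.writeInt16LE(s, i * 2));
  return {
    realtimeInput: {
      audio: { data: bytes.toString('base64'), mimeType: 'audio/pcm;rate=16000' }
    }
  };
}
```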

## Bug Description

The model randomly stops producing audio output mid-conversation. No error is returned and the WebSocket connection stays open, but `serverContent` messages carrying model audio stop arriving. The model simply goes silent.

This is the single biggest issue affecting our production voice agent. It happens on both inbound and outbound calls, at any point in the conversation — during the greeting, mid-sentence, or after processing caller input.

## Reproduction

This is non-deterministic and cannot be reliably reproduced. It happens across different callers, different times of day, and different conversation topics. The only consistent pattern is:
- WebSocket is open and healthy
- Audio input is flowing from the caller (we can see raw audio chunks arriving)
- `inputTranscription` events are still arriving (caller speech is being transcribed)
- But `serverContent` with audio data stops completely
- No `turnComplete`, no `generationComplete`, no error — just silence
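The pattern above can be expressed as a simple detector. Below is a minimal sketch. It assumes server messages are already parsed from JSON and that timestamps are supplied in milliseconds. `createFreezeDetector` and the 5-second `freezeMs` default are hypothetical choices, not values from the production watchdog. Model audio is detected from `serverContent.modelTurn.parts[].inlineData`, and `inputTranscription` is read from the same `serverContent` messages.

```javascript
// Hypothetical freeze detector mirroring the reproduction pattern.
// Timestamps (ms) are passed in explicitly so the logic runs without a socket.
function createFreezeDetector({ freezeMs = 5000 } = {}) {
  const state = {
    lastModelAudioAt: null, // last modelTurn part carrying inlineData
    lastInputTextAt: null, // last serverContent.inputTranscription
    turnOpen: false // model audio seen since the last turn boundary
  };

  function onServerMessage(msg, now) {
    const sc = msg.serverContent;
    if (!sc) return;
    const parts = sc.modelTurn?.parts ?? [];
    if (parts.some((p) => p.inlineData)) {
      state.lastModelAudioAt = now;
      state.turnOpen = true;
    }
    if (sc.inputTranscription) state.lastInputTextAt = now;
    if (sc.turnComplete || sc.generationComplete || sc.interrupted) {
      state.turnOpen = false;
    }
  }

  // Returns a reason string when the session looks frozen, otherwise null.
  function check(now) {
    const { lastModelAudioAt, lastInputTextAt, turnOpen } = state;
    // Caller speech was transcribed, but no model audio has followed it.
    if (lastInputTextAt !== null &&
        (lastModelAudioAt === null || lastInputTextAt > lastModelAudioAt) &&
        now - lastInputTextAt >= freezeMs) {
      return 'no-audio-after-input';
    }
    // Model audio stopped mid-turn with no turnComplete/generationComplete.
    if (turnOpen && now - lastModelAudioAt >= freezeMs) {
      return 'stalled-mid-turn';
    }
    return null;
  }

  return { onServerMessage, check, state };
}
```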

## Impact

Over a 14-day production window (603 total calls):
- **66 calls** ended with `outcome: unknown` — the majority caused by this audio freeze
- **65 of those** were under 15 seconds — the model froze before any meaningful conversation could happen
- Our server-side watchdog kills frozen calls after 10 seconds of mutual silence
- **Real example ([Caller], May 9 2026):** The caller phoned in and the AI agent greeted them: "Hi, thanks for calling [Company]." The caller responded: "I I I you guys sent me a message because I have some selling my house at this." Gemini then froze, and the call died at 12 seconds. The caller had to call back 4.5 hours later to get through.

## Workarounds Attempted

1. **Dead call watchdog** — 10-second silence timer kills frozen calls and attempts voicemail delivery. Works but loses the live conversation.
2. **Text nudge injection** — When freeze is detected and caller speech was recent, we inject `realtimeInput.text` telling the model to respond. Works ~30% of the time.
3. **Post-interruption nudge** — 4-second timer after `serverContent.interrupted` events, since freezes often follow interruptions. Nudges the model to resume.
4. **Pre-warm WebSocket** — We open the Gemini WebSocket and send the setup message during the Twilio TwiML fetch (before call connects) to eliminate cold-start. This helps with first-turn latency but does not prevent mid-conversation freezes.
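Workarounds 1 to 3 can be combined into one polling watchdog. The sketch below is minimal and rests on these assumptions: `send` serializes and writes a message to the Gemini socket, `hangUp` ends the Twilio call (and, in production, triggers the voicemail fallback), and a timer drives `tick(now)`. The 10-second kill and 4-second post-interruption thresholds come from the workarounds above. `CallWatchdog`, `speechNudgeMs`, and the nudge text are hypothetical.

```javascript
// Hypothetical watchdog combining workarounds 1-3. `send` and `hangUp` are
// injected callbacks; tick(now) is called periodically (e.g. every 250 ms).
class CallWatchdog {
  constructor({ send, hangUp, startedAt = 0, silenceKillMs = 10000,
                postInterruptNudgeMs = 4000, speechNudgeMs = 3000 } = {}) {
    this.send = send;
    this.hangUp = hangUp;
    this.silenceKillMs = silenceKillMs;
    this.postInterruptNudgeMs = postInterruptNudgeMs;
    this.speechNudgeMs = speechNudgeMs;
    this.lastModelAudioAt = startedAt;
    this.lastCallerSpeechAt = startedAt;
    this.interruptedAt = null;
    this.nudgedAt = null;
    this.killed = false;
  }

  onModelAudio(now) { this.lastModelAudioAt = now; this.interruptedAt = null; }
  onCallerSpeech(now) { this.lastCallerSpeechAt = now; }
  onInterrupted(now) { this.interruptedAt = now; }

  nudge(now, reason) {
    this.nudgedAt = now;
    // Workaround 2: ask the model to respond via realtimeInput.text.
    this.send({ realtimeInput: { text: 'Please respond to the caller.' } });
    return `nudge:${reason}`;
  }

  // Returns the action taken on this tick, or null.
  tick(now) {
    if (this.killed) return null;
    const lastActivity = Math.max(this.lastModelAudioAt, this.lastCallerSpeechAt);
    // Workaround 1: mutual silence ends the call.
    if (now - lastActivity >= this.silenceKillMs) {
      this.killed = true;
      this.hangUp('mutual-silence');
      return 'kill';
    }
    // Nudge at most once per stall: skip if already nudged since the last model audio.
    if (this.nudgedAt !== null && this.nudgedAt >= this.lastModelAudioAt) return null;
    // Workaround 3: an interruption that is not followed by model audio.
    if (this.interruptedAt !== null && now - this.interruptedAt >= this.postInterruptNudgeMs) {
      return this.nudge(now, 'post-interruption');
    }
    // Caller spoke after the model's last audio, but no reply has started.
    if (this.lastCallerSpeechAt > this.lastModelAudioAt &&
        now - this.lastCallerSpeechAt >= this.speechNudgeMs) {
      return this.nudge(now, 'caller-speech');
    }
    return null;
  }
}
```

Nudges are limited to one per stall (until new model audio arrives), so a frozen session is not flooded with `realtimeInput.text` messages before the kill timer fires.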

None of these fix the root cause. The model simply stops generating audio and no amount of text injection or waiting recovers it reliably.

## Configuration

```javascript
// Setup message sent as the first frame on the BidiGenerateContent WebSocket.
const setupMessage = {
  setup: {
    model: 'models/gemini-3.1-flash-live-preview',
    generationConfig: {
      responseModalities: ['AUDIO'],
      speechConfig: {
        voiceConfig: { prebuiltVoiceConfig: { voiceName: 'Leda' } }
      }
    },
    realtimeInputConfig: {
      automaticActivityDetection: {
        startOfSpeechSensitivity: 'START_SENSITIVITY_HIGH',
        endOfSpeechSensitivity: 'END_SENSITIVITY_HIGH',
        prefixPaddingMs: 150,
        silenceDurationMs: 700
      },
      activityHandling: 'START_OF_ACTIVITY_INTERRUPTS',
      turnCoverage: 'TURN_INCLUDES_ONLY_ACTIVITY'
    },
    contextWindowCompression: {
      slidingWindow: {}
    },
    sessionResumption: {},
    systemInstruction: { parts: [{ text: '...' }] }, // ~17K characters
    inputAudioTranscription: {},
    outputAudioTranscription: {}
  }
};

// Sent once the socket opens: ws.send(JSON.stringify(setupMessage));
```
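Because the setup enables `sessionResumption: {}`, the server periodically sends `sessionResumptionUpdate` messages. One recovery experiment is to keep the latest resumable handle and reconnect with it when a freeze is detected. Whether this clears the stall is untested here. Below is a minimal sketch of the bookkeeping, assuming the `newHandle`/`resumable` fields of `sessionResumptionUpdate` and the `sessionResumption.handle` setup field. `createResumptionTracker` and `buildResumeSetup` are hypothetical helpers.

```javascript
// Hypothetical bookkeeping for session resumption. With sessionResumption
// enabled, the server sends { sessionResumptionUpdate: { newHandle, resumable } };
// only handles marked resumable are kept.
function createResumptionTracker() {
  let latestHandle = null;
  return {
    onServerMessage(msg) {
      const update = msg.sessionResumptionUpdate;
      if (update && update.resumable && update.newHandle) {
        latestHandle = update.newHandle;
      }
    },
    get handle() {
      return latestHandle;
    }
  };
}

// Build the setup message for a reconnect, reusing the original setup
// fields and attaching the last resumable handle if one is available.
function buildResumeSetup(originalSetup, handle) {
  return {
    setup: {
      ...originalSetup,
      sessionResumption: handle ? { handle } : {}
    }
  };
}
```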

## Questions for the Team

1. Is there a known issue with audio generation stalling on `gemini-3.1-flash-live-preview`?
2. Does `contextWindowCompression: { slidingWindow: {} }` affect audio output stability? GitHub issue #117 in `google-gemini/live-api-web-console` suggests a correlation.
3. Is there a recommended recovery mechanism when the model stops producing audio but the WebSocket remains open?
4. Are there diagnostic signals we should be monitoring that would predict or explain these freezes?
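For question 4, freeze reports become easier to compare if each call logs which signals every server message carried and dumps the tail of that log when the watchdog fires. The log then shows, for example, whether an `interrupted` or `sessionResumptionUpdate` immediately preceded the last audio chunk. Below is a minimal sketch. The signal list covers fields of the Live API server message, and `createTimeline` and the 200-entry ring buffer are hypothetical choices.

```javascript
// Hypothetical per-call timeline: records which signals each server message
// carried, keeping only the most recent `limit` entries.
const SIGNALS = [
  ['audio', (m) => (m.serverContent?.modelTurn?.parts ?? []).some((p) => p.inlineData)],
  ['inputTranscription', (m) => Boolean(m.serverContent?.inputTranscription)],
  ['outputTranscription', (m) => Boolean(m.serverContent?.outputTranscription)],
  ['interrupted', (m) => Boolean(m.serverContent?.interrupted)],
  ['generationComplete', (m) => Boolean(m.serverContent?.generationComplete)],
  ['turnComplete', (m) => Boolean(m.serverContent?.turnComplete)],
  ['goAway', (m) => Boolean(m.goAway)],
  ['sessionResumptionUpdate', (m) => Boolean(m.sessionResumptionUpdate)],
  ['usageMetadata', (m) => Boolean(m.usageMetadata)]
];

function createTimeline(limit = 200) {
  const entries = [];
  return {
    record(msg, elapsedMs) {
      const kinds = SIGNALS.filter(([, test]) => test(msg)).map(([name]) => name);
      entries.push({ t: elapsedMs, kinds });
      if (entries.length > limit) entries.shift(); // drop the oldest entry
    },
    // Elapsed time of the most recent retained message carrying each signal.
    lastSeen() {
      const out = {};
      for (const { t, kinds } of entries) for (const k of kinds) out[k] = t;
      return out;
    },
    entries
  };
}
```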

## Related Issues
- google-gemini/live-api-web-console#117 (audio stops midway)
- google-gemini/cookbook#977 (LiveAPI stop talking)
- google-gemini/cookbook#1197 (our previous report — issues 1-13)
- googleapis/js-genai#707 (premature turnComplete)
