[BidiGenerateContent] Model audio output freezes mid-conversation — stops producing audio with no error #1225

@Hprg

Environment

  • Model: gemini-3.1-flash-live-preview
  • Voice: Leda
  • Transport: Raw WebSocket v1beta wss://generativelanguage.googleapis.com/ws/.../BidiGenerateContent
  • Audio pipeline: Twilio MediaStreams (mulaw 8kHz) → PCM16 16kHz → Gemini → PCM16 24kHz → mulaw 8kHz
  • VAD: automaticActivityDetection with START_SENSITIVITY_HIGH, END_SENSITIVITY_HIGH, prefixPaddingMs: 150, silenceDurationMs: 700
  • System instruction: ~17K characters (~4K tokens)
  • Platform: Node.js (ws library, raw WebSocket, no SDK)
  • Scale: ~600 production phone calls over the past 14 days
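
The inbound half of the audio pipeline listed above (mulaw 8kHz → PCM16 16kHz) can be sketched roughly as follows. This is a minimal sketch, not the production code behind this report: the function names are hypothetical, the decoder is the standard G.711 mu-law expansion, linear interpolation stands in for a proper low-pass resampler, and the input is assumed to be the base64 mu-law payload of a Twilio media event.

```javascript
// Expand one 8-bit G.711 mu-law byte to a signed 16-bit PCM sample.
function muLawToPcm16(byte) {
  const u = ~byte & 0xff;            // mu-law bytes are stored bit-inverted
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -magnitude : magnitude;
}

// Hypothetical helper: base64 mu-law @ 8 kHz -> Buffer of PCM16 LE @ 16 kHz.
// Each input sample yields two output samples: the sample itself and the
// midpoint to the next sample (the last sample of a chunk is repeated).
function twilioPayloadToPcm16k(base64Payload) {
  const mulaw = Buffer.from(base64Payload, 'base64');
  const out = Buffer.alloc(mulaw.length * 4); // 2x samples, 2 bytes each
  for (let i = 0; i < mulaw.length; i++) {
    const cur = muLawToPcm16(mulaw[i]);
    const next = i + 1 < mulaw.length ? muLawToPcm16(mulaw[i + 1]) : cur;
    out.writeInt16LE(cur, i * 4);
    out.writeInt16LE((cur + next) >> 1, i * 4 + 2);
  }
  return out;
}
```

Because interpolation restarts at every chunk, this sketch introduces a small discontinuity at chunk boundaries; carrying the previous chunk's last sample forward would avoid that.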

Bug Description

The model randomly stops producing audio output mid-conversation. The WebSocket connection stays open and no error is returned, but serverContent messages carrying audio stop arriving. The model simply goes silent.

This is the single biggest issue affecting our production voice agent. It happens on both inbound and outbound calls, at any point in the conversation — during the greeting, mid-sentence, or after processing caller input.

Reproduction

This is non-deterministic and cannot be reliably reproduced. It happens across different callers, different times of day, and different conversation topics. The only consistent pattern is:

  • WebSocket is open and healthy
  • Audio input is flowing from the caller (we can see raw audio chunks arriving)
  • inputTranscription events are still arriving (caller speech is being transcribed)
  • But serverContent with audio data stops completely
  • No turnComplete, no generationComplete, no error — just silence
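
The pattern above can be turned into a rough detector. The sketch below is illustrative only: `classifyServerMessage` and `StallTracker` are hypothetical names, the threshold is arbitrary, and the location of audio under `serverContent.modelTurn.parts[].inlineData` is an assumption about the wire format rather than something stated in this report.

```javascript
// Summarize which signals a decoded server message carries.
function classifyServerMessage(msg) {
  const sc = (msg && msg.serverContent) || {};
  // Assumption: model audio arrives as inlineData parts under modelTurn.
  const parts = (sc.modelTurn && sc.modelTurn.parts) || [];
  return {
    hasAudio: parts.some((p) => Boolean(p.inlineData && p.inlineData.data)),
    hasInputTranscription: Boolean(sc.inputTranscription && sc.inputTranscription.text),
    turnComplete: sc.turnComplete === true,
    generationComplete: sc.generationComplete === true,
    interrupted: sc.interrupted === true,
  };
}

// Tracks when model audio and caller transcription were last seen.
class StallTracker {
  constructor(thresholdMs) {
    this.thresholdMs = thresholdMs;
    this.lastAudioAt = null;
    this.lastCallerSpeechAt = null;
  }

  // Record one server message; nowMs is passed in so the logic is testable.
  observe(msg, nowMs) {
    const kind = classifyServerMessage(msg);
    if (kind.hasAudio) this.lastAudioAt = nowMs;
    if (kind.hasInputTranscription) this.lastCallerSpeechAt = nowMs;
    return kind;
  }

  // Stalled: caller speech was transcribed, no model audio has arrived
  // since, and at least thresholdMs has passed since that transcription.
  isStalled(nowMs) {
    if (this.lastCallerSpeechAt === null) return false;
    const unanswered =
      this.lastAudioAt === null || this.lastAudioAt < this.lastCallerSpeechAt;
    return unanswered && nowMs - this.lastCallerSpeechAt >= this.thresholdMs;
  }
}
```

A caller who keeps talking keeps resetting `lastCallerSpeechAt`, so this only fires once the caller has gone quiet and the model has still not answered.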

Impact

Over a 14-day production window (603 total calls):

  • 66 calls (about 11% of the 603) ended with outcome: unknown, the majority caused by this audio freeze
  • 65 of those were under 15 seconds — the model froze before any meaningful conversation could happen
  • Our server-side watchdog kills frozen calls after 10 seconds of mutual silence
  • Real example ([Caller], May 9 2026): The caller phoned in and the AI agent greeted them: "Hi, thanks for calling [Company]." The caller responded: "I I I you guys sent me a message because I have some selling my house at this." Gemini then froze and the call died at 12 seconds. The caller had to call back 4.5 hours later to get through.

Workarounds Attempted

  1. Dead call watchdog — 10-second silence timer kills frozen calls and attempts voicemail delivery. Works but loses the live conversation.
  2. Text nudge injection — When a freeze is detected and caller speech was recent, we inject a realtimeInput.text message telling the model to respond. This works about 30% of the time.
  3. Post-interruption nudge — 4-second timer after serverContent.interrupted events, since freezes often follow interruptions. Nudges the model to resume.
  4. Pre-warm WebSocket — We open the Gemini WebSocket and send the setup message during the Twilio TwiML fetch (before the call connects) to eliminate cold-start delay. This helps with first-turn latency but does not prevent mid-conversation freezes.

None of these fix the root cause. The model simply stops generating audio and no amount of text injection or waiting recovers it reliably.
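
For anyone wanting to reproduce workarounds 1 and 3, the timing logic can be sketched as a pure decision function. This is a hedged sketch under stated assumptions: `decideRecovery`, `buildNudgeMessage`, and the state field names are hypothetical, only the 10-second and 4-second values come from this report, and workaround 2 is left out because the report gives no detection threshold for it.

```javascript
// state fields are millisecond timestamps:
//   lastModelAudioAt, lastCallerSpeechAt: numbers (start at call-connect time)
//   lastInterruptedAt, lastNudgeAt: number or null
// Returns 'kill', 'nudge', or 'wait'.
function decideRecovery(state, nowMs, opts = {}) {
  const { killAfterMs = 10000, nudgeAfterInterruptMs = 4000 } = opts;

  // Workaround 1: mutual silence, neither side active for killAfterMs.
  const lastActivityAt = Math.max(state.lastModelAudioAt, state.lastCallerSpeechAt);
  if (nowMs - lastActivityAt >= killAfterMs) return 'kill';

  // Workaround 3: an interruption with no model audio since, nudged at most once.
  if (state.lastInterruptedAt !== null &&
      state.lastInterruptedAt >= state.lastModelAudioAt) {
    const alreadyNudged =
      state.lastNudgeAt !== null && state.lastNudgeAt >= state.lastInterruptedAt;
    if (!alreadyNudged && nowMs - state.lastInterruptedAt >= nudgeAfterInterruptMs) {
      return 'nudge';
    }
  }
  return 'wait';
}

// Wraps nudge text in the realtimeInput.text shape the report describes.
// The wording of the nudge itself is up to the caller.
function buildNudgeMessage(text) {
  return JSON.stringify({ realtimeInput: { text } });
}
```

Keeping the decision separate from the timers and the socket makes it straightforward to replay logged timestamps from a frozen call and see which branch would have fired.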

Configuration

{
  generationConfig: {
    responseModalities: ['AUDIO'],
    speechConfig: {
      voiceConfig: { prebuiltVoiceConfig: { voiceName: 'Leda' } }
    }
  },
  realtimeInputConfig: {
    automaticActivityDetection: {
      startOfSpeechSensitivity: 'START_SENSITIVITY_HIGH',
      endOfSpeechSensitivity: 'END_SENSITIVITY_HIGH',
      prefixPaddingMs: 150,
      silenceDurationMs: 700
    },
    activityHandling: 'START_OF_ACTIVITY_INTERRUPTS',
    turnCoverage: 'TURN_INCLUDES_ONLY_ACTIVITY'
  },
  contextWindowCompression: {
    slidingWindow: {}
  },
  sessionResumption: {},
  systemInstruction: { parts: [{ text: '...' }] },
  inputAudioTranscription: {},
  outputAudioTranscription: {}
}
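
To test whether contextWindowCompression plays a role, the same configuration could be sent with and without it and the freeze rates compared. The helper below is only a sketch: `buildSetupMessage` is a hypothetical name, and both the top-level `setup` envelope and the `models/` prefix on the model name are assumptions about the raw WebSocket protocol that should be checked against the API reference.

```javascript
// Builds the first WebSocket message from a config object like the one above.
// useSlidingWindow=false drops contextWindowCompression for A/B comparison.
function buildSetupMessage(model, config, { useSlidingWindow = true } = {}) {
  const setup = { model, ...config }; // shallow copy, config is not mutated
  if (!useSlidingWindow) delete setup.contextWindowCompression;
  return JSON.stringify({ setup });
}

// Example: two variants of the same session config.
const exampleConfig = {
  generationConfig: { responseModalities: ['AUDIO'] },
  contextWindowCompression: { slidingWindow: {} },
  sessionResumption: {},
};
const withCompression = buildSetupMessage(
  'models/gemini-3.1-flash-live-preview', exampleConfig);
const withoutCompression = buildSetupMessage(
  'models/gemini-3.1-flash-live-preview', exampleConfig, { useSlidingWindow: false });
```

Tagging each call with the variant it used would let the outcome: unknown rate be split by variant.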

Questions for the Team

  1. Is there a known issue with audio generation stalling on gemini-3.1-flash-live-preview?
  2. Does contextWindowCompression: { slidingWindow: {} } affect audio output stability? Issue #117 in google-gemini/live-api-web-console (linked as "Automatically add specific labels to PRs") suggests a correlation.
  3. Is there a recommended recovery mechanism when the model stops producing audio but the WebSocket remains open?
  4. Are there diagnostic signals we should be monitoring that would predict or explain these freezes?
