[BidiGenerateContent] Model freezes after serverContent.interrupted — no audio output after barge-in #1228

@Hprg

Environment

  • Model: gemini-3.1-flash-live-preview
  • Voice: Leda
  • Transport: Raw WebSocket v1beta wss://generativelanguage.googleapis.com/ws/.../BidiGenerateContent
  • Audio pipeline: Twilio MediaStreams (mulaw 8kHz) → PCM16 16kHz → Gemini → PCM16 24kHz → mulaw 8kHz
  • Platform: Node.js (ws library, raw WebSocket, no SDK)
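For anyone trying to reproduce this, the inbound half of the audio pipeline (Twilio mulaw 8kHz to PCM16 16kHz) can be sketched roughly as below. This is a minimal illustration, not our production code: the function names are hypothetical, the decode follows the standard G.711 mu-law expansion, and the 2x upsampling uses plain linear interpolation where a real pipeline would likely use a proper low-pass resampler.

```javascript
// Decode one G.711 mu-law byte to a signed 16-bit PCM sample.
function muLawDecodeSample(byte) {
  const u = ~byte & 0xff;             // mu-law bytes are stored bit-inverted
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -magnitude : magnitude;
}

// Hypothetical helper: mu-law 8 kHz bytes -> PCM16 samples at 16 kHz.
// Assumption: linear interpolation is good enough for a repro; it is not
// a band-limited resampler.
function muLaw8kToPcm16k(muLawBytes) {
  const n = muLawBytes.length;
  const out = new Int16Array(n * 2);
  for (let i = 0; i < n; i++) {
    const cur = muLawDecodeSample(muLawBytes[i]);
    const next = i + 1 < n ? muLawDecodeSample(muLawBytes[i + 1]) : cur;
    out[2 * i] = cur;                     // original sample
    out[2 * i + 1] = (cur + next) >> 1;   // midpoint between neighbours
  }
  return out;
}
```

The returned Int16Array can be wrapped with Buffer.from(out.buffer) and base64-encoded before it is sent upstream.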

Bug Description

When the caller interrupts the model (barge-in) and serverContent.interrupted fires, the model sometimes fails to resume generating audio after the interruption. The caller finishes speaking and waits for a response, but the model produces no audio output. The WebSocket remains open and no error is returned.

Barge-in sequence (step 5 is where it fails):

  1. Model is speaking
  2. Caller talks over the model → serverContent.interrupted fires
  3. Model stops speaking (correct)
  4. Caller finishes their turn
  5. Model should respond with audio → but sometimes it never does

The model enters a state where it has acknowledged the interruption (stopped its own audio) but never starts generating a new response. The caller is left in silence.
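To make the stuck state observable, the sequence above can be tracked on the client roughly as follows. This is a sketch under assumptions: classifyServerMessage and BargeInTrace are hypothetical names, and apart from serverContent.interrupted the field names (modelTurn.parts[].inlineData.data, turnComplete) reflect our reading of the server message shape and should be checked against the API reference.

```javascript
// Classify a raw WebSocket frame from the server.
// Assumption: frames are JSON; only serverContent.interrupted is taken
// from this report, the other fields are our reading of the message shape.
function classifyServerMessage(raw) {
  const msg = JSON.parse(raw.toString());
  const sc = msg.serverContent;
  if (!sc) return 'other';
  if (sc.interrupted) return 'interrupted';
  const parts = (sc.modelTurn && sc.modelTurn.parts) || [];
  if (parts.some((p) => p.inlineData && p.inlineData.data)) return 'audio';
  if (sc.turnComplete) return 'turnComplete';
  return 'other';
}

// Hypothetical tracker: remembers whether any model audio has arrived
// since the most recent interruption.
class BargeInTrace {
  constructor() {
    this.awaitingAudio = false;
    this.log = [];
  }
  record(kind, nowMs) {
    if (kind === 'interrupted') this.awaitingAudio = true;
    if (kind === 'audio') this.awaitingAudio = false;
    this.log.push({ kind, nowMs });
  }
}
```

In the failing calls, awaitingAudio would stay true after the caller stops speaking, with no further 'audio' entries in the log.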

Impact

This is particularly damaging on phone calls because:

  • The caller spoke (they're engaged and waiting for a response)
  • Silence after someone speaks is unnatural and causes callers to hang up
  • These are live sales conversations — losing them means losing business

Workarounds Implemented

  1. Post-interruption nudge timer (4 seconds): After every serverContent.interrupted event, we start a 4-second timer. If the model hasn't produced audio within 4 seconds, we inject realtimeInput.text: "Your previous response was interrupted. Respond now with a SHORT reply." This recovers the model in some cases.
  2. Dead call watchdog (10 seconds): If mutual silence persists for 10 seconds (including after nudge attempts), we kill the call.
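The two workarounds above amount to a small state machine, sketched below. This is a simplified illustration rather than our exact code: SilenceWatch and its method names are hypothetical, time is passed in explicitly so the logic stays testable, and onCallerSpeech assumes some caller-side speech detector exists (Twilio streams audio continuously, including silence).

```javascript
const NUDGE_TEXT =
  'Your previous response was interrupted. Respond now with a SHORT reply.';

// Hypothetical watchdog: decides when to nudge the model and when to give up.
class SilenceWatch {
  constructor({ nudgeAfterMs = 4000, hangupAfterMs = 10000 } = {}) {
    this.nudgeAfterMs = nudgeAfterMs;
    this.hangupAfterMs = hangupAfterMs;
    this.interruptedAt = null; // set while we wait for post-interruption audio
    this.lastSoundAt = null;   // last time either side made sound
  }
  onInterrupted(nowMs) {
    this.interruptedAt = nowMs;
    this.lastSoundAt = nowMs;  // an interruption implies the caller is speaking
  }
  onCallerSpeech(nowMs) {      // assumption: driven by a caller-side detector
    this.lastSoundAt = nowMs;
  }
  onModelAudio(nowMs) {
    this.interruptedAt = null; // model recovered, cancel the pending nudge
    this.lastSoundAt = nowMs;
  }
  // Returns 'hangup', 'nudge', or 'none'. A nudge fires once per interruption
  // and does not count as sound, so the hangup clock keeps running.
  poll(nowMs) {
    if (this.lastSoundAt !== null && nowMs - this.lastSoundAt >= this.hangupAfterMs) {
      return 'hangup';
    }
    if (this.interruptedAt !== null && nowMs - this.interruptedAt >= this.nudgeAfterMs) {
      this.interruptedAt = null;
      return 'nudge';
    }
    return 'none';
  }
}

// The nudge payload, using the realtimeInput.text field described above.
function nudgeMessage() {
  return JSON.stringify({ realtimeInput: { text: NUDGE_TEXT } });
}
```

In practice poll(Date.now()) would be called from a short setInterval; on 'nudge' the result of nudgeMessage() is sent over the WebSocket, and on 'hangup' the call is ended.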

Expected Behavior

After serverContent.interrupted fires and the caller finishes speaking, the model should process the caller's speech and generate an audio response — the same as it would for any turn transition. Interruption should not put the model into a permanently silent state.

Questions for the Team

  1. Is there a known state machine issue where interrupted prevents the model from starting a new generation?
  2. Does activityHandling: 'START_OF_ACTIVITY_INTERRUPTS' interact poorly with certain conversation patterns?
  3. Is there a recommended way to "reset" the model's turn state after an interruption?
  4. Would switching to manual VAD (activityHandling: 'NO_INTERRUPTION') during model output and back to auto afterward help avoid this?
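For context on questions 2 and 4, this is roughly where those settings would sit in the initial setup message. Treat it as a sketch under assumptions: buildSetupMessage is a hypothetical helper, and the field names (realtimeInputConfig, automaticActivityDetection.disabled, the speechConfig nesting) are our reading of the v1beta reference rather than something verified here. As far as we can tell the reference spells the non-interrupting value NO_INTERRUPTION, and manual activity detection is a separate switch from activityHandling.

```javascript
// Hypothetical helper that builds the first message sent on the socket.
// Assumption: field names follow the v1beta Live API setup message as we
// understand it; verify against the reference before relying on them.
function buildSetupMessage({
  model,
  voiceName,
  activityHandling = 'START_OF_ACTIVITY_INTERRUPTS',
  manualVad = false,
}) {
  return {
    setup: {
      model: `models/${model}`,
      generationConfig: {
        responseModalities: ['AUDIO'],
        speechConfig: {
          voiceConfig: { prebuiltVoiceConfig: { voiceName } },
        },
      },
      realtimeInputConfig: {
        activityHandling,
        // true = the client marks speech boundaries itself
        automaticActivityDetection: { disabled: manualVad },
      },
    },
  };
}

const setup = buildSetupMessage({
  model: 'gemini-3.1-flash-live-preview',
  voiceName: 'Leda',
});
```

The setup message is sent once when the session opens, so this only covers choosing the values up front. Whether they can be changed mid-session is part of what question 4 is asking.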
