[BidiGenerateContent] Multiple issues with gemini-3.1-flash-live-preview in production voice calls #1197

@Hprg

Description

Summary

We operate a production voice AI system (real estate listing coordinator) that uses gemini-3.1-flash-live-preview via the raw WebSocket v1beta BidiGenerateContent endpoint, bridged to Twilio MediaStreams for live phone calls. We've encountered four issues that affect every call.

Environment

  • Model: gemini-3.1-flash-live-preview
  • Voice: Kore
  • Transport: Raw WebSocket v1beta — wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent?key=...
  • Audio pipeline: Twilio MediaStreams (mulaw 8kHz) → PCM16 16kHz conversion → Gemini
  • VAD config: automaticActivityDetection with startOfSpeechSensitivity: START_SENSITIVITY_HIGH, endOfSpeechSensitivity: END_SENSITIVITY_HIGH, prefixPaddingMs: 20, silenceDurationMs: 500
  • System instruction size: ~14K-33K characters
  • Platform: Node.js (ws library, no SDK), production system handling real calls daily
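For reference, the first frame we send after the socket opens is a setup message shaped roughly like this. Field names follow the public v1beta Live API surface; the model, voice, and VAD values are the ones listed above, but treat the overall shape as an illustrative sketch rather than a verbatim dump of our production payload:

```javascript
// Sketch of the BidiGenerateContent setup frame (v1beta Live API).
// Sent as the first message after the WebSocket opens:
//   ws.send(JSON.stringify(setupMessage));
const setupMessage = {
  setup: {
    model: "models/gemini-3.1-flash-live-preview",
    generationConfig: {
      responseModalities: ["AUDIO"],
      speechConfig: {
        voiceConfig: { prebuiltVoiceConfig: { voiceName: "Kore" } },
      },
    },
    realtimeInputConfig: {
      automaticActivityDetection: {
        startOfSpeechSensitivity: "START_SENSITIVITY_HIGH",
        endOfSpeechSensitivity: "END_SENSITIVITY_HIGH",
        prefixPaddingMs: 20,
        silenceDurationMs: 500,
      },
    },
    // systemInstruction omitted here; ours is ~14K-33K characters,
    // which matters for Issues 1 and 4 below.
  },
};
```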

Issue 1: Stuttering/repeating on first turn (VAD self-interruption)

Behavior: When Gemini starts speaking on an outbound call, the caller's "Hello?" triggers VAD which interrupts Gemini mid-greeting. Gemini then restarts the greeting from the beginning, causing audible stuttering/repeating.

Reproduction: Occurs consistently on outbound calls when the caller picks up and says "Hello?" while Gemini is delivering its opening greeting. Correlates with large system instructions (~14K-33K chars).

Workaround: A 2.5-second audio suppression shield after kickoff, plus a recovery cue injected via realtimeInput.text when the greeting gets cut off.

Expected: First turn should have barge-in protection or the model should continue rather than restart when interrupted by brief speech.
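The suppression shield is essentially a timestamp gate on the Twilio→Gemini audio path. A minimal sketch (function names and structure are illustrative, not our production code; the 2.5s window is the value mentioned above):

```javascript
// Timestamp gate: drop inbound caller audio for a short window after the
// kickoff turn starts, so the caller's "Hello?" can't barge in on the greeting.
const SHIELD_MS = 2500; // empirically chosen suppression window

function makeAudioShield(shieldMs = SHIELD_MS) {
  let kickoffAt = null;
  return {
    // Call when the opening greeting turn is kicked off.
    onKickoff(now = Date.now()) { kickoffAt = now; },
    // Returns true if this inbound audio chunk should be forwarded to Gemini.
    shouldForward(now = Date.now()) {
      return kickoffAt === null || now - kickoffAt >= shieldMs;
    },
  };
}
```

In the Twilio media handler, each decoded chunk is forwarded only when `shield.shouldForward()` returns true; otherwise it is dropped so VAD never fires during the greeting.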


Issue 2: False language switching despite explicit system instruction

Behavior: Despite the system instruction explicitly stating "Start in English" and "ONLY switch to Spanish if the caller explicitly asks," the model switches to Spanish when it hears a Spanish-sounding name (e.g., "Jose" pronounced "Ho-ZAY") — even though the caller responded in English ("this is he").

Reproduction: Call a contact named "Jose" or similar Spanish-origin name. When they answer in English, Gemini switches to Spanish.

Workaround: Adding extremely strict language rules to the system instruction, e.g. "NEVER guess the caller's language from their name, accent, or tone."

Expected: The model should respect explicit language instructions and not infer language from names or accents.


Issue 3: outputTranscription word splitting across serverContent messages

Behavior: When the caller interrupts Gemini mid-sentence, outputTranscription splits words across multiple serverContent messages.

Example: Gemini saying "Take care." gets split into:

  • Message 1: outputTranscription: "car"
  • Message 2: outputTranscription: "e."

Also, inputTranscription and outputTranscription arrive in the same serverContent message (~10ms apart) during interruptions.

Workaround: Post-call re-transcription using Gemini 2.5 Flash on the call recording.

Expected: outputTranscription should deliver complete words. Input/output transcriptions should not arrive in the same message, or should include timestamps for speaker diarization.
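Until that changes, fragments do at least arrive in order within a turn, so live reassembly is possible as a cheaper alternative to post-call re-transcription. A sketch of a per-turn buffer (illustrative only):

```javascript
// Reassemble split outputTranscription fragments. Fragments arrive in order
// within a turn, so concatenating them and flushing on turn boundaries
// recovers whole words ("car" + "e." -> "care.").
function makeTranscriptBuffer() {
  let pending = "";
  return {
    // Call for each serverContent.outputTranscription.text fragment.
    push(fragment) { pending += fragment; },
    // Call on turnComplete (or an interruption) to get the assembled text.
    flush() {
      const text = pending;
      pending = "";
      return text;
    },
  };
}
```

This does not solve the diarization problem when inputTranscription and outputTranscription land in the same serverContent message, which is why timestamps in the API would still help.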


Issue 4: ~3 second cold-start latency on first turn

Behavior: The first outbound turn takes ~3 seconds from kickoff to TURN_COMPLETE; subsequent turns complete in ~500ms. The delay scales with system instruction size.

Impact: 3-second dead air on outbound calls. Callers say "Hello?" multiple times or hang up. Creates the window that triggers Issue 1.

Missing capability: No pre-warming mechanism for the WebSocket. Cannot send system instruction ahead of time before the call connects.

Expected: First-turn latency comparable to subsequent turns, or a pre-warming mechanism.
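Absent a native mechanism, a partial mitigation is to open the WebSocket and send the setup frame (including the large system instruction) before dialing, and hold the outbound call until the server acknowledges setup. A sketch of the gating logic (event and function names are assumptions; this overlaps connection/setup time with dialing but does not remove first-turn inference latency):

```javascript
// Gate the outbound dial on Live API setup completion: open the socket and
// send the setup frame first, then only place the Twilio call once the
// server has acknowledged setup, so setup time is hidden behind dialing.
function makePrewarmGate() {
  let ready = false;
  const waiters = [];
  return {
    isReady() { return ready; },
    // Call when the server's setup acknowledgement message arrives.
    onSetupComplete() {
      ready = true;
      while (waiters.length) waiters.shift()();
    },
    // Resolves immediately if setup is done, otherwise waits for it.
    whenReady() {
      return ready ? Promise.resolve() : new Promise((res) => waiters.push(res));
    },
  };
}

// Usage: open ws, send setup, then `await gate.whenReady()` before dialing.
```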


Production Impact

This system handles real estate seller leads via live phone calls daily. These issues affect real customer interactions:

  • Issue 1 (stuttering) makes the AI sound broken on first impression
  • Issue 2 (language switching) confuses English-speaking callers
  • Issue 3 (transcript splitting) degrades call records and analytics
  • Issue 4 (cold-start latency) causes awkward silences that lose caller engagement

We have workarounds for all four, but native fixes would substantially improve the developer experience for voice applications.

Happy to provide audio samples, transcript logs, or WebSocket message dumps if helpful.
