Summary
We operate a production voice AI system (real estate listing coordinator) that uses gemini-3.1-flash-live-preview via the raw WebSocket v1beta BidiGenerateContent endpoint, bridged to Twilio MediaStreams for live phone calls. We've encountered four issues that affect every call.
Environment
- Model: gemini-3.1-flash-live-preview
- Voice: Kore
- Transport: Raw WebSocket v1beta, wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent?key=...
- Audio pipeline: Twilio MediaStreams (mulaw 8kHz) → PCM16 16kHz conversion → Gemini
- VAD config: automaticActivityDetection with startOfSpeechSensitivity: START_SENSITIVITY_HIGH, endOfSpeechSensitivity: END_SENSITIVITY_HIGH, prefixPaddingMs: 20, silenceDurationMs: 500
- System instruction size: ~14K-33K characters
- Platform: Node.js (ws library, no SDK), production system handling real calls daily
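For readers who want to reproduce the audio pipeline listed above, here is a minimal sketch of the mulaw 8kHz to PCM16 16kHz step. It is not our production code. The function names are illustrative, and the 2x linear interpolation is an assumption chosen for brevity (a proper resampler would apply a low-pass filter). The mu-law decode follows the standard G.711 expansion.

```javascript
// Decode one 8-bit G.711 mu-law byte into a signed 16-bit linear PCM sample.
function muLawDecodeSample(byte) {
  const u = ~byte & 0xff; // mu-law bytes are stored bit-inverted
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return (u & 0x80) ? -magnitude : magnitude;
}

// Decode a buffer of mu-law bytes into PCM16 samples (sample rate unchanged).
function muLawToPcm16(muLawBytes) {
  const pcm = new Int16Array(muLawBytes.length);
  for (let i = 0; i < muLawBytes.length; i++) {
    pcm[i] = muLawDecodeSample(muLawBytes[i]);
  }
  return pcm;
}

// Double the sample rate (8 kHz -> 16 kHz) by linear interpolation.
// Assumption: simple interpolation is acceptable for telephone-band speech.
function upsample2x(pcm) {
  const out = new Int16Array(pcm.length * 2);
  for (let i = 0; i < pcm.length; i++) {
    const next = i + 1 < pcm.length ? pcm[i + 1] : pcm[i];
    out[2 * i] = pcm[i];
    out[2 * i + 1] = Math.round((pcm[i] + next) / 2);
  }
  return out;
}

// Convert a base64 mu-law payload (as carried in Twilio media events)
// into base64 little-endian PCM16 at 16 kHz.
function twilioPayloadToPcm16k(base64MuLaw) {
  const pcm16k = upsample2x(muLawToPcm16(Buffer.from(base64MuLaw, "base64")));
  const out = Buffer.alloc(pcm16k.length * 2);
  for (let i = 0; i < pcm16k.length; i++) {
    out.writeInt16LE(pcm16k[i], i * 2);
  }
  return out.toString("base64");
}
```

Each 20ms Twilio frame (160 mu-law bytes) becomes 320 PCM16 samples, or 640 bytes, before base64 encoding.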
Issue 1: Stuttering/repeating on first turn (VAD self-interruption)
Behavior: When Gemini starts speaking on an outbound call, the caller's "Hello?" triggers VAD, which interrupts Gemini mid-greeting. Gemini then restarts the greeting from the beginning, causing audible stuttering and repetition.
Reproduction: Occurs consistently on outbound calls when the caller picks up and says "Hello?" while Gemini is delivering its opening greeting. Correlates with large system instructions (~14K-33K chars).
Workaround: 2.5-second audio suppression shield after kickoff + recovery cue injection via realtimeInput.text.
Expected: First turn should have barge-in protection or the model should continue rather than restart when interrupted by brief speech.
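The workaround above can be sketched roughly as follows. This is a simplified illustration, not our exact implementation. It assumes "suppression" means not forwarding caller audio to Gemini during the shield window. The names createBargeInShield and recoveryCueMessage are hypothetical, and the cue wording is left to the caller. The only wire detail taken from this report is that the cue goes out as realtimeInput.text.

```javascript
// Gate that blocks caller audio for a fixed window after the kickoff.
// `now` is injectable so the gate can be tested without real timers.
function createBargeInShield({ shieldMs = 2500, now = Date.now } = {}) {
  let kickoffAt = null;
  return {
    // Call when the kickoff (first-turn trigger) is sent to the model.
    markKickoff() {
      kickoffAt = now();
    },
    // True when a caller audio chunk may be forwarded to the model.
    shouldForward() {
      return kickoffAt === null || now() - kickoffAt >= shieldMs;
    },
  };
}

// Serialize a recovery cue as a realtimeInput.text message.
// The cue text itself is an application-level choice (hypothetical here).
function recoveryCueMessage(cueText) {
  return JSON.stringify({ realtimeInput: { text: cueText } });
}
```

In the Twilio media handler, each inbound chunk would be dropped while shouldForward() returns false. The trade-off is that anything the caller says during the window never reaches the model, which is why the recovery cue is needed afterward.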
Issue 2: False language switching despite explicit system instruction
Behavior: Despite the system instruction explicitly stating "Start in English" and "ONLY switch to Spanish if the caller explicitly asks," the model switches to Spanish when it hears a Spanish-sounding name (e.g., "Jose" pronounced "Ho-ZAY"), even though the caller responded in English ("this is he").
Reproduction: Call a contact named "Jose" or similar Spanish-origin name. When they answer in English, Gemini switches to Spanish.
Workaround: Extremely strict language rules: "NEVER guess the caller's language from their name, accent, or tone."
Expected: The model should respect explicit language instructions and not infer language from names or accents.
Issue 3: outputTranscription word splitting across serverContent messages
Behavior: When the caller interrupts Gemini mid-sentence, outputTranscription splits words across multiple serverContent messages.
Example: Gemini saying "Take care." gets split into:
- Message 1: outputTranscription: "car"
- Message 2: outputTranscription: "e."
Also, inputTranscription and outputTranscription arrive in the same serverContent message (~10ms apart) during interruptions.
Workaround: Post-call re-transcription using Gemini 2.5 Flash on the call recording.
Expected: outputTranscription should deliver complete words. Input/output transcriptions should not arrive in the same message, or should include timestamps for speaker diarization.
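To illustrate the kind of client-side stitching the split fragments force on consumers, here is a minimal sketch that concatenates fragments per speaker without inserting separators, so "car" + "e." rejoins as "care." The function name and the flush-on-turnComplete policy are assumptions. The code accepts a fragment either as a bare string (as abbreviated in the example above) or as an object with a text field. When both transcriptions arrive in one message, processing the output fragment first is an arbitrary choice, since nothing in the message tells us the true order.

```javascript
// Reassemble transcription fragments into per-speaker lines.
function createTranscriptAssembler() {
  const buffers = { agent: "", caller: "" };
  let order = []; // speakers in the order they started within the current turn
  const lines = [];

  // Accept either a bare string or an object carrying a `text` field.
  const textOf = (t) => (typeof t === "string" ? t : (t && t.text) || "");

  function append(speaker, fragment) {
    if (!fragment) return;
    if (!order.includes(speaker)) order.push(speaker);
    buffers[speaker] += fragment; // raw concatenation: no added spaces
  }

  function flush() {
    for (const speaker of order) {
      const text = buffers[speaker].trim();
      if (text) lines.push({ speaker, text });
      buffers[speaker] = "";
    }
    order = [];
  }

  return {
    // Feed the serverContent object of each incoming message.
    push(serverContent) {
      append("agent", textOf(serverContent.outputTranscription));
      append("caller", textOf(serverContent.inputTranscription));
      if (serverContent.turnComplete) flush();
    },
    // Flush anything pending and return all assembled lines.
    finish() {
      flush();
      return lines;
    },
  };
}
```

This only repairs word boundaries. Without timestamps it cannot recover the true interleaving of caller and agent speech during an interruption, which is why we still re-transcribe from the recording.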
Issue 4: ~3 second cold-start latency on first turn
Behavior: The first outbound turn takes ~3 seconds from kickoff to TURN_COMPLETE. Subsequent turns are fast (~500ms). First-turn latency scales with system instruction size.
Impact: 3-second dead air on outbound calls. Callers say "Hello?" multiple times or hang up. Creates the window that triggers Issue 1.
Missing capability: There is no pre-warming mechanism for the WebSocket session, and we cannot send the system instruction before the call connects.
Expected: First-turn latency comparable to subsequent turns, or a pre-warming mechanism.
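For anyone trying to reproduce the measurement, a timer along these lines can record first-turn latency. It is a sketch, not our exact instrumentation. createTurnTimer is a hypothetical name, and the code assumes parsed server messages expose serverContent.turnComplete and audio parts under serverContent.modelTurn.parts[].inlineData. It records time to first audio as well as time to turn completion, since the former is what the caller perceives as dead air.

```javascript
// Measure elapsed time from kickoff to first audio and to turn completion.
// `now` is injectable so the timer can be tested deterministically.
function createTurnTimer(now = Date.now) {
  let kickoffAt = null;
  const timings = { firstAudioMs: null, turnCompleteMs: null };

  return {
    // Call when the kickoff for the first turn is sent.
    markKickoff() {
      kickoffAt = now();
    },
    // Feed every parsed server message; only the first occurrences are recorded.
    onServerMessage(message) {
      const content = message.serverContent;
      if (kickoffAt === null || !content) return;
      const parts = (content.modelTurn && content.modelTurn.parts) || [];
      if (timings.firstAudioMs === null && parts.some((p) => p.inlineData)) {
        timings.firstAudioMs = now() - kickoffAt;
      }
      if (timings.turnCompleteMs === null && content.turnComplete) {
        timings.turnCompleteMs = now() - kickoffAt;
      }
    },
    // Return a copy of the recorded timings.
    snapshot() {
      return { ...timings };
    },
  };
}
```

Logging snapshot() alongside the system instruction length for each call makes it straightforward to check how latency varies with instruction size.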
Production Impact
This system handles real estate seller leads via live phone calls daily. These issues affect real customer interactions:
- Issue 1 (stuttering) makes the AI sound broken on first impression
- Issue 2 (language switching) confuses English-speaking callers
- Issue 3 (transcript splitting) degrades call records and analytics
- Issue 4 (cold-start latency) causes awkward silences that lose caller engagement
We have workarounds for all four, but native fixes would substantially improve the developer experience for voice applications.
Happy to provide audio samples, transcript logs, or WebSocket message dumps if helpful.