[BidiGenerateContent] Multiple issues with gemini-3.1-flash-live-preview in production voice calls

## Summary

We operate a production voice AI system (real estate listing coordinator) that uses `gemini-3.1-flash-live-preview` via the raw WebSocket v1beta `BidiGenerateContent` endpoint, bridged to Twilio MediaStreams for live phone calls. We've encountered four issues that affect every call.

## Environment

- **Model:** `gemini-3.1-flash-live-preview`
- **Voice:** `Kore`
- **Transport:** Raw WebSocket v1beta — `wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent?key=...`
- **Audio pipeline:** Twilio MediaStreams (mulaw 8kHz) → PCM16 16kHz conversion → Gemini
- **VAD config:** `automaticActivityDetection` with `startOfSpeechSensitivity: START_SENSITIVITY_HIGH`, `endOfSpeechSensitivity: END_SENSITIVITY_HIGH`, `prefixPaddingMs: 20`, `silenceDurationMs: 500`
- **System instruction size:** ~14K-33K characters
- **Platform:** Node.js (`ws` library, no SDK), production system handling real calls daily
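
For reproduction purposes, a setup message matching the configuration above might look roughly like the sketch below. This is not copied from our production code: the helper name `buildSetupMessage` is hypothetical, and the nesting of fields (`realtimeInputConfig`, `speechConfig`, the transcription toggles) reflects our understanding of the v1beta setup message and should be treated as an assumption.

```javascript
// Hypothetical helper: builds the first JSON message sent after the WebSocket opens.
// Field nesting is an assumption about the v1beta setup message shape.
function buildSetupMessage(systemInstructionText) {
  return {
    setup: {
      model: "models/gemini-3.1-flash-live-preview",
      generationConfig: {
        responseModalities: ["AUDIO"],
        speechConfig: {
          voiceConfig: { prebuiltVoiceConfig: { voiceName: "Kore" } },
        },
      },
      systemInstruction: { parts: [{ text: systemInstructionText }] },
      realtimeInputConfig: {
        // VAD values taken from the Environment list above.
        automaticActivityDetection: {
          startOfSpeechSensitivity: "START_SENSITIVITY_HIGH",
          endOfSpeechSensitivity: "END_SENSITIVITY_HIGH",
          prefixPaddingMs: 20,
          silenceDurationMs: 500,
        },
      },
      // Empty objects are assumed to enable input/output transcription.
      inputAudioTranscription: {},
      outputAudioTranscription: {},
    },
  };
}

// Placeholder instruction text; the real one is ~14K-33K characters.
const setupJson = JSON.stringify(
  buildSetupMessage("You are a listing coordinator. Start in English.")
);
```

With the `ws` library, `setupJson` would be passed to `socket.send(...)` once the connection opens.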

---

## Issue 1: Stuttering/repeating on first turn (VAD self-interruption)

**Behavior:** When Gemini starts speaking on an outbound call, the caller's "Hello?" triggers VAD, which interrupts Gemini mid-greeting. Gemini then restarts the greeting from the beginning, so the caller hears audible stuttering and repetition.

**Reproduction:** Occurs consistently on outbound calls when the caller picks up and says "Hello?" while Gemini is delivering its opening greeting. Correlates with large system instructions (~14K-33K chars).

**Workaround:** We suppress caller audio for 2.5 seconds after kickoff (an "audio suppression shield") and inject a recovery cue via `realtimeInput.text`.
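
The suppression part of this workaround can be sketched as a small time gate in front of the Gemini socket. The sketch is illustrative and is not our production code: `createAudioShield`, `markKickoff`, and `shouldForward` are hypothetical names, and forwarding audio before any kickoff has been marked is an assumption.

```javascript
// Hypothetical gate: drops caller audio for a fixed window after kickoff.
// `now` is injectable so the gate can be tested without real timers.
function createAudioShield(shieldMs = 2500, now = Date.now) {
  let kickoffAt = null;
  return {
    // Call when the kickoff message is sent to Gemini.
    markKickoff() {
      kickoffAt = now();
    },
    // Returns true when a caller audio frame may be forwarded to Gemini.
    shouldForward() {
      if (kickoffAt === null) return true; // assumption: no shield before kickoff
      return now() - kickoffAt >= shieldMs;
    },
  };
}

// Demonstration with a fake clock.
let fakeTime = 0;
const shield = createAudioShield(2500, () => fakeTime);
shield.markKickoff();
fakeTime = 1000;
const duringShield = shield.shouldForward(); // inside the 2.5 s window
fakeTime = 2500;
const afterShield = shield.shouldForward(); // window has elapsed
```

In the Twilio media handler, each inbound frame would be checked with `shield.shouldForward()` before being converted and sent on.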

**Expected:** The first turn should have barge-in protection, or the model should continue rather than restart when interrupted by brief speech.

---

## Issue 2: False language switching despite explicit system instruction

**Behavior:** Despite the system instruction explicitly stating "Start in English" and "ONLY switch to Spanish if the caller explicitly asks," the model switches to Spanish when it hears a Spanish-sounding name (e.g., "Jose" pronounced "Ho-ZAY") — even though the caller responded in English ("this is he").

**Reproduction:** Call a contact named "Jose" or similar Spanish-origin name. When they answer in English, Gemini switches to Spanish.

**Workaround:** Extremely strict language rules: "NEVER guess the caller's language from their name, accent, or tone."

**Expected:** The model should respect explicit language instructions and not infer language from names or accents.

---

## Issue 3: `outputTranscription` word splitting across `serverContent` messages

**Behavior:** When the caller interrupts Gemini mid-sentence, `outputTranscription` splits words across multiple `serverContent` messages.

**Example:** Gemini saying "Take care." gets split into:
- Message 1: `outputTranscription: "car"`
- Message 2: `outputTranscription: "e."`

Also, `inputTranscription` and `outputTranscription` arrive in the same `serverContent` message (~10ms apart) during interruptions.

**Workaround:** Post-call re-transcription using Gemini 2.5 Flash on the call recording.

**Expected:** `outputTranscription` should deliver complete words. Input/output transcriptions should not arrive in the same message, or should include timestamps for speaker diarization.
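
To illustrate the splitting, here is a minimal client-side sketch that concatenates `outputTranscription` fragments per turn instead of treating each message as a complete word. It is not the workaround we use (that is the post-call re-transcription described above). It assumes each fragment arrives as `serverContent.outputTranscription.text`, that a `turnComplete` message follows an interrupted turn, and that fragments carry their own whitespace; `createTranscriptAssembler` is a hypothetical name.

```javascript
// Hypothetical assembler: joins transcription fragments until the turn ends.
function createTranscriptAssembler() {
  let pending = "";
  let sawInterruption = false;
  const turns = [];
  return {
    turns,
    handleServerContent(serverContent) {
      const fragment = serverContent.outputTranscription?.text;
      // Append verbatim: inserting a space here would break split words.
      if (typeof fragment === "string") pending += fragment;
      if (serverContent.interrupted) sawInterruption = true;
      if (serverContent.turnComplete) {
        if (pending.length > 0) {
          turns.push({ text: pending.trim(), interrupted: sawInterruption });
        }
        pending = "";
        sawInterruption = false;
      }
    },
  };
}

// Replays a split like the "Take care." example above.
const assembler = createTranscriptAssembler();
assembler.handleServerContent({ outputTranscription: { text: "Take" } });
assembler.handleServerContent({ outputTranscription: { text: " car" } });
assembler.handleServerContent({ outputTranscription: { text: "e." }, interrupted: true });
assembler.handleServerContent({ turnComplete: true });
const assembledTurn = assembler.turns[0];
```

This only repairs word boundaries within a turn. It does not solve the ordering of input versus output transcriptions, which is why timestamps would still help.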

---

## Issue 4: ~3 second cold-start latency on first turn

**Behavior:** The first outbound turn takes ~3 seconds from kickoff to `TURN_COMPLETE`. Subsequent turns are fast (~500ms). The first-turn delay scales with system instruction size.

**Impact:** About 3 seconds of dead air on outbound calls. Callers say "Hello?" multiple times or hang up. This silence creates the window that triggers Issue 1.

**Missing capability:** There is no pre-warming mechanism for the WebSocket. We cannot send the system instruction before the call connects.

**Expected:** First-turn latency comparable to subsequent turns, or a pre-warming mechanism.
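
For anyone trying to reproduce the numbers, the first-turn latency can be measured with a small timer like the sketch below. It is illustrative only: `createFirstTurnTimer` is a hypothetical name, the extra "time to first audio" measurement is our addition beyond the kickoff-to-`TURN_COMPLETE` figure quoted above, and the `modelTurn.parts[].inlineData` shape for audio chunks is an assumption about the server message format.

```javascript
// Hypothetical timer: records latency from kickoff to first audio and to turnComplete.
// `now` is injectable so the timer can be tested without real delays.
function createFirstTurnTimer(now = Date.now) {
  let kickoffAt = null;
  let firstAudioAt = null;
  let turnCompleteAt = null;
  return {
    markKickoff() {
      kickoffAt = now();
    },
    handleServerContent(serverContent) {
      if (kickoffAt === null) return; // ignore messages before kickoff
      const parts = serverContent.modelTurn?.parts ?? [];
      const hasAudio = parts.some((part) => Boolean(part.inlineData));
      if (hasAudio && firstAudioAt === null) firstAudioAt = now();
      if (serverContent.turnComplete && turnCompleteAt === null) {
        turnCompleteAt = now();
      }
    },
    report() {
      return {
        msToFirstAudio: firstAudioAt === null ? null : firstAudioAt - kickoffAt,
        msToTurnComplete: turnCompleteAt === null ? null : turnCompleteAt - kickoffAt,
      };
    },
  };
}

// Demonstration with a fake clock and made-up timings.
let timerClock = 0;
const firstTurnTimer = createFirstTurnTimer(() => timerClock);
firstTurnTimer.markKickoff();
timerClock = 2900;
firstTurnTimer.handleServerContent({
  modelTurn: { parts: [{ inlineData: { data: "" } }] },
});
timerClock = 3000;
firstTurnTimer.handleServerContent({ turnComplete: true });
const firstTurnReport = firstTurnTimer.report();
```

Logging `report()` alongside the system instruction length on each call would show how the delay scales with instruction size.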

---

## Production Impact

This system handles real estate seller leads via live phone calls daily. These issues affect real customer interactions:

- **Issue 1** (stuttering) makes the AI sound broken on first impression
- **Issue 2** (language switching) confuses English-speaking callers
- **Issue 3** (transcript splitting) degrades call records and analytics
- **Issue 4** (cold-start latency) causes awkward silences that lose caller engagement

We have workarounds for all four, but native fixes would substantially improve the developer experience for voice applications.

Happy to provide audio samples, transcript logs, or WebSocket message dumps if helpful.
