[Bug]: [Gemini TTS] Streaming Returns LINEAR16 PCM When OGG_OPUS is Requested #2480

@IPROTAGON1ST

File Name

N/A

What happened?

When using Google Cloud Text-to-Speech API's streaming synthesis with OGG_OPUS audio encoding, Chirp3-HD and Gemini TTS models behave completely differently:

  • Chirp3-HD: Returns proper OGG container format with OPUS codec (as requested)
  • Gemini TTS: Returns LINEAR16 PCM instead, ignoring the encoding parameter

This is a critical bug where Gemini TTS silently ignores the requested audio encoding and returns uncompressed PCM data instead of the requested compressed OGG_OPUS format.

Test Results

Chirp3-HD (Working as Expected)

Packets: 4
Total size: 11,046 bytes
First packet: Starts with 'OggS' ✅
Contains 'OpusHead': Yes ✅

Hex dump (first 20 bytes):

4f67675300020000000000000000000000000000
^OggS (proper OGG magic number)
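
As a quick cross-check, the first six bytes of the dump above can be unpacked according to the Ogg page header layout (capture pattern, stream structure version, header-type flags). This is only a sketch; the helper name `parse_ogg_prefix` is made up for illustration and is not part of the test script.

```python
import struct

def parse_ogg_prefix(data: bytes) -> dict:
    """Unpack the first 6 bytes of an Ogg page header."""
    if len(data) < 6:
        raise ValueError("need at least 6 bytes")
    # 4-byte capture pattern, 1-byte version, 1-byte header-type flags
    magic, version, flags = struct.unpack_from("<4sBB", data, 0)
    return {
        "is_ogg": magic == b"OggS",
        "version": version,
        "bos": bool(flags & 0x02),  # 0x02 marks the first page of a logical stream
    }

# First 20 bytes of the Chirp3-HD output, copied from the hex dump
chirp_prefix = bytes.fromhex("4f67675300020000000000000000000000000000")
print(parse_ogg_prefix(chirp_prefix))
# {'is_ogg': True, 'version': 0, 'bos': True}
```

The beginning-of-stream flag being set is what one would expect from the first page of a fresh Ogg stream.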

Gemini TTS (Issue)

Packets: 121
Total size: 231,914 bytes
First packet: Does not start with 'OggS' ❌
Contains 'OpusHead': No ❌
Packet size: 1920 bytes (consistent)

Hex dump (first 20 bytes):

08000b00fcffe9ffdfffdbffd2ffc6ffc7ffdfff
(This is LINEAR16 PCM audio data!)
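
To illustrate why these bytes look like PCM, they can be decoded as signed 16-bit little-endian samples. The small values near zero are consistent with the quiet onset of an audio signal, although that alone is not proof of the format. A minimal sketch using only the bytes shown above:

```python
import struct

# First 20 bytes of the Gemini TTS output, copied from the hex dump
gemini_prefix = bytes.fromhex("08000b00fcffe9ffdfffdbffd2ffc6ffc7ffdfff")

# Interpret as signed 16-bit little-endian samples (2 bytes each)
samples = struct.unpack(f"<{len(gemini_prefix) // 2}h", gemini_prefix)
print(samples)
# (8, 11, -4, -23, -33, -37, -46, -58, -57, -33)

print(gemini_prefix.startswith(b"OggS"))
# False
```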

Analysis:

  • 1920 bytes per packet = 960 samples × 2 bytes (16-bit)
  • 960 samples at 24kHz = 40ms of audio
  • Data plays correctly as PCM: ffplay -f s16le -ar 24000 gemini_output.raw
  • This is LINEAR16 PCM, NOT OGG_OPUS as requested!
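
The arithmetic above can be reproduced with a small helper. It assumes mono audio (one channel), which is what the 960 samples × 2 bytes figure implies; the function name is hypothetical.

```python
def pcm_duration_ms(num_bytes, sample_rate_hz=24000, bytes_per_sample=2, channels=1):
    """Duration in milliseconds of a raw PCM buffer (mono is an assumption)."""
    samples = num_bytes // (bytes_per_sample * channels)
    return samples * 1000 / sample_rate_hz

print(pcm_duration_ms(1920))
# 40.0

# Total Gemini TTS payload from the test results, in seconds
print(round(pcm_duration_ms(231_914) / 1000, 2))
# 4.83
```

If the payload really is 24 kHz mono LINEAR16, the full response would be about 4.8 seconds of audio.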

Expected Behavior

When requesting AudioEncoding.OGG_OPUS, the API should return audio data wrapped in an OGG container format with OPUS codec, regardless of which TTS model is used.

Actual Behavior

Gemini TTS models completely ignore the audio encoding parameter and return LINEAR16 PCM data instead. The API silently substitutes a different format without any warning or error.

Configuration Used

voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    name="Kore",
    model_name="gemini-2.5-flash-preview-tts",
)

streaming_config = texttospeech.StreamingSynthesizeConfig(
    voice=voice,
    streaming_audio_config=texttospeech.StreamingAudioConfig(
        audio_encoding=texttospeech.AudioEncoding.OGG_OPUS,
        sample_rate_hertz=24000,
    ),
)
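
For context, a configuration like this would typically be exercised roughly as follows. This is a sketch, not the contents of test_chirp3_vs_gemini.py: `collect_chunks` and `stream_tts` are hypothetical helpers, and the request flow (one config-only request followed by input requests) follows the client library's documented streaming pattern. The client import is deferred so the offline demo at the bottom runs without credentials.

```python
from types import SimpleNamespace

def collect_chunks(responses):
    """Collect the non-empty audio_content payloads from streaming responses."""
    return [r.audio_content for r in responses if r.audio_content]

def stream_tts(text, streaming_config):
    """Send one text input over a streaming session and return the raw chunks."""
    # Imported lazily so collect_chunks can be used without the client library.
    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()

    def requests():
        # The first request carries only the config; later ones carry input.
        yield texttospeech.StreamingSynthesizeRequest(streaming_config=streaming_config)
        yield texttospeech.StreamingSynthesizeRequest(
            input=texttospeech.StreamingSynthesisInput(text=text)
        )

    return collect_chunks(client.streaming_synthesize(requests()))

# Offline demo with stand-in response objects (no API call is made here)
fake_responses = [
    SimpleNamespace(audio_content=b"OggS\x00\x02"),
    SimpleNamespace(audio_content=b""),
    SimpleNamespace(audio_content=b"\x01\x02"),
]
print(len(collect_chunks(fake_responses)))
# 2
```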

Impact

  1. Silent format mismatch: Application expects OGG_OPUS but receives LINEAR16 PCM
  2. Bandwidth waste: LINEAR16 PCM is ~5-10x larger than OPUS compressed audio
  3. Breaking compatibility: Cannot integrate with systems expecting OGG_OPUS format
  4. Model-specific code: Requires completely different handling for Gemini vs Chirp3
  5. No error or warning: API silently returns wrong format without indication
  6. Cost implications: Larger data transfers increase bandwidth costs and latency
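
The bandwidth estimate in item 2 can be sanity-checked with simple arithmetic. The PCM side follows from the sample rate and bit depth (mono assumed); the Opus bitrate is not reported anywhere in this issue, so the 48 kbps used below is purely an illustrative assumption.

```python
def pcm_bitrate_bps(sample_rate_hz=24000, bits_per_sample=16, channels=1):
    """Raw bitrate of uncompressed PCM (mono is an assumption)."""
    return sample_rate_hz * bits_per_sample * channels

def size_ratio(opus_bitrate_bps, **pcm_kwargs):
    """How many times larger the PCM stream is than an Opus stream at the given bitrate."""
    return pcm_bitrate_bps(**pcm_kwargs) / opus_bitrate_bps

print(pcm_bitrate_bps())
# 384000

# 48 kbps is an assumed Opus bitrate, not a measured value
print(size_ratio(48_000))
# 8.0
```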

Packet Size Analysis

| Model | Packet Count | Avg Packet Size | Packet Structure | Actual Format |
|---|---|---|---|---|
| Chirp3-HD | 4 | 2,761 bytes | Variable (proper OGG pages) | OGG_OPUS ✅ |
| Gemini TTS | 121 | 1,917 bytes | Fixed 1920 bytes (PCM frames) | LINEAR16 ❌ |

The fixed 1920-byte packets from Gemini TTS equal exactly 40ms of LINEAR16 PCM audio (960 samples × 2 bytes at 24kHz). Chirp3-HD's variable packet sizes indicate proper OGG page segmentation with OPUS compression.
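
A summary like the one in the table can be computed from the per-packet sizes. The individual sizes are not listed in this report, so the list below is a hypothetical reconstruction: 120 full 1920-byte packets plus one shorter final packet, chosen so the count and total match the reported 121 packets and 231,914 bytes. It would also explain why the average (1,917) is slightly below 1920.

```python
from collections import Counter

def summarize_packets(sizes):
    """Return count, total, rounded mean, and the most common packet size."""
    dominant_size, dominant_count = Counter(sizes).most_common(1)[0]
    return {
        "packets": len(sizes),
        "total_bytes": sum(sizes),
        "avg_bytes": round(sum(sizes) / len(sizes)),
        "dominant_size": dominant_size,
        "dominant_share": dominant_count / len(sizes),
    }

# Hypothetical reconstruction: only the count and total come from the report
gemini_sizes = [1920] * 120 + [1514]
summary = summarize_packets(gemini_sizes)
print(summary["packets"], summary["total_bytes"], summary["avg_bytes"], summary["dominant_size"])
# 121 231914 1917 1920
```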

Reproduction Steps

  1. Use the Google Cloud Text-to-Speech API with streaming synthesis
  2. Configure AudioEncoding.OGG_OPUS with 24kHz sample rate
  3. Compare output from en-US-Chirp3-HD-Charon vs gemini-2.5-flash-preview-tts
  4. Inspect the first packet's magic bytes

Test script: test_chirp3_vs_gemini.py (included in this repository)

To reproduce:

python test_chirp3_vs_gemini.py

This will generate two files:

  • test_chirp3_streaming.ogg (proper OGG_OPUS, playable with ffplay) ✅
  • test_gemini_streaming.ogg (actually LINEAR16 PCM, play with ffplay -f s16le -ar 24000) ❌
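
As a stopgap, the mislabeled Gemini output can be wrapped in a WAV header so ordinary players open it without extra flags. This sketch uses the standard-library `wave` module and assumes 24 kHz, 16-bit, mono, matching the ffplay command above; `pcm_to_wav_bytes` is a made-up helper name.

```python
import io
import wave

def pcm_to_wav_bytes(pcm, sample_rate_hz=24000, channels=1, sample_width=2):
    """Wrap raw little-endian PCM in a WAV container (parameters are assumptions)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate_hz)
        wav.writeframes(pcm)
    return buf.getvalue()

# Demo on one silent 40 ms frame (1920 bytes of zeros)
wav_bytes = pcm_to_wav_bytes(bytes(1920))
print(wav_bytes[:4], wav_bytes[8:12])
# b'RIFF' b'WAVE'
```

In practice the input would be the bytes read from test_gemini_streaming.ogg, and the result would be written to a `.wav` file.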

Environment

  • Library: google-cloud-texttospeech
  • API: Google Cloud Text-to-Speech Streaming API
  • Models tested:
    • en-US-Chirp3-HD-Charon
    • gemini-2.5-flash-preview-tts

  • Python: 3.13.7
  • Date: November 10, 2025

Additional Findings

  • Gemini TTS rejects LINEAR16 when it is requested explicitly: AudioEncoding.LINEAR16 returns the error 400 Unsupported audio encoding
  • Yet it returns LINEAR16 anyway: Despite rejecting that encoding, LINEAR16 PCM is what you actually get when requesting OGG_OPUS
  • No documentation: The API documentation does not state which encodings Gemini TTS supports
  • Silent failure: No error, warning, or indication that the requested format is unavailable

Request

Please fix Gemini TTS to either:

  1. Preferred: Support OGG_OPUS encoding properly (return OGG containers with OPUS codec), consistent with Chirp3-HD
  2. Alternative: Return a clear error when OGG_OPUS is requested if not supported, rather than silently returning a different format
  3. Documentation: Clearly document which audio encodings are supported by each model
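
Until one of these is addressed, callers can at least fail loudly on the client side. The sketch below wraps any iterable of audio chunks and raises if the first non-empty chunk does not begin with the Ogg capture pattern. It assumes the stream starts on a page boundary, as observed for Chirp3-HD above; `require_ogg` is a hypothetical name.

```python
def require_ogg(chunks):
    """Yield chunks unchanged, raising if the stream does not begin with 'OggS'."""
    checked = False
    for chunk in chunks:
        if not checked and chunk:
            if not chunk.startswith(b"OggS"):
                raise ValueError(
                    f"expected Ogg stream, got leading bytes {chunk[:4].hex()}"
                )
            checked = True
        yield chunk

# A stream that begins like the Chirp3-HD output passes through untouched
good = list(require_ogg([b"OggS\x00\x02", b"\x01\x02"]))
print(len(good))
# 2

# A stream that begins like the Gemini TTS output is rejected
try:
    list(require_ogg([bytes.fromhex("08000b00fcff")]))
except ValueError as exc:
    print(exc)
# expected Ogg stream, got leading bytes 08000b00
```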

Additional Notes

  • The issue affects real-time streaming scenarios where bandwidth and format compliance are critical
  • Standards compliance: RFC 7845 - Ogg Encapsulation for the Opus Audio Codec
  • Pipelines that consume Opus (for example, ones that demux OGG_OPUS and forward the Opus packets over WebRTC) need actual Opus-encoded data, not raw PCM
  • LINEAR16 PCM uses ~5-10x more bandwidth than OPUS for the same audio quality
