Description
File Name
N/A
What happened?
When using Google Cloud Text-to-Speech API's streaming synthesis with OGG_OPUS audio encoding, Chirp3-HD and Gemini TTS models behave completely differently:
- ✅ Chirp3-HD: Returns proper OGG container format with OPUS codec (as requested)
- ❌ Gemini TTS: Returns LINEAR16 PCM instead, ignoring the encoding parameter
This is a critical bug where Gemini TTS silently ignores the requested audio encoding and returns uncompressed PCM data instead of the requested compressed OGG_OPUS format.
Test Results
Chirp3-HD (Working as Expected)
Packets: 4
Total size: 11,046 bytes
First packet: Starts with 'OggS' ✅
Contains 'OpusHead': Yes ✅
Hex dump (first 20 bytes):
4f67675300020000000000000000000000000000
^OggS (proper OGG magic number)
Gemini TTS (Issue)
Packets: 121
Total size: 231,914 bytes
First packet: Does not start with 'OggS' ❌
Contains 'OpusHead': No ❌
Packet size: 1920 bytes (consistent)
Hex dump (first 20 bytes):
08000b00fcffe9ffdfffdbffd2ffc6ffc7ffdfff
(This is LINEAR16 PCM audio data!)
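The two hex dumps above can be checked offline. The following is a minimal sketch, not part of the original test script; the helper names `sniff_first_packet` and `as_int16_le` are made up for illustration. It checks for the 'OggS' capture pattern and reinterprets the Gemini bytes as signed 16-bit little-endian samples.

```python
import struct

# First 20 bytes of each stream, copied from the hex dumps in this report.
CHIRP_HEX = "4f67675300020000000000000000000000000000"
GEMINI_HEX = "08000b00fcffe9ffdfffdbffd2ffc6ffc7ffdfff"


def sniff_first_packet(data: bytes) -> str:
    """Hypothetical helper: report whether the data begins with an Ogg page."""
    if data[:4] == b"OggS":
        return "ogg"
    return "not-ogg"


def as_int16_le(data: bytes) -> list:
    """Interpret the bytes as signed 16-bit little-endian samples."""
    count = len(data) // 2
    return list(struct.unpack(f"<{count}h", data[: count * 2]))


chirp = bytes.fromhex(CHIRP_HEX)
gemini = bytes.fromhex(GEMINI_HEX)

print(sniff_first_packet(chirp))   # ogg
print(sniff_first_packet(gemini))  # not-ogg
print(as_int16_le(gemini))         # [8, 11, -4, -23, -33, -37, -46, -58, -57, -33]
```

Small, slowly varying sample values like these are what the start of a quiet PCM waveform tends to look like, which is consistent with (though not proof of) the LINEAR16 interpretation.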
Analysis:
- 1920 bytes per packet = 960 samples × 2 bytes (16-bit)
- 960 samples at 24kHz = 40ms of audio
- Data plays correctly as PCM: ffplay -f s16le -ar 24000 gemini_output.raw ✅
- This is LINEAR16 PCM, NOT OGG_OPUS as requested!
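The arithmetic in this analysis can be reproduced with a few lines. This sketch assumes mono audio (the report does not state the channel count) and uses a hypothetical helper name, `pcm_duration_ms`.

```python
SAMPLE_RATE_HZ = 24_000   # sample rate requested in the streaming config
BYTES_PER_SAMPLE = 2      # LINEAR16 means 16-bit samples


def pcm_duration_ms(num_bytes, sample_rate_hz=SAMPLE_RATE_HZ,
                    bytes_per_sample=BYTES_PER_SAMPLE, channels=1):
    """Duration in milliseconds of a raw PCM buffer (mono is an assumption)."""
    samples = num_bytes // (bytes_per_sample * channels)
    return 1000 * samples / sample_rate_hz


print(pcm_duration_ms(1920))            # one packet: 40.0 ms
print(pcm_duration_ms(231_914) / 1000)  # whole stream, in seconds (about 4.83)
```

If the payload really is 24 kHz mono PCM, the 231,914-byte stream corresponds to roughly 4.8 seconds of audio.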
Expected Behavior
When requesting AudioEncoding.OGG_OPUS, the API should return audio data wrapped in an OGG container format with OPUS codec, regardless of which TTS model is used.
Actual Behavior
Gemini TTS models completely ignore the audio encoding parameter and return LINEAR16 PCM data instead. The API silently substitutes a different format without any warning or error.
Configuration Used
from google.cloud import texttospeech

voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    name="Kore",
    model_name="gemini-2.5-flash-preview-tts",
)
streaming_config = texttospeech.StreamingSynthesizeConfig(
    voice=voice,
    streaming_audio_config=texttospeech.StreamingAudioConfig(
        audio_encoding=texttospeech.AudioEncoding.OGG_OPUS,
        sample_rate_hertz=24000,
    ),
)
Impact
- Silent format mismatch: Application expects OGG_OPUS but receives LINEAR16 PCM
- Bandwidth waste: LINEAR16 PCM is ~5-10x larger than OPUS compressed audio
- Breaking compatibility: Cannot integrate with systems expecting OGG_OPUS format
- Model-specific code: Requires completely different handling for Gemini vs Chirp3
- No error or warning: API silently returns wrong format without indication
- Cost implications: Larger data transfers increase bandwidth costs and latency
Packet Size Analysis
| Model | Packet Count | Avg Packet Size | Packet Sizing | Actual Format |
|---|---|---|---|---|
| Chirp3-HD | 4 | 2,761 bytes | Variable (proper OGG pages) | OGG_OPUS ✅ |
| Gemini TTS | 121 | 1,917 bytes | Fixed 1920 bytes (PCM frames) | LINEAR16 ❌ |
The fixed 1920-byte packets from Gemini TTS equal exactly 40ms of LINEAR16 PCM audio (960 samples × 2 bytes at 24kHz). Chirp3-HD's variable packet sizes indicate proper OGG page segmentation with OPUS compression.
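The table figures are internally consistent, which can be checked as follows. This is a sketch under the assumption that every Gemini packet except the last is exactly 1920 bytes; the helper names are hypothetical.

```python
# Totals and packet counts copied from the test results in this report.
CHIRP_TOTAL, CHIRP_PACKETS = 11_046, 4
GEMINI_TOTAL, GEMINI_PACKETS = 231_914, 121
PCM_PACKET_BYTES = 1920


def average_packet_size(total_bytes, packet_count):
    """Mean packet size in bytes."""
    return total_bytes / packet_count


def split_fixed_packets(total_bytes, packet_bytes):
    """Return (full_packets, trailing_bytes) for a stream cut into fixed-size packets."""
    return divmod(total_bytes, packet_bytes)


full, trailing = split_fixed_packets(GEMINI_TOTAL, PCM_PACKET_BYTES)
packets = full + (1 if trailing else 0)

print(average_packet_size(CHIRP_TOTAL, CHIRP_PACKETS))    # 2761.5
print(average_packet_size(GEMINI_TOTAL, GEMINI_PACKETS))  # about 1916.6
print(full, trailing, packets)                            # 120 1514 121
```

So 120 full 1920-byte packets plus one shorter final packet of 1514 bytes account for all 121 packets, which explains why the average (about 1,917 bytes) sits slightly below 1920.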
Reproduction Steps
- Use the Google Cloud Text-to-Speech API with streaming synthesis
- Configure AudioEncoding.OGG_OPUS with 24kHz sample rate
- Compare output from en-US-Chirp3-HD-Charon vs gemini-2.5-flash-preview-tts
- Inspect the first packet's magic bytes
Test script: test_chirp3_vs_gemini.py (included in this repository)
To reproduce:
python test_chirp3_vs_gemini.py
This will generate two files:
- test_chirp3_streaming.ogg (proper OGG_OPUS, playable with ffplay) ✅
- test_gemini_streaming.ogg (actually LINEAR16 PCM, play with ffplay -f s16le -ar 24000) ❌
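As an alternative to passing raw-format flags to ffplay, the mislabeled Gemini output can be wrapped in a WAV header so ordinary players open it. This is a sketch using only the standard library `wave` module; the function name `wrap_pcm_as_wav` and the output file name are illustrative, and 24 kHz mono 16-bit is assumed from the analysis above.

```python
import wave
from pathlib import Path


def wrap_pcm_as_wav(pcm: bytes, wav_path: str,
                    sample_rate_hz: int = 24_000, channels: int = 1) -> None:
    """Write raw 16-bit PCM bytes into a WAV container (format is assumed)."""
    with wave.open(wav_path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate_hz)
        wav.writeframes(pcm)


# Convert the file produced by the test script, if it is present.
source = Path("test_gemini_streaming.ogg")
if source.exists():
    wrap_pcm_as_wav(source.read_bytes(), "test_gemini_streaming.wav")
```

If the resulting WAV plays as intelligible speech, that is further evidence the payload is LINEAR16 rather than Ogg Opus.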
Environment
- Library: google-cloud-texttospeech
- API: Google Cloud Text-to-Speech Streaming API
- Models tested:
  - en-US-Chirp3-HD-Charon ✅
  - gemini-2.5-flash-preview-tts ❌
- Python: 3.13.7
- Date: November 10, 2025
Additional Findings
- Gemini TTS does NOT support LINEAR16 encoding: Requesting AudioEncoding.LINEAR16 returns error: 400 Unsupported audio encoding
- Yet it returns LINEAR16 anyway: Despite not officially supporting it, this is what you actually get when requesting OGG_OPUS
- No documentation: The API doesn't document what encodings Gemini TTS actually supports
- Silent failure: No error, warning, or indication that the requested format is unavailable
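Until the API reports the mismatch itself, a client can fail fast on its own side. The sketch below is a possible workaround, not part of the library: `require_ogg` and `UnexpectedAudioFormat` are hypothetical names, and the check only looks at the first four bytes of the first chunk.

```python
from typing import Iterable, Iterator


class UnexpectedAudioFormat(ValueError):
    """Raised when a stream that should be Ogg does not start with 'OggS'."""


def require_ogg(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Pass chunks through unchanged, but reject a stream whose first chunk is not Ogg."""
    iterator = iter(chunks)
    first = next(iterator, None)
    if first is None:
        return  # empty stream: nothing to validate
    if not first.startswith(b"OggS"):
        raise UnexpectedAudioFormat(
            f"expected Ogg page, got bytes {first[:8].hex()}"
        )
    yield first
    yield from iterator
```

With the streaming client, this could wrap the audio payloads, for example `require_ogg(r.audio_content for r in responses)`, where `responses` is assumed to be the iterator returned by the streaming call.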
Request
Please fix Gemini TTS to either:
- Preferred: Support OGG_OPUS encoding properly (return OGG containers with OPUS codec), consistent with Chirp3-HD
- Alternative: Return a clear error when OGG_OPUS is requested if not supported, rather than silently returning a different format
- Documentation: Clearly document which audio encodings are supported by each model
Additional Notes
- The issue affects real-time streaming scenarios where bandwidth and format compliance are critical
- Standards compliance: RFC 7845 - Ogg Encapsulation for the Opus Audio Codec
- Streaming pipelines built around Opus (WebRTC, etc.) cannot consume raw PCM without re-encoding it
- LINEAR16 PCM uses roughly 5-10x more bandwidth than OPUS for comparable audio quality