[Bug]: [Gemini TTS] Streaming Returns LINEAR16 PCM When OGG_OPUS is Requested #2480

### File Name

N/A

### What happened?

When using Google Cloud Text-to-Speech API's streaming synthesis with `OGG_OPUS` audio encoding, **Chirp3-HD** and **Gemini TTS** models behave completely differently:

- ✅ **Chirp3-HD**: Returns proper OGG container format with OPUS codec (as requested)
- ❌ **Gemini TTS**: Returns **LINEAR16 PCM** instead, ignoring the encoding parameter

This is a critical bug where Gemini TTS **silently ignores the requested audio encoding** and returns uncompressed PCM data instead of the requested compressed OGG_OPUS format.

## Test Results

### Chirp3-HD (Working as Expected)

```
Packets: 4
Total size: 11,046 bytes
First packet: Starts with 'OggS' ✅
Contains 'OpusHead': Yes ✅
```

**Hex dump (first 20 bytes):**

```
4f67675300020000000000000000000000000000
^OggS (proper OGG magic number)
```
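The 20 bytes above can be checked field by field against the Ogg page header layout (RFC 3533). The following is a minimal sketch using only the standard library; the function name `parse_ogg_page_prefix` is illustrative and is not part of the attached test script.

```python
import struct

def parse_ogg_page_prefix(data: bytes) -> dict:
    """Decode the first 18 bytes of an Ogg page header (layout per RFC 3533)."""
    if len(data) < 18:
        raise ValueError("need at least 18 bytes of page header")
    # 4s: capture pattern, B: stream structure version, B: header type flags,
    # q: granule position, I: bitstream serial number (all little-endian).
    capture, version, flags, granule, serial = struct.unpack_from("<4sBBqI", data)
    return {
        "capture": capture,
        "version": version,
        "is_continuation": bool(flags & 0x01),
        "is_first_page": bool(flags & 0x02),
        "is_last_page": bool(flags & 0x04),
        "granule_position": granule,
        "serial": serial,
    }

# Hex dump of the first Chirp3-HD packet, copied from this report.
header = parse_ogg_page_prefix(
    bytes.fromhex("4f67675300020000000000000000000000000000")
)
print(header)
```

The header type byte `0x02` marks the first page of a logical bitstream, which is where an Ogg Opus stream is expected to carry its `OpusHead` packet.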

### Gemini TTS (Issue)

```
Packets: 121
Total size: 231,914 bytes
First packet: Starts with 'OggS': No ❌
Contains 'OpusHead': No ❌
Packet size: 1920 bytes (consistent)
```

**Hex dump (first 20 bytes):**

```
08000b00fcffe9ffdfffdbffd2ffc6ffc7ffdfff
(This is LINEAR16 PCM audio data!)
```

**Analysis:**

- 1920 bytes per packet = 960 samples × 2 bytes (16-bit, assuming mono)
- 960 samples at 24kHz = 40ms of audio
- Data plays correctly as PCM: `ffplay -f s16le -ar 24000 gemini_output.raw` ✅
- This is **LINEAR16 PCM**, NOT OGG_OPUS as requested!
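The arithmetic above can be reproduced by decoding the hex dump as signed 16-bit little-endian samples. This is a sketch under the assumption that the stream is mono at 24 kHz (the requested `sample_rate_hertz`); the helper names are illustrative.

```python
import struct

SAMPLE_RATE_HZ = 24_000   # requested sample_rate_hertz
BYTES_PER_SAMPLE = 2      # 16-bit samples, mono assumed

def decode_s16le(data: bytes) -> list[int]:
    """Interpret bytes as signed 16-bit little-endian samples."""
    usable = len(data) - len(data) % BYTES_PER_SAMPLE  # drop a trailing odd byte
    return [sample for (sample,) in struct.iter_unpack("<h", data[:usable])]

def packet_duration_ms(packet_bytes: int) -> float:
    """Duration of a raw PCM packet of the given size, in milliseconds."""
    return packet_bytes * 1000 / (BYTES_PER_SAMPLE * SAMPLE_RATE_HZ)

# Hex dump of the first Gemini TTS packet, copied from this report.
samples = decode_s16le(bytes.fromhex("08000b00fcffe9ffdfffdbffd2ffc6ffc7ffdfff"))
print(samples)                   # [8, 11, -4, -23, -33, -37, -46, -58, -57, -33]
print(packet_duration_ms(1920))  # 40.0
```

The decoded values are small and vary smoothly from sample to sample, which is what quiet PCM audio looks like and not what an Ogg page header looks like.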

## Expected Behavior

When requesting `AudioEncoding.OGG_OPUS`, the API should return audio data wrapped in an OGG container format with OPUS codec, regardless of which TTS model is used.

## Actual Behavior

Gemini TTS models **completely ignore the audio encoding parameter** and return LINEAR16 PCM data instead. The API silently substitutes a different format without any warning or error.

## Configuration Used

```python
voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    name="Kore",
    model_name="gemini-2.5-flash-preview-tts",
)

streaming_config = texttospeech.StreamingSynthesizeConfig(
    voice=voice,
    streaming_audio_config=texttospeech.StreamingAudioConfig(
        audio_encoding=texttospeech.AudioEncoding.OGG_OPUS,
        sample_rate_hertz=24000,
    ),
)
```

## Impact

1. **Silent format mismatch**: Application expects OGG_OPUS but receives LINEAR16 PCM
2. **Bandwidth waste**: LINEAR16 PCM is ~5-10x larger than OPUS compressed audio
3. **Breaking compatibility**: Cannot integrate with systems expecting OGG_OPUS format
4. **Model-specific code**: Requires completely different handling for Gemini vs Chirp3
5. **No error or warning**: API silently returns wrong format without indication
6. **Cost implications**: Larger data transfers increase bandwidth costs and latency
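Until the API reports the mismatch itself, a client can fail fast on the first chunk. The sketch below assumes `responses` is the iterator returned by the streaming call and that each item exposes `audio_content`; the wrapper name `require_ogg` is hypothetical, and stand-in objects are used here so the example runs without the client library.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator

OGG_MAGIC = b"OggS"

def require_ogg(responses: Iterable) -> Iterator[bytes]:
    """Yield audio chunks, raising if the stream does not begin with an Ogg page."""
    checked = False
    for response in responses:
        chunk = response.audio_content
        if not chunk:
            continue  # nothing to inspect or forward in an empty chunk
        if not checked:
            if not chunk.startswith(OGG_MAGIC):
                raise ValueError(
                    f"expected Ogg stream, first bytes were {chunk[:4].hex()}"
                )
            checked = True
        yield chunk

# Stand-in for a Gemini TTS response carrying raw PCM bytes.
fake_stream = [SimpleNamespace(audio_content=bytes.fromhex("08000b00fcff"))]
try:
    list(require_ogg(fake_stream))
except ValueError as exc:
    print(exc)
```

Only the first non-empty chunk is inspected, so the check adds no measurable overhead to the rest of the stream.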

## Packet Size Analysis

| Model      | Packet Count | Avg Packet Size | Packet Size Pattern           | Detected Format |
| ---------- | ------------ | --------------- | ----------------------------- | --------------- |
| Chirp3-HD  | 4            | 2,761 bytes     | Variable (proper OGG pages)   | OGG_OPUS ✅     |
| Gemini TTS | 121          | 1,917 bytes     | Fixed 1920 bytes (PCM frames) | LINEAR16 ❌     |

The fixed 1920-byte packets from Gemini TTS equal exactly 40ms of LINEAR16 PCM audio (960 samples × 2 bytes at 24kHz). Chirp3-HD's variable packet sizes indicate proper OGG page segmentation with OPUS compression.
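The table figures can be cross-checked with a small summary over the packet sizes. The size list below is a hypothetical reconstruction: it assumes 120 full 1920-byte packets plus one shorter final packet, which is the only split consistent with the reported count and total. The duration figure additionally assumes 16-bit mono PCM at 24 kHz.

```python
from collections import Counter

PCM_BYTES_PER_SECOND = 24_000 * 2  # 24 kHz, 16-bit, mono assumed

def summarize_packets(sizes: list[int]) -> dict:
    """Summarize packet sizes; a dominant fixed size hints at raw PCM framing."""
    dominant_size, hits = Counter(sizes).most_common(1)[0]
    total = sum(sizes)
    return {
        "count": len(sizes),
        "total_bytes": total,
        "mean_bytes": total / len(sizes),
        "dominant_size": dominant_size,
        "dominant_share": hits / len(sizes),
        "seconds_if_pcm": total / PCM_BYTES_PER_SECOND,
    }

# Reconstructed from the reported totals: 121 packets, 231,914 bytes.
gemini_sizes = [1920] * 120 + [231_914 - 120 * 1920]
print(summarize_packets(gemini_sizes))
```

A mean just under 1920 bytes with nearly every packet at exactly 1920 bytes matches the "Avg Packet Size" of 1,917 bytes in the table.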

## Reproduction Steps

1. Use the Google Cloud Text-to-Speech API with streaming synthesis
2. Configure `AudioEncoding.OGG_OPUS` with 24kHz sample rate
3. Compare output from `en-US-Chirp3-HD-Charon` vs `gemini-2.5-flash-preview-tts`
4. Inspect the first packet's magic bytes

**Test script:** [test_chirp3_vs_gemini.py](https://github.com/user-attachments/files/23442894/test_chirp3_vs_gemini.py) (attached to this issue)

To reproduce:

```bash
python test_chirp3_vs_gemini.py
```

This will generate two files:

- `test_chirp3_streaming.ogg` (proper OGG_OPUS, playable with `ffplay`) ✅
- `test_gemini_streaming.ogg` (actually LINEAR16 PCM, play with `ffplay -f s16le -ar 24000`) ❌
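The two output files can also be checked without a media player. The sketch below is a heuristic, not a full Ogg parser: it only tests that the file starts with the `OggS` capture pattern and that `OpusHead` (the identification header defined in RFC 7845) appears near the start. The function name and the 4096-byte probe size are illustrative choices.

```python
from pathlib import Path

def looks_like_ogg_opus(path: str, probe_bytes: int = 4096) -> bool:
    """Heuristic: file begins with an Ogg page and 'OpusHead' appears early on."""
    with open(path, "rb") as handle:
        head = handle.read(probe_bytes)
    return head.startswith(b"OggS") and b"OpusHead" in head

if __name__ == "__main__":
    # File names as produced by the test script in this report.
    for name in ("test_chirp3_streaming.ogg", "test_gemini_streaming.ogg"):
        if Path(name).exists():
            print(name, "->", looks_like_ogg_opus(name))
        else:
            print(name, "-> not found")
```

Based on the results reported above, the Chirp3-HD file should pass this check and the Gemini TTS file should not.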

## Environment

- **Library**: `google-cloud-texttospeech`
- **API**: Google Cloud Text-to-Speech Streaming API
- **Models tested**:
  - `en-US-Chirp3-HD-Charon` ✅
  - `gemini-2.5-flash-preview-tts` ❌
- **Python**: 3.13.7
- **Date**: November 10, 2025

## Additional Findings

- **Gemini TTS does NOT support LINEAR16 encoding**: Requesting `AudioEncoding.LINEAR16` returns error: `400 Unsupported audio encoding`
- **Yet it returns LINEAR16 anyway**: Requesting `OGG_OPUS` produces the very format the API rejects when it is requested explicitly
- **No documentation**: The API documentation does not state which encodings Gemini TTS supports
- **Silent failure**: No error, warning, or indication that the requested format is unavailable

## Request

Please fix Gemini TTS with option 1 or 2 below, and address item 3 in either case:

1. **Preferred**: Support `OGG_OPUS` encoding properly (return OGG containers with OPUS codec), consistent with Chirp3-HD
2. **Alternative**: Return a clear error when OGG_OPUS is requested if not supported, rather than silently returning a different format
3. **Documentation**: Clearly document which audio encodings are supported by each model

## Additional Notes

- The issue affects real-time streaming scenarios where bandwidth and format compliance are critical
- Standards compliance: [RFC 7845 - Ogg Encapsulation for the Opus Audio Codec](https://datatracker.ietf.org/doc/html/rfc7845)
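- Downstream consumers built around Opus (for example, pipelines that feed WebRTC, which carries Opus over RTP) cannot accept raw PCM delivered in place of `OGG_OPUS`
- LINEAR16 PCM uses roughly 5-10x more bandwidth than Opus-compressed audio, as noted under Impact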


### Relevant log output

```shell

```

### Code of Conduct

- [x] I agree to follow this project's Code of Conduct
