Commit 3a1429b

Enforce MP3 pass-through and WAV fallback for formats
All non-MP3 audio format requests are now mapped to WAV for predictable compatibility, with MP3 requests yielding MP3 output. Python clients normalize outbound response_format and surface fallback metadata. Web playground and WebSocket demo no longer expose manual format selectors, and documentation clarifies the MP3/WAV behavior. Version bumped to 3.3.0-alpha4.
1 parent ce8fdfb commit 3a1429b
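
The mapping described in the commit message can be sketched as follows. This is a minimal illustration, not the project's code: the `AudioFormat` stand-in and its values are assumed from the six formats named in the README, and `effective_format` is a hypothetical name (the diff imports a helper called `get_supported_format` from `ttsfm.models`, whose body is not shown here).

```python
from enum import Enum


class AudioFormat(Enum):
    """Stand-in for the project's enum; member values are assumed."""
    MP3 = "mp3"
    WAV = "wav"
    OPUS = "opus"
    AAC = "aac"
    FLAC = "flac"
    PCM = "pcm"


def effective_format(requested: AudioFormat) -> AudioFormat:
    """Hypothetical helper: MP3 passes through, every other format maps to WAV."""
    return AudioFormat.MP3 if requested is AudioFormat.MP3 else AudioFormat.WAV


if __name__ == "__main__":
    # Print the requested -> effective mapping for every known format.
    for fmt in AudioFormat:
        print(f"{fmt.value} -> {effective_format(fmt).value}")
```

Under these assumptions only `mp3` maps to itself; `wav` also maps to `wav`, so a WAV request is not really a fallback even though it goes through the same branch.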

17 files changed: +185 -141 lines changed
CHANGELOG.md

Lines changed: 13 additions & 0 deletions
@@ -5,6 +5,19 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [3.3.0-alpha4] - 2025-09-19
+
+### Changed
+- Enforced MP3 pass-through while mapping all other requested formats to WAV so the service returns predictable audio without failing compatibility checks.
+- Python clients now normalise outbound `response_format` payloads to the supported set and surface fallback metadata when a WAV result is returned.
+- Docker build workflow tags only `v*` image aliases to avoid duplicate semver tags without the `v` prefix.
+
+### Removed
+- Web playground and WebSocket demo no longer expose manual format selectors, reducing confusion around unavoidable WAV fallbacks.
+
+### Documentation
+- README (EN/ZH) clarifies the MP3-only guarantee and WAV fallback, and the UI copy was refreshed accordingly.
+
 ## [3.3.0-alpha3] - 2025-09-18

 ### Added

README.md

Lines changed: 7 additions & 0 deletions
@@ -21,6 +21,7 @@ TTSFM provides both synchronous and asynchronous Python clients for text-to-spee
 -**Async & Sync** - Both `asyncio` and synchronous clients available
 - 🗣️ **11 Voices** - All OpenAI-compatible voices (alloy, echo, fable, onyx, nova, shimmer, etc.)
 - 🎵 **6 Audio Formats** - MP3, WAV, OPUS, AAC, FLAC, PCM support
+- 🎼 **Format Fallback** - MP3 requests yield MP3; other OpenAI formats map cleanly to WAV for reliable playback
 - 🐳 **Docker Ready** - One-command deployment with web interface
 - 🌐 **Web Interface** - Interactive playground for testing voices and formats
 - 🔧 **CLI Tool** - Command-line interface for quick TTS generation
@@ -184,6 +185,9 @@ combined = client.generate_speech_long_text(
 )

 combined.save_to_file("long_text") # Saves as long_text.mp3
+
+# Note: Only MP3 requests return MP3 data. Other formats (OPUS/AAC/FLAC/WAV/PCM)
+# are delivered as WAV while remaining API-compatible.
 ```

 #### OpenAI Python Client Compatibility
@@ -270,6 +274,9 @@ ttsfm "Hello, world!" --url http://localhost:7000 --output hello.mp3
 # Auto-combine long text into a single file
 ttsfm --text-file article.txt --output article.mp3 --split-long-text --auto-combine

+> **Heads-up:** The CLI accepts all OpenAI-compatible format options, but anything
+> other than `mp3` will be delivered as WAV by the free upstream service.
+
 # List available voices
 ttsfm --list-voices

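
Because a non-MP3 request may come back as WAV, a caller that cares about the real container could check the leading bytes of the payload instead of trusting the requested format. The sketch below is a heuristic under stated assumptions: `sniff_audio_container` is a hypothetical helper, not part of TTSFM, and it only distinguishes RIFF/WAVE data from MP3 data (an ID3v2 tag or an MPEG frame sync).

```python
def sniff_audio_container(data: bytes) -> str:
    """Return 'wav', 'mp3', or 'unknown' based on the leading bytes (heuristic)."""
    # WAV files are RIFF containers: b"RIFF", a 4-byte size, then b"WAVE".
    if len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    # MP3 files commonly begin with an ID3v2 tag...
    if data[:3] == b"ID3":
        return "mp3"
    # ...or directly with an MPEG audio frame, whose first 11 bits are all set.
    # Note: ADTS AAC streams share this sync pattern, so this check is approximate.
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return "mp3"
    return "unknown"


if __name__ == "__main__":
    wav_header = b"RIFF" + (36).to_bytes(4, "little") + b"WAVE"
    print(sniff_audio_container(wav_header))
    print(sniff_audio_container(b"ID3\x04\x00\x00\x00\x00\x00\x00"))
```

A caller could use the result to pick the file extension when saving, rather than assuming the extension matches the format it asked for.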

README.zh.md

Lines changed: 7 additions & 0 deletions
@@ -21,6 +21,7 @@ TTSFM为文本转语音生成提供同步和异步Python客户端,使用逆向
 -**异步和同步** - 提供`asyncio`和同步客户端
 - 🗣️ **11种声音** - 所有OpenAI兼容的声音(alloy、echo、fable、onyx、nova、shimmer等)
 - 🎵 **6种音频格式** - 支持MP3、WAV、OPUS、AAC、FLAC、PCM
+- 🎼 **格式回退** - 请求MP3时输出MP3;其他OpenAI格式会安全回退为WAV,保证兼容性
 - 🐳 **Docker就绪** - 一键部署,包含Web界面
 - 🌐 **Web界面** - 用于测试声音和格式的交互式试用平台
 - 🔧 **CLI工具** - 用于快速TTS生成的命令行界面
@@ -173,6 +174,9 @@ combined = client.generate_speech_long_text(
 )

 combined.save_to_file("long_text") # 保存为 long_text.mp3
+
+# 提示:只有 MP3 请求会返回 MP3 数据,其余格式(OPUS/AAC/FLAC/WAV/PCM)
+# 会回退为 WAV,以确保兼容免费上游服务。
 ```

 #### OpenAI Python客户端兼容性
@@ -264,6 +268,9 @@ ttsfm --list-voices

 # 获取帮助
 ttsfm --help
+
+> **提示:** CLI 仍然接受所有 OpenAI 兼容格式参数,但除了 `mp3` 之外的选项都会
+> 回退为 WAV,这与免费上游服务的行为一致。
 ```

 ## ⚙️ 配置

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -86,7 +86,7 @@ ttsfm = "ttsfm.cli:main"
 version_scheme = "no-guess-dev"
 local_scheme = "no-local-version"

-fallback_version = "3.3.0-alpha3"
+fallback_version = "3.3.0-alpha4"
 [tool.setuptools]
 packages = ["ttsfm"]


tests/test_clients.py

Lines changed: 45 additions & 0 deletions
@@ -1,4 +1,5 @@
 import pytest
+import types

 from ttsfm.client import TTSClient
 from ttsfm.async_client import AsyncTTSClient
@@ -14,6 +15,50 @@ def _mk_response(data: bytes) -> TTSResponse:
     )


+class _DummyResponse:
+    def __init__(self, content_type: str, content: bytes, url: str = "https://example.test/audio"):
+        self.status_code = 200
+        self.headers = {"content-type": content_type}
+        self.content = content
+        self.url = url
+        self.text = ""
+
+    def json(self): # pragma: no cover - not used on success path
+        return {}
+
+
+def test_sync_request_normalizes_non_mp3_format(monkeypatch):
+    client = TTSClient()
+    captured = {}
+
+    def fake_post(self, url, data=None, headers=None, timeout=None, verify=None):
+        captured["data"] = data
+        return _DummyResponse("audio/wav", b"RIFF" + b"\x00" * 64, url)
+
+    monkeypatch.setattr(client.session, "post", types.MethodType(fake_post, client.session))
+
+    response = client.generate_speech(text="hello", voice="alloy", response_format=AudioFormat.FLAC)
+
+    assert captured["data"]["response_format"] == "wav"
+    assert response.format is AudioFormat.WAV
+
+
+def test_sync_request_preserves_mp3_format(monkeypatch):
+    client = TTSClient()
+    captured = {}
+
+    def fake_post(self, url, data=None, headers=None, timeout=None, verify=None):
+        captured["data"] = data
+        return _DummyResponse("audio/mpeg", b"ID3" + b"\x00" * 64, url)
+
+    monkeypatch.setattr(client.session, "post", types.MethodType(fake_post, client.session))
+
+    response = client.generate_speech(text="hello", voice="alloy", response_format=AudioFormat.MP3)
+
+    assert captured["data"]["response_format"] == "mp3"
+    assert response.format is AudioFormat.MP3
+
+
 def test_sync_long_text_auto_combine(monkeypatch):
     client = TTSClient()

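
The new tests replace `client.session.post` with a function bound through `types.MethodType`, so the stub receives `self` like a real method and can record the outgoing payload. The standalone sketch below shows the same technique without pytest or TTSFM; `Session`, `fake_post`, and the URL are illustrative stand-ins, not the project's objects.

```python
import types


class Session:
    """Illustrative stand-in for an HTTP session object."""

    def post(self, url, data=None):
        raise RuntimeError("network access is disabled in this sketch")


captured = {}


def fake_post(self, url, data=None):
    # Record what the caller tried to send, then return a canned value.
    captured["url"] = url
    captured["data"] = data
    return "stubbed"


session = Session()
# Bind fake_post to this one instance; other Session objects keep the real method.
session.post = types.MethodType(fake_post, session)

result = session.post("https://example.test/audio", data={"response_format": "wav"})
print(result, captured["data"])
```

In the tests, `monkeypatch.setattr` performs the assignment instead, which has the added benefit of restoring the original attribute when the test finishes.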

tests/test_web_app.py

Lines changed: 2 additions & 7 deletions
@@ -74,17 +74,12 @@ def from_wav(cls, buffer):
             cls.formats.append("wav")
             return DummySegment("wav")

-        @classmethod
-        def from_file(cls, buffer, format: str):
-            cls.formats.append(format)
-            return DummySegment(format)
-
     monkeypatch.setattr(audio_module, "AudioSegment", DummyAudioSegment)

     output = audio_module.combine_audio_chunks([b"one", b"two"], "opus")

-    assert output == b"opus:opusopus"
-    assert DummyAudioSegment.formats == ["opus", "opus"]
+    assert output == b"wav:wavwav"
+    assert DummyAudioSegment.formats == ["wav", "wav"]


 @pytest.mark.parametrize('header_name, header_value', [

ttsfm-web/app.py

Lines changed: 49 additions & 24 deletions
@@ -29,6 +29,7 @@
 # Import the TTSFM package
 try:
     from ttsfm import TTSClient, Voice, AudioFormat, TTSException
+    from ttsfm.models import get_supported_format
     from ttsfm.audio import combine_audio_chunks
     from ttsfm.exceptions import APIException, NetworkException, ValidationException
     from ttsfm.utils import validate_text_length, split_text_by_length
@@ -37,6 +38,7 @@
     import sys
     sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..'))
     from ttsfm import TTSClient, Voice, AudioFormat, TTSException
+    from ttsfm.models import get_supported_format
     from ttsfm.audio import combine_audio_chunks
     from ttsfm.exceptions import APIException, NetworkException, ValidationException
     from ttsfm.utils import validate_text_length, split_text_by_length
@@ -488,7 +490,12 @@ def generate_speech():
                 "error": f"Invalid format: {response_format}. Must be one of: {[f.value for f in AudioFormat]}"
             }), 400

-        logger.info(f"Generating speech: text='{text[:50]}...', voice={voice}, format={response_format}")
+        effective_format = get_supported_format(format_enum)
+
+        logger.info(
+            "Generating speech: text='%s...', voice=%s, requested_format=%s (effective=%s)",
+            text[:50], voice, response_format, effective_format.value
+        )

         client = create_tts_client()
         response = client.generate_speech(
@@ -503,7 +510,9 @@ def generate_speech():
         headers = {
             'Content-Disposition': f'attachment; filename="speech.{response.format.value}"',
             'X-Audio-Format': response.format.value,
-            'X-Audio-Size': str(response.size)
+            'X-Audio-Size': str(response.size),
+            'X-Requested-Format': format_enum.value,
+            'X-Effective-Format': effective_format.value
         }

         return Response(
@@ -559,16 +568,17 @@ def generate_speech_combined():
         if not text:
             return jsonify({"error": "Text is required"}), 400

+        try:
+            voice_enum = Voice(voice.lower())
+            format_enum = AudioFormat(response_format.lower())
+        except ValueError as e:
+            logger.warning(f"Invalid voice or format: {e}")
+            return jsonify({"error": "Invalid voice or format specified"}), 400
+
+        effective_format = get_supported_format(format_enum)
+
         # Check if text needs splitting
         if len(text) <= max_length:
-            # Text is short enough, use regular generation
-            try:
-                voice_enum = Voice(voice.lower())
-                format_enum = AudioFormat(response_format.lower())
-            except ValueError as e:
-                logger.warning(f"Invalid voice or format: {e}")
-                return jsonify({"error": "Invalid voice or format specified"}), 400
-
             client = create_tts_client()

             response = client.generate_speech(
@@ -584,7 +594,9 @@ def generate_speech_combined():
                 'Content-Disposition': f'attachment; filename="combined_speech.{response.format.value}"',
                 'X-Audio-Format': response.format.value,
                 'X-Audio-Size': str(response.size),
-                'X-Chunks-Combined': '1'
+                'X-Chunks-Combined': '1',
+                'X-Requested-Format': format_enum.value,
+                'X-Effective-Format': effective_format.value
             }

             return Response(
@@ -626,11 +638,12 @@ def generate_speech_combined():
         logger.info(f"Generated {len(responses)} chunks, combining into single audio file")

         # Extract audio data from responses
-        audio_chunks = [response.audio_data for response in responses]
+        audio_chunks = [resp.audio_data for resp in responses]

         # Combine audio chunks
         try:
-            combined_audio = combine_audio_chunks(audio_chunks, format_enum.value)
+            actual_format = responses[0].format
+            combined_audio = combine_audio_chunks(audio_chunks, actual_format.value)
         except Exception as e:
             logger.error(f"Failed to combine audio chunks: {e}")
             return jsonify({"error": "Failed to combine audio chunks"}), 500
@@ -644,11 +657,13 @@ def generate_speech_combined():
         logger.info(f"Successfully combined {len(responses)} chunks into single audio file ({len(combined_audio)} bytes)")

         combined_headers = {
-            'Content-Disposition': f'attachment; filename="combined_speech.{format_enum.value}"',
-            'X-Audio-Format': format_enum.value,
+            'Content-Disposition': f'attachment; filename="combined_speech.{actual_format.value}"',
+            'X-Audio-Format': actual_format.value,
             'X-Audio-Size': str(len(combined_audio)),
             'X-Chunks-Combined': str(len(responses)),
-            'X-Original-Text-Length': str(len(text))
+            'X-Original-Text-Length': str(len(text)),
+            'X-Requested-Format': format_enum.value,
+            'X-Effective-Format': get_supported_format(format_enum).value
         }

         return Response(
@@ -699,7 +714,7 @@ def get_status():
         return jsonify({
             "status": "online",
             "tts_service": "openai.fm (free)",
-            "package_version": "3.3.0-alpha3",
+            "package_version": "3.3.0-alpha4",
             "timestamp": datetime.now().isoformat()
         })

@@ -717,7 +732,7 @@ def health_check():
     """Simple health check endpoint."""
     return jsonify({
         "status": "healthy",
-        "package_version": "3.3.0-alpha3",
+        "package_version": "3.3.0-alpha4",
         "timestamp": datetime.now().isoformat()
     })

@@ -818,7 +833,12 @@ def openai_speech():
                 }
             }), 400

-        logger.info(f"OpenAI API: Generating speech: text='{input_text[:50]}...', voice={voice}, format={response_format}, auto_combine={auto_combine}")
+        effective_format = get_supported_format(format_enum)
+
+        logger.info(
+            "OpenAI API: Generating speech: text='%s...', voice=%s, requested_format=%s (effective=%s), auto_combine=%s",
+            input_text[:50], voice, response_format, effective_format.value, auto_combine
+        )

         client = create_tts_client()

@@ -847,8 +867,9 @@ def openai_speech():
                 }), 400

             # Extract audio data and combine
-            audio_chunks = [response.audio_data for response in responses]
-            combined_audio = combine_audio_chunks(audio_chunks, format_enum.value)
+            audio_chunks = [resp.audio_data for resp in responses]
+            actual_format = responses[0].format
+            combined_audio = combine_audio_chunks(audio_chunks, actual_format.value)

             if not combined_audio:
                 return jsonify({
@@ -865,12 +886,14 @@ def openai_speech():

             headers = {
                 'Content-Type': content_type,
-                'X-Audio-Format': format_enum.value,
+                'X-Audio-Format': actual_format.value,
                 'X-Audio-Size': str(len(combined_audio)),
                 'X-Chunks-Combined': str(len(responses)),
                 'X-Original-Text-Length': str(len(input_text)),
                 'X-Auto-Combine': 'true',
-                'X-Powered-By': 'TTSFM-OpenAI-Compatible'
+                'X-Powered-By': 'TTSFM-OpenAI-Compatible',
+                'X-Requested-Format': format_enum.value,
+                'X-Effective-Format': effective_format.value
             }

             return Response(
@@ -908,7 +931,9 @@ def openai_speech():
                 'X-Audio-Size': str(response.size),
                 'X-Chunks-Combined': '1',
                 'X-Auto-Combine': str(auto_combine).lower(),
-                'X-Powered-By': 'TTSFM-OpenAI-Compatible'
+                'X-Powered-By': 'TTSFM-OpenAI-Compatible',
+                'X-Requested-Format': format_enum.value,
+                'X-Effective-Format': effective_format.value
             }

             return Response(
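
The `X-Requested-Format` and `X-Effective-Format` headers added above let a caller tell whether a fallback happened. The sketch below illustrates that idea in isolation: `build_format_headers` and `fell_back` are hypothetical helpers, not functions in `app.py`, and the sketch simplifies by setting `X-Audio-Format` to the effective format, whereas the real endpoints take it from the upstream response.

```python
from typing import Dict, Mapping


def build_format_headers(requested: str, effective: str, size: int) -> Dict[str, str]:
    """Assemble the format-related response headers (simplified sketch)."""
    return {
        "X-Audio-Format": effective,
        "X-Audio-Size": str(size),
        "X-Requested-Format": requested,
        "X-Effective-Format": effective,
    }


def fell_back(headers: Mapping[str, str]) -> bool:
    """Return True when both headers are present and name different formats."""
    requested = headers.get("X-Requested-Format", "").lower()
    effective = headers.get("X-Effective-Format", "").lower()
    return bool(requested) and bool(effective) and requested != effective


if __name__ == "__main__":
    flac_headers = build_format_headers("flac", "wav", 1024)
    print(fell_back(flac_headers))
    mp3_headers = build_format_headers("mp3", "mp3", 2048)
    print(fell_back(mp3_headers))
```

Comparing the two headers keeps the check independent of the payload, which is useful when a client only inspects response metadata.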

ttsfm-web/static/js/playground-enhanced-fixed.js

Lines changed: 10 additions & 1 deletion
@@ -39,7 +39,12 @@ const PlaygroundApp = (() => {

         checkAuthStatus();
         loadVoices();
-        loadFormats();
+
+        if (document.getElementById('format-select')) {
+            loadFormats();
+        } else {
+            state.format = 'mp3';
+        }
         updateCharCount();
         updateAudioSummary();
         updateActionButtons(false);
@@ -50,6 +55,10 @@ const PlaygroundApp = (() => {
         els.textInput = document.getElementById('text-input');
         els.voiceSelect = document.getElementById('voice-select');
         els.formatSelect = document.getElementById('format-select');
+        if (!els.formatSelect) {
+            els.formatSelect = document.createElement('select');
+            els.formatSelect.value = state.format;
+        }
         els.instructionsInput = document.getElementById('instructions-input');
         els.apiKeyInput = document.getElementById('api-key-input');
         els.maxLengthInput = document.getElementById('max-length-input');
