Commit 9f64f30

Add centralized audio combining and auto-combine
Introduces ttsfm/audio.py with reusable audio chunk combining logic and a new combine_responses helper. Adds auto_combine support to both sync and async clients and CLI, enabling single-file output for long text. Updates documentation and tests to cover the new behavior, and bumps version to 3.3.0-alpha3.
1 parent 4e601a8 commit 9f64f30

13 files changed: +316 -119 lines changed

CHANGELOG.md

Lines changed: 16 additions & 0 deletions
@@ -5,6 +5,22 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [3.3.0-alpha3] - 2025-09-18
+
+### Added
+- Centralised audio chunk combining in `ttsfm/audio.py`, including the reusable `combine_responses` helper for both core and web flows.
+- `auto_combine=True` support in the synchronous/asynchronous clients and CLI delivers a single audio file for long text (pydub still optional for non-WAV output).
+- Regression tests (`tests/test_clients.py`) covering the new combination paths.
+
+### Changed
+- Long-text splitting now falls back to word-level chunks with a small tolerance so punctuation stays intact while respecting `max_length` limits.
+
+### Documentation
+- README (EN/ZH) highlights the Python auto-combine option and CLI flag; `AI_NOTES.md` captures the refreshed test instructions.
+
+### Testing
+- Added regression coverage for the audio helper refactor and client auto-combine behaviour; `pytest` commands documented for follow-up runs.
+
 ## [3.3.0-alpha2] - 2025-09-18
 
 ### Changed
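
The changelog entry above mentions a word-level fallback for long-text splitting, but the splitter itself is not part of this diff. As a rough illustration only, the sketch below packs whole words greedily into chunks; `split_by_words` is a hypothetical name, and the "small tolerance" mentioned in the entry is not modelled because the diff does not show how it is defined.

```python
from typing import List


def split_by_words(text: str, max_length: int) -> List[str]:
    """Greedily pack whole words into chunks of at most max_length characters.

    Hypothetical helper, not the function shipped in ttsfm. Splitting on
    whitespace keeps punctuation attached to the word it follows.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_length or not current:
            # Either the word fits, or a single oversized word starts a chunk alone.
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


parts = split_by_words("Hello, world! How are you?", max_length=13)
print(parts)
```

A single word longer than `max_length` is kept whole in its own chunk here; a real splitter would need a policy for that case.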

README.md

Lines changed: 17 additions & 3 deletions
@@ -26,7 +26,7 @@ TTSFM provides both synchronous and asynchronous Python clients for text-to-spee
 - 🔧 **CLI Tool** - Command-line interface for quick TTS generation
 - 📦 **Type Hints** - Full type annotation support for better IDE experience
 - 🛡️ **Error Handling** - Comprehensive exception hierarchy with retry logic
-- **Auto-Combine** (Web API) - Docker/OpenAI-compatible endpoint can split and merge long text for you
+- **Auto-Combine** - Web/OpenAI endpoints merge long text automatically; Python client can opt-in with `auto_combine=True`
 - 📊 **Text Validation** - Automatic text length validation and splitting
 - 🔐 **API Key Protection** - Optional OpenAI-compatible authentication for secure deployments
 
@@ -168,11 +168,22 @@ responses = client.generate_speech_long_text(
     preserve_words=True
 )
 
-# Save each chunk as separate files
 for i, response in enumerate(responses, 1):
-    response.save_to_file(f"part_{i:03d}")  # Saves as part_001.mp3, part_002.mp3, etc.
+    response.save_to_file(f"part_{i:03d}")
 
 print(f"Generated {len(responses)} audio files from long text")
+
+# Or combine everything into a single response (requires pydub for non-WAV formats)
+combined = client.generate_speech_long_text(
+    text="Very long text that exceeds 4096 characters...",
+    voice=Voice.ALLOY,
+    response_format=AudioFormat.MP3,
+    max_length=2000,
+    preserve_words=True,
+    auto_combine=True,
+)
+
+combined.save_to_file("long_text")  # Saves as long_text.mp3
 ```
 
 #### OpenAI Python Client Compatibility
@@ -256,6 +267,9 @@ ttsfm --text-file input.txt --output speech.mp3
 # Custom service URL
 ttsfm "Hello, world!" --url http://localhost:7000 --output hello.mp3
 
+# Auto-combine long text into a single file
+ttsfm --text-file article.txt --output article.mp3 --split-long-text --auto-combine
+
 # List available voices
 ttsfm --list-voices

README.zh.md

Lines changed: 16 additions & 1 deletion
@@ -26,7 +26,7 @@ TTSFM为文本转语音生成提供同步和异步Python客户端,使用逆向
 - 🔧 **CLI工具** - 用于快速TTS生成的命令行界面
 - 📦 **类型提示** - 完整的类型注解支持,提供更好的IDE体验
 - 🛡️ **错误处理** - 全面的异常层次结构和重试逻辑
-- **自动合并(Web API)** - Docker / OpenAI 兼容端点可自动分割并合并长文本
+- **自动合并** - Web/OpenAI 端点自动处理长文本;Python 客户端可通过 `auto_combine=True` 合并音频
 - 📊 **文本验证** - 自动文本长度验证和分割
 - 🔐 **API密钥保护** - 可选的OpenAI兼容身份验证,用于安全部署
 
@@ -161,6 +161,18 @@ for i, response in enumerate(responses, 1):
     response.save_to_file(f"part_{i:03d}")  # 保存为part_001.mp3、part_002.mp3等
 
 print(f"从长文本生成了 {len(responses)} 个音频文件")
+
+# 或合并为单个音频(非WAV格式需要安装pydub)
+combined = client.generate_speech_long_text(
+    text="超过4096字符的很长文本...",
+    voice=Voice.ALLOY,
+    response_format=AudioFormat.MP3,
+    max_length=2000,
+    preserve_words=True,
+    auto_combine=True
+)
+
+combined.save_to_file("long_text")  # 保存为 long_text.mp3
 ```
 
 #### OpenAI Python客户端兼容性
@@ -244,6 +256,9 @@ ttsfm --text-file input.txt --output speech.mp3
 # 自定义服务URL
 ttsfm "你好,世界!" --url http://localhost:7000 --output hello.mp3
 
+# 自动合并长文本并生成单个音频
+ttsfm --text-file article.txt --output article.mp3 --split-long-text --auto-combine
+
 # 列出可用声音
 ttsfm --list-voices

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -86,7 +86,7 @@ ttsfm = "ttsfm.cli:main"
 version_scheme = "no-guess-dev"
 local_scheme = "no-local-version"
 
-fallback_version = "3.3.0-alpha2"
+fallback_version = "3.3.0-alpha3"
 [tool.setuptools]
 packages = ["ttsfm"]

tests/test_clients.py

Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
+import pytest
+
+from ttsfm.client import TTSClient
+from ttsfm.async_client import AsyncTTSClient
+from ttsfm.models import TTSResponse, AudioFormat
+
+
+def _mk_response(data: bytes) -> TTSResponse:
+    return TTSResponse(
+        audio_data=data,
+        content_type="audio/mpeg",
+        format=AudioFormat.MP3,
+        size=len(data),
+    )
+
+
+def test_sync_long_text_auto_combine(monkeypatch):
+    client = TTSClient()
+
+    monkeypatch.setattr(
+        client,
+        "generate_speech_batch",
+        lambda **kwargs: [_mk_response(b"one"), _mk_response(b"two")],
+    )
+
+    combined_flag = {}
+
+    def fake_combine(responses):
+        combined_flag["called"] = True
+        return _mk_response(b"onetwo")
+
+    monkeypatch.setattr("ttsfm.client.combine_responses", fake_combine)
+
+    result = client.generate_speech_long_text(
+        text="dummy",
+        auto_combine=True,
+    )
+
+    assert combined_flag["called"] is True
+    assert isinstance(result, TTSResponse)
+    assert result.audio_data == b"onetwo"
+
+
+def test_sync_long_text_returns_list_without_auto_combine(monkeypatch):
+    client = TTSClient()
+
+    responses = [_mk_response(b"one")]
+    monkeypatch.setattr(client, "generate_speech_batch", lambda **_: responses)
+
+    result = client.generate_speech_long_text(text="dummy", auto_combine=False)
+
+    assert result is responses
+
+
+@pytest.mark.asyncio
+async def test_async_long_text_auto_combine(monkeypatch):
+    client = AsyncTTSClient()
+
+    async def fake_batch(**kwargs):
+        return [_mk_response(b"one"), _mk_response(b"two")]
+
+    monkeypatch.setattr(client, "generate_speech_batch", fake_batch)
+
+    def fake_combine(responses):
+        return _mk_response(b"onetwo")
+
+    monkeypatch.setattr("ttsfm.async_client.combine_responses", fake_combine)
+
+    result = await client.generate_speech_long_text(
+        text="dummy",
+        auto_combine=True,
+    )
+
+    assert isinstance(result, TTSResponse)
+    assert result.audio_data == b"onetwo"
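
The tests above pin down a simple contract: with `auto_combine=True` the long-text call returns one response, otherwise it returns the list of per-chunk responses. The client code that implements this is not shown in the diff, so the following is only a minimal sketch of that contract. `FakeResponse` and the `chunks` parameter are hypothetical stand-ins, and the byte-joining `combine_responses` here is deliberately naive rather than the helper in `ttsfm/audio.py`.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class FakeResponse:
    """Hypothetical stand-in for TTSResponse; carries only raw bytes."""
    audio_data: bytes


def combine_responses(responses: List[FakeResponse]) -> FakeResponse:
    # Naive byte concatenation, for illustration only. Real audio usually
    # needs format-aware merging (the changelog mentions pydub for non-WAV).
    return FakeResponse(b"".join(r.audio_data for r in responses))


def generate_speech_long_text(
    chunks: List[bytes], auto_combine: bool = False
) -> Union[FakeResponse, List[FakeResponse]]:
    """Return one combined response, or the per-chunk list, depending on the flag."""
    responses = [FakeResponse(c) for c in chunks]
    if auto_combine:
        return combine_responses(responses)
    return responses


single = generate_speech_long_text([b"one", b"two"], auto_combine=True)
many = generate_speech_long_text([b"one", b"two"])
print(type(single).__name__, len(many))
```

Because the return type depends on a flag, callers need an `isinstance` check or a fixed flag value. That is consistent with the commit dropping the `-> list` annotation from the convenience wrapper in `ttsfm/__init__.py`.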

tests/test_web_app.py

Lines changed: 5 additions & 3 deletions
@@ -46,7 +46,9 @@ def test_voices_endpoint_returns_data(monkeypatch):
 
 
 def test_combine_audio_chunks_uses_format_hint(monkeypatch):
-    module = load_web_app(monkeypatch, REQUIRE_API_KEY='false', TTSFM_API_KEY=None)
+    load_web_app(monkeypatch, REQUIRE_API_KEY='false', TTSFM_API_KEY=None)
+
+    from ttsfm import audio as audio_module
 
     class DummySegment:
         def __init__(self, tag: str):
@@ -77,9 +79,9 @@ def from_file(cls, buffer, format: str):
             cls.formats.append(format)
             return DummySegment(format)
 
-    monkeypatch.setattr(module, "AudioSegment", DummyAudioSegment)
+    monkeypatch.setattr(audio_module, "AudioSegment", DummyAudioSegment)
 
-    output = module.combine_audio_chunks([b"one", b"two"], "opus")
+    output = audio_module.combine_audio_chunks([b"one", b"two"], "opus")
 
     assert output == b"opus:opusopus"
     assert DummyAudioSegment.formats == ["opus", "opus"]

ttsfm-web/app.py

Lines changed: 4 additions & 93 deletions
@@ -29,13 +29,15 @@
 # Import the TTSFM package
 try:
     from ttsfm import TTSClient, Voice, AudioFormat, TTSException
+    from ttsfm.audio import combine_audio_chunks
     from ttsfm.exceptions import APIException, NetworkException, ValidationException
     from ttsfm.utils import validate_text_length, split_text_by_length
 except ImportError:
     # Fallback for development when package is not installed
     import sys
     sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..'))
     from ttsfm import TTSClient, Voice, AudioFormat, TTSException
+    from ttsfm.audio import combine_audio_chunks
     from ttsfm.exceptions import APIException, NetworkException, ValidationException
     from ttsfm.utils import validate_text_length, split_text_by_length
 
@@ -265,96 +267,6 @@ def _chunk_bytes(data: bytes, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
         yield bytes(view[offset:offset + chunk_size])
 
 
-try:
-    from pydub import AudioSegment  # type: ignore
-except ImportError:  # pragma: no cover - optional dependency
-    AudioSegment = None  # type: ignore
-
-
-def combine_audio_chunks(audio_chunks: List[bytes], format_type: str = "mp3") -> bytes:
-    """Combine multiple audio chunks into a single audio file."""
-    if not audio_chunks:
-        return b''
-
-    fmt = format_type.lower()
-
-    if AudioSegment is None and fmt != "wav":
-        raise RuntimeError("Combining audio requires pydub for non-WAV formats. Install ttsfm[web].")
-
-    try:
-        if AudioSegment is None:
-            return _simple_wav_concatenation(audio_chunks)
-
-        audio_segments = []
-        for chunk in audio_chunks:
-            buffer = io.BytesIO(chunk)
-            if fmt == "mp3":
-                segment = AudioSegment.from_mp3(buffer)
-            elif fmt == "wav":
-                segment = AudioSegment.from_wav(buffer)
-            else:
-                # OPUS/FLAC/AAC/PCM all require an explicit decoder hint
-                segment = AudioSegment.from_file(buffer, format=fmt)
-            audio_segments.append(segment)
-
-        combined = audio_segments[0]
-        for segment in audio_segments[1:]:
-            combined += segment
-
-        output_buffer = io.BytesIO()
-        export_format = fmt if fmt in {"mp3", "wav", "aac", "flac", "opus", "pcm"} else "wav"
-        combined.export(output_buffer, format=export_format)
-        return output_buffer.getvalue()
-    except Exception as exc:
-        logger.error("Error combining audio chunks: %s", exc)
-        raise
-
-def _simple_wav_concatenation(wav_chunks: List[bytes]) -> bytes:
-    """
-    Simple WAV file concatenation without external dependencies.
-    This is a basic implementation that works for simple WAV files.
-    """
-    if not wav_chunks:
-        return b''
-
-    if len(wav_chunks) == 1:
-        return wav_chunks[0]
-
-    try:
-        # For WAV files, we can do a simple concatenation by:
-        # 1. Taking the header from the first file
-        # 2. Concatenating all the audio data
-        # 3. Updating the file size in the header
-
-        first_wav = wav_chunks[0]
-        if len(first_wav) < 44:  # WAV header is at least 44 bytes
-            return b''.join(wav_chunks)
-
-        # Extract header from first file (first 44 bytes)
-        header = bytearray(first_wav[:44])
-
-        # Collect all audio data (skip headers for subsequent files)
-        audio_data = first_wav[44:]  # Audio data from first file
-
-        for wav_chunk in wav_chunks[1:]:
-            if len(wav_chunk) > 44:
-                audio_data += wav_chunk[44:]  # Skip header, append audio data
-
-        # Update file size in header (bytes 4-7)
-        total_size = len(header) + len(audio_data) - 8
-        header[4:8] = total_size.to_bytes(4, byteorder='little')
-
-        # Update data chunk size in header (bytes 40-43)
-        data_size = len(audio_data)
-        header[40:44] = data_size.to_bytes(4, byteorder='little')
-
-        return bytes(header) + audio_data
-
-    except Exception as e:
-        logger.error(f"Error in simple WAV concatenation: {e}")
-        # Ultimate fallback
-        return b''.join(wav_chunks)
-
 def _is_safe_url(target: Optional[str]) -> bool:
     """Validate that a target URL is safe for redirection.
 
@@ -787,7 +699,7 @@ def get_status():
     return jsonify({
         "status": "online",
         "tts_service": "openai.fm (free)",
-        "package_version": "3.3.0-alpha2",
+        "package_version": "3.3.0-alpha3",
         "timestamp": datetime.now().isoformat()
     })
 
@@ -805,7 +717,7 @@ def health_check():
     """Simple health check endpoint."""
     return jsonify({
         "status": "healthy",
-        "package_version": "3.3.0-alpha2",
+        "package_version": "3.3.0-alpha3",
         "timestamp": datetime.now().isoformat()
     })
 
@@ -1109,4 +1021,3 @@ def internal_error(error):
     finally:
         logger.info("TTSFM web application shut down")
 
-
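
The `_simple_wav_concatenation` helper removed from `ttsfm-web/app.py` assumes a fixed 44-byte header, which holds for canonical PCM files but not for WAVs carrying extra chunks. As a point of comparison only, and not a description of what `ttsfm/audio.py` now does (that file is not shown in this diff), the sketch below uses the standard-library `wave` module so header parsing and size fields are handled for you. `make_wav` and `concat_wavs` are hypothetical names.

```python
import io
import wave
from typing import List, Optional, Tuple


def make_wav(frames: bytes, rate: int = 8000) -> bytes:
    """Build a small mono 16-bit WAV in memory (demo helper)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)
        writer.setframerate(rate)
        writer.writeframes(frames)
    return buf.getvalue()


def concat_wavs(chunks: List[bytes]) -> bytes:
    """Concatenate WAV files that share channels, sample width and rate."""
    if not chunks:
        return b""
    params: Optional[Tuple[int, int, int]] = None
    frames: List[bytes] = []
    for chunk in chunks:
        with wave.open(io.BytesIO(chunk), "rb") as reader:
            current = (
                reader.getnchannels(),
                reader.getsampwidth(),
                reader.getframerate(),
            )
            if params is None:
                params = current
            elif current != params:
                raise ValueError("WAV chunks have mismatched audio parameters")
            frames.append(reader.readframes(reader.getnframes()))
    out = io.BytesIO()
    with wave.open(out, "wb") as writer:
        writer.setnchannels(params[0])
        writer.setsampwidth(params[1])
        writer.setframerate(params[2])
        writer.writeframes(b"".join(frames))  # header sizes are patched on close
    return out.getvalue()


first = make_wav(b"\x01\x00" * 4)
second = make_wav(b"\x02\x00" * 6)
combined = concat_wavs([first, second])
with wave.open(io.BytesIO(combined), "rb") as check:
    print(check.getnframes())
```

Unlike the byte-slicing version, this raises on mismatched sample rates instead of silently producing a file that plays at the wrong speed.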

ttsfm-web/templates/base.html

Lines changed: 2 additions & 2 deletions
@@ -88,7 +88,7 @@
 <a class="navbar-brand" href="{{ url_for('index') }}">
     <i class="fas fa-microphone-alt me-2"></i>
     <span class="fw-bold">TTSFM</span>
-    <span class="badge bg-primary ms-2 small">v3.3.0-alpha2</span>
+    <span class="badge bg-primary ms-2 small">v3.3.0-alpha3</span>
 </a>
 
 <button class="navbar-toggler border-0" type="button" data-bs-toggle="collapse" data-bs-target="#navbarNav" aria-controls="navbarNav" aria-expanded="false" aria-label="Toggle navigation">
@@ -159,7 +159,7 @@
 <div class="d-flex align-items-center">
     <i class="fas fa-microphone-alt me-2 text-primary"></i>
     <strong class="text-dark">TTSFM</strong>
-    <span class="ms-2 text-muted">v3.3.0-alpha2</span>
+    <span class="ms-2 text-muted">v3.3.0-alpha3</span>
 </div>
 </div>
 <div class="col-md-6 text-md-end">

ttsfm/__init__.py

Lines changed: 5 additions & 2 deletions
@@ -57,12 +57,13 @@
     QuotaExceededException,
     AudioProcessingException
 )
+from .audio import combine_audio_chunks, combine_responses
 from .utils import (
     validate_text_length,
     split_text_by_length
 )
 
-__version__ = "3.3.0-alpha2"
+__version__ = "3.3.0-alpha3"
 __author__ = "dbcccc"
 __email__ = "[email protected]"
 __description__ = "Text-to-Speech API Client with OpenAI compatibility"
@@ -124,7 +125,7 @@ def generate_speech(text: str, voice: str = "alloy", **kwargs) -> bytes:
 
     return default_client.generate_speech(text=text, voice=voice, **kwargs)
 
-def generate_speech_long_text(text: str, voice: str = "alloy", **kwargs) -> list:
+def generate_speech_long_text(text: str, voice: str = "alloy", **kwargs):
     """
     Convenience function to generate speech from long text using the default client.
 
@@ -183,6 +184,8 @@ def generate_speech_long_text(text: str, voice: str = "alloy", **kwargs) -> list
     # Utility functions
     "validate_text_length",
     "split_text_by_length",
+    "combine_audio_chunks",
+    "combine_responses",
 
     # Package metadata
     "__version__",
