Python SDK for Volcengine (ByteDance) Audio Services, providing comprehensive support for Text-to-Speech (TTS), Speech-to-Text (STT), and Realtime Dialogue capabilities.
- Speech-to-Text (STT): Convert audio to text using Volcengine's ASR services (V2 and V3 APIs)
- Text-to-Speech (TTS): Synthesize natural-sounding speech from text with various voice types
- Realtime Dialogue: Bidirectional streaming for interactive voice conversations
- Protocol Support: Low-level protocol utilities for custom implementations
- Type Safety: Full Pydantic model validation for all requests and responses
- 2026-03-04
- If you find that the upstream documentation has changed, please let me know or submit a PR.
- realtime API source: https://www.volcengine.com/docs/6561/1594356?lang=zh
# From PyPI (when published)
pip install volcengine-audio

# From source
git clone https://github.com/aiyou178/volcengine-audio.git
cd volcengine-audio
pip install -e .

from volcengine_audio import (
    VolcengineAsrRequestV3,
    VolcengineAsrFunctionsV3,
    STTAudioFormatV3,
)

# Create ASR request
asr_request = VolcengineAsrRequestV3(
    audio=VolcengineAsrRequestV3.Audio(
        format=STTAudioFormatV3.wav,
        rate=16000,
    ),
    request=VolcengineAsrRequestV3.Request(
        model_name="bigmodel",
        enable_itn=True,
        enable_punc=True,
    ),
)

# Generate request payload
request_params = asr_request.model_dump(exclude_none=True)
full_request = VolcengineAsrFunctionsV3.generate_asr_full_client_request(
    sequence=1,
    request_params=request_params,
    compression=True,
)

# Send audio chunks (audio_chunk is a placeholder for bytes read from your audio source)
audio_request = VolcengineAsrFunctionsV3.generate_asr_audio_only_request(
    sequence=2,
    audio=audio_chunk,
    compress=True,
)

# Parse response (server_response is a placeholder for bytes received from the service)
response_data = VolcengineAsrFunctionsV3.parse_response(server_response)
print(response_data['message'])

from volcengine_audio import (
    VolcengineTTSBidirectionRequest,
    VolcengineTTSFunctions,
    TTSBigmodelResourceType,
    TTSAudioFormat,
    EventSend,
)

# Create TTS request
tts_request = VolcengineTTSBidirectionRequest(
    event=EventSend.StartSession,
    req_params=VolcengineTTSBidirectionRequest.ReqParams(
        text="Hello, this is a test.",
        speaker="zh_female_vv_jupiter_bigtts",
        model=TTSBigmodelResourceType.seed_tts_2_0,
        audio_params=VolcengineTTSBidirectionRequest.ReqParams.AudioParams(
            format=TTSAudioFormat.mp3,
            sample_rate=24000,
        ),
    ),
)

# Create connection
connection_payload = VolcengineTTSFunctions.start_connection_payload()

# Start session
session_payload = VolcengineTTSFunctions.start_session_payload(
    session_id="unique-session-id",
    req_params=tts_request.req_params.model_dump(exclude_none=True),
)

# Parse response (server_response is a placeholder for bytes received from the service)
event, session_id, payload = VolcengineTTSFunctions.extract_response_payload(server_response)

from volcengine_audio import (
    RealtimeDialogueConfig,
    RealtimeDialogueFunctions,
    ChatTTSTextRequest,
)

# Configure dialogue session
config = RealtimeDialogueConfig(
    dialog=RealtimeDialogueConfig.DialogConfig(
        bot_name="AI Assistant",
        system_role="You are a helpful assistant.",
        speaking_style="Professional and friendly.",
    ),
    tts=RealtimeDialogueConfig.TTSConfig(
        speaker=RealtimeDialogueConfig.TTSConfig.Speaker.zh_female_vv_jupiter_bigtts,
    ),
)

# Start connection
connection = RealtimeDialogueFunctions.start_connection_payload()

# Start session
session = RealtimeDialogueFunctions.start_session_payload(
    session_id="session-123",
    config=config,
)

# Send audio for recognition (audio_bytes is a placeholder for captured audio)
audio_payload = RealtimeDialogueFunctions.task_request_payload(
    session_id="session-123",
    audio_data=audio_bytes,
)

# Request TTS for text
tts_payload = RealtimeDialogueFunctions.chat_tts_text_payload(
    session_id="session-123",
    tts_request=ChatTTSTextRequest(
        start=True,
        content="Hello!",
        end=True,
    ),
)

# Finish session
finish = RealtimeDialogueFunctions.finish_session_payload("session-123")

Core protocol definitions and utilities.
Classes:
- ProtocolVersion: Protocol version enumeration (V1)
- MessageType: Message types for bidirectional communication
- EventSend: Events sent from client to server
- EventReceive: Events received from server
- SerializationMethod: Payload serialization methods (JSON, RAW, PROTOBUF)
- CompressionMethod: Payload compression methods (NONE, GZIP)
Constants:
- HOST: 'openspeech.bytedance.com' (Volcengine audio service host)
Functions:
- generate_header(): Generate protocol header for requests
- generate_before_payload(): Generate sequence number before payload
Speech-to-Text (ASR) models and utilities.
Request Models:
- VolcengineAsrRequestV3: ASR V3 API request
- VolcengineAsrRequestV2: ASR V2 API request
Response Models:
- AsrFullServerResponseV2: Full server response for V2
- ListenBidirectionPackage: Bidirectional listening package
Enums:
- STTResource: STT resource types for billing
- STTAudioFormatV3: Audio formats (pcm, wav, mp3, ogg)
- STTResultType: Result types (full, single)
- STTBigmodelNoStreamLanguage: Supported languages for bigmodel
Helper Classes:
- VolcengineAsrFunctionsV3: V3 API helper functions
  - generate_asr_full_client_request(): Generate full client request
  - generate_asr_audio_only_request(): Generate audio-only request
  - parse_response(): Parse server response
- VolcengineAsrFunctionsV2: V2 API helper functions
  - full_client_request(): Generate full client request
  - audio_only_request(): Generate audio-only request
Text-to-Speech models and utilities.
Request Models:
- VolcengineTTSRequest: Standard TTS request
- VolcengineTTSBidirectionRequest: Bidirectional TTS request
- TTSReqParams: TTS request parameters with audio settings
Response Models:
- TTSSentenceStartResponse: Sentence start notification
- TTSSentenceEndResponse: Sentence end notification
- TTSEndResponse: TTS ended notification
Enums:
- TTSBigmodelResourceType: TTS model types (seed-tts-1.0, seed-tts-2.0, etc.)
- TTSAudioFormat: Audio formats (wav, pcm, mp3, ogg_opus)
Helper Classes:
- VolcengineTTSFunctions: TTS API helper functions
  - start_connection_payload(): Start connection
  - start_session_payload(): Start TTS session
  - finish_session_payload(): Finish TTS session
  - extract_response_payload(): Extract and parse response
  - calculate_payload(): Calculate request payload
Realtime dialogue (combined TTS+STT) models and utilities.
Configuration:
- RealtimeDialogueConfig: Complete dialogue session configuration
  - DialogConfig: Bot persona, speaking style, location
  - TTSConfig: Voice type and audio settings
  - Asr: ASR-specific settings
Request Models:
- SayHelloRequest: Greeting message
- ChatTTSTextRequest: Text to synthesize with TTS
- ChatTextQueryRequest: Text query for dialogue
Response Models:
- ASRInfoResponse: ASR task info (first word detection)
- ASRResponseModel: ASR recognition result
- ASREndedResponse: ASR ended notification
- ChatResponseModel: Chat response
- SessionStartedResponse: Session started
- SessionFailedResponse: Session failed
Helper Classes:
- RealtimeDialogueFunctions: Realtime dialogue API helpers
  - start_connection_payload(): Start connection
  - start_session_payload(): Start dialogue session
  - task_request_payload(): Send audio for recognition
  - say_hello_payload(): Send greeting
  - chat_tts_text_payload(): Request TTS for text
  - chat_text_query_payload(): Send text query
  - finish_session_payload(): Finish session
All messages follow a standard protocol structure:
[Header 4 bytes][Optional Fields][Payload Size 4 bytes][Payload]
Byte 0: [protocol_version:4 bits][header_size:4 bits]
Byte 1: [message_type:4 bits][message_type_specific_flags:4 bits]
Byte 2: [serialization_method:4 bits][compression:4 bits]
Byte 3: [reserved:8 bits]
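The byte layout above can be decoded with a few shifts and masks. The sketch below is not part of the package: `Header` and `parse_header` are hypothetical names, and it assumes that the first field listed for each byte occupies the high nibble.

```python
from typing import NamedTuple


class Header(NamedTuple):
    """Hypothetical container for the fields of the 4-byte header."""
    protocol_version: int
    header_size: int
    message_type: int
    flags: int
    serialization: int
    compression: int
    reserved: int


def parse_header(data: bytes) -> Header:
    """Split the first four bytes of a message into their fields."""
    if len(data) < 4:
        raise ValueError("a header needs at least 4 bytes")
    b0, b1, b2, b3 = data[:4]
    return Header(
        protocol_version=b0 >> 4,   # high nibble of byte 0 (assumed)
        header_size=b0 & 0x0F,      # low nibble of byte 0
        message_type=b1 >> 4,
        flags=b1 & 0x0F,
        serialization=b2 >> 4,
        compression=b2 & 0x0F,
        reserved=b3,
    )


header = parse_header(bytes([0x11, 0x10, 0x11, 0x00]))
print(header)
```

Returning a named tuple keeps the decoded fields readable at call sites while staying cheap to construct for every incoming message.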
Protocol version:
- V1 (0b0001): Current protocol version
Client → Server:
- FULL_CLIENT_REQUEST (0b0001): Full request with metadata
- AUDIO_ONLY_REQUEST (0b0010): Audio-only request
Server → Client:
- FULL_SERVER_RESPONSE (0b1001): Full response with metadata
- AUDIO_ONLY_RESPONSE (0b1011): Audio-only response
- ERROR_INFORMATION (0b1111): Error information
Serialization methods:
- RAW (0b0000): Raw binary data
- JSON (0b0001): JSON-encoded payload
- PROTOBUF (0b0010): Protocol Buffers
- THRIFT (0b0011): Apache Thrift
Compression methods:
- NONE (0b0000): No compression
- GZIP (0b0001): GZIP compression
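Using the codes listed above, a frame without optional fields can be assembled by hand. This is a minimal sketch under stated assumptions: the payload size is taken to be a big-endian unsigned 32-bit integer that counts the payload bytes after compression, and a header_size value of 1 is taken to mean the 4-byte header. The enum classes and `build_frame` are local stand-ins written for illustration, not the package's own `generate_header()`.

```python
import gzip
import json
import struct
from enum import IntEnum


class MessageType(IntEnum):
    FULL_CLIENT_REQUEST = 0b0001
    AUDIO_ONLY_REQUEST = 0b0010


class Serialization(IntEnum):
    RAW = 0b0000
    JSON = 0b0001


class Compression(IntEnum):
    NONE = 0b0000
    GZIP = 0b0001


def build_frame(
    message_type: MessageType,
    payload: bytes,
    serialization: Serialization,
    compression: Compression,
    flags: int = 0,
) -> bytes:
    """Return [header][payload size][payload] with no optional fields."""
    header = bytes([
        (0b0001 << 4) | 0b0001,              # version V1, header size 1 (assumed: 4 bytes)
        (message_type << 4) | (flags & 0x0F),
        (serialization << 4) | compression,
        0x00,                                # reserved
    ])
    if compression == Compression.GZIP:
        payload = gzip.compress(payload)
    # Assumption: size is big-endian and measured after compression.
    return header + struct.pack(">I", len(payload)) + payload


body = json.dumps({"audio": {"format": "wav", "rate": 16000}}).encode("utf-8")
frame = build_frame(
    MessageType.FULL_CLIENT_REQUEST, body, Serialization.JSON, Compression.GZIP
)
print(frame[:4].hex(), len(frame))
```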
TTS flow:
Client                     Server
|                               |
|-- StartConnection ----------->|
|<---------- ConnectionStarted--|
|                               |
|-- StartSession -------------->|
|<------------ SessionStarted---|
|                               |
|-- TaskRequest (text) -------->|
|<--------- TTSSentenceStart----|
|<--------- TTSResponse (audio)-|
|<----------- TTSSentenceEnd----|
|                               |
|-- FinishSession ------------->|
|<---------- SessionFinished----|
|                               |
|-- FinishConnection ---------->|
|<-------- ConnectionFinished---|
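The exchange above can be read as a small state machine on the client side, which is handy for catching out-of-order server events early. The sketch below is illustrative only: `Phase`, `TRANSITIONS`, `advance`, and `replay` are hypothetical names, and the event names are plain strings copied from the diagram rather than members of the package's `EventReceive` enum.

```python
from enum import Enum, auto
from typing import Iterable


class Phase(Enum):
    IDLE = auto()
    CONNECTED = auto()
    IN_SESSION = auto()
    SPEAKING = auto()
    SESSION_DONE = auto()
    CLOSED = auto()


# (current phase, server event) -> next phase, following the diagram.
TRANSITIONS = {
    (Phase.IDLE, "ConnectionStarted"): Phase.CONNECTED,
    (Phase.CONNECTED, "SessionStarted"): Phase.IN_SESSION,
    (Phase.IN_SESSION, "TTSSentenceStart"): Phase.SPEAKING,
    (Phase.SPEAKING, "TTSResponse"): Phase.SPEAKING,
    (Phase.SPEAKING, "TTSSentenceEnd"): Phase.IN_SESSION,
    (Phase.IN_SESSION, "SessionFinished"): Phase.SESSION_DONE,
    (Phase.SESSION_DONE, "ConnectionFinished"): Phase.CLOSED,
}


def advance(phase: Phase, event: str) -> Phase:
    """Return the next phase, or raise if the event is not expected here."""
    try:
        return TRANSITIONS[(phase, event)]
    except KeyError:
        raise ValueError(f"unexpected {event!r} while {phase.name}") from None


def replay(events: Iterable[str]) -> Phase:
    """Feed a sequence of server events through the state machine."""
    phase = Phase.IDLE
    for event in events:
        phase = advance(phase, event)
    return phase


final = replay([
    "ConnectionStarted", "SessionStarted",
    "TTSSentenceStart", "TTSResponse", "TTSResponse", "TTSSentenceEnd",
    "SessionFinished", "ConnectionFinished",
])
print(final.name)
```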
ASR flow:
Client                     Server
|                               |
|-- FullClientRequest --------->|
|                               |
|-- AudioOnlyRequest (chunk1)-->|
|<------------- FullResponse----|
|                               |
|-- AudioOnlyRequest (chunk2)-->|
|<------------- FullResponse----|
|                               |
|-- AudioOnlyRequest (last) --->|
|<------------- FullResponse----|
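The audio-only requests in this flow need the input split into chunks, each with its own sequence number. The helper below is a hypothetical sketch, not part of the package. It assumes the numbering used in the quick start (the full client request takes sequence 1, so audio starts at 2) and only reports which chunk is the last one; how the final chunk is signalled to the service is left to the package helpers.

```python
from typing import Iterator, Tuple


def iter_audio_chunks(
    audio: bytes, chunk_size: int, first_sequence: int = 2
) -> Iterator[Tuple[int, bytes, bool]]:
    """Yield (sequence, chunk, is_last) for each slice of the audio buffer."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    total = len(audio)
    for index, start in enumerate(range(0, total, chunk_size)):
        chunk = audio[start:start + chunk_size]
        is_last = start + chunk_size >= total
        yield first_sequence + index, chunk, is_last


# 16 kHz, 16-bit mono PCM: 100 ms of audio is 3200 bytes.
pcm = bytes(8000)
chunks = [(seq, len(chunk), last) for seq, chunk, last in iter_audio_chunks(pcm, 3200)]
print(chunks)  # [(2, 3200, False), (3, 3200, False), (4, 1600, True)]
```

Each yielded chunk could then be passed to `generate_asr_audio_only_request()` as in the quick start.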
Realtime dialogue flow:
Client                     Server
|                               |
|-- StartConnection ----------->|
|<---------- ConnectionStarted--|
|                               |
|-- StartSession (config) ----->|
|<------------ SessionStarted---|
|                               |
|-- TaskRequest (audio) ------->|
|<-------------- ASRInfo--------|
|<------------ ASRResponse------|
|<-------------- ASREnded-------|
|                               |
|<----------- ChatResponse------|
|<------- TTSSentenceStart------|
|<--------- TTSResponse (audio)-|
|<--------- TTSSentenceEnd------|
|<------------- ChatEnded-------|
|                               |
|-- FinishSession ------------->|
|<---------- SessionFinished----|
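Within one dialogue turn, the server events in this flow carry three kinds of data: recognized text, the chat reply, and synthesized audio. One way to gather them is sketched below. `Turn` and `collect_turn` are hypothetical names, the events are represented as `(name, payload)` tuples with names copied from the diagram, and the payload shapes (text as `str`, audio as `bytes`) are assumptions made for illustration rather than the package's response models.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Tuple, Union

Payload = Optional[Union[str, bytes]]


@dataclass
class Turn:
    """Hypothetical accumulator for one dialogue turn."""
    transcripts: List[str] = field(default_factory=list)
    replies: List[str] = field(default_factory=list)
    audio: bytearray = field(default_factory=bytearray)
    finished: bool = False


def collect_turn(events: Iterable[Tuple[str, Payload]]) -> Turn:
    """Consume events until ChatEnded and group their payloads by kind."""
    turn = Turn()
    for name, payload in events:
        if name == "ASRResponse":
            turn.transcripts.append(str(payload))
        elif name == "ChatResponse":
            turn.replies.append(str(payload))
        elif name == "TTSResponse" and isinstance(payload, bytes):
            turn.audio.extend(payload)
        elif name == "ChatEnded":
            turn.finished = True
            break
        # ASRInfo, ASREnded, TTSSentenceStart and TTSSentenceEnd carry no
        # data that this sketch needs, so they are skipped.
    return turn


turn = collect_turn([
    ("ASRInfo", None),
    ("ASRResponse", "hello"),
    ("ASREnded", None),
    ("ChatResponse", "Hi there!"),
    ("TTSSentenceStart", None),
    ("TTSResponse", b"\x01\x02"),
    ("TTSResponse", b"\x03"),
    ("TTSSentenceEnd", None),
    ("ChatEnded", None),
])
print(turn.transcripts, turn.replies, len(turn.audio), turn.finished)
```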
from volcengine_audio import VolcengineAsrRequestV3

request = VolcengineAsrRequestV3(
    request=VolcengineAsrRequestV3.Request(
        corpus=VolcengineAsrRequestV3.Request.Corpus(
            context=VolcengineAsrRequestV3.Request.Corpus.Context(
                hotwords=[
                    {"word": "Volcengine"},
                    {"word": "ByteDance"},
                ],
                context_type="dialog_ctx",
            ),
        ),
        sensitive_words_filter=VolcengineAsrRequestV3.Request.SensitiveWordsFilter(
            system_reserved_filter=True,
            filter_with_signed=["badword1", "badword2"],
        ),
    ),
)

from volcengine_audio import VolcengineTTSBidirectionRequest

request = VolcengineTTSBidirectionRequest.ReqParams(
    text="Hello",
    speaker="custom_mix",
    mix_speaker=VolcengineTTSBidirectionRequest.ReqParams.MixSpeaker(
        speakers=[
            {
                "source_speaker": "zh_female_vv_jupiter_bigtts",
                "mix_factor": 0.6,
            },
            {
                "source_speaker": "zh_male_yunzhou_jupiter_bigtts",
                "mix_factor": 0.4,
            },
        ],
    ),
)

from volcengine_audio import TTSReqParams

audio_params = TTSReqParams.AudioParams(
    emotion="happy",
    emotion_scale=5,   # Max intensity
    speech_rate=50,    # 1.5x speed
    loudness_rate=20,  # 1.2x volume
    pitch=2,           # Slightly higher pitch
)

from volcengine_audio import RealtimeDialogueConfig

config = RealtimeDialogueConfig(
    dialog=RealtimeDialogueConfig.DialogConfig(
        extra=RealtimeDialogueConfig.DialogConfig.Extra(
            enable_volc_websearch=True,
            volc_websearch_type="web_summary",
            volc_websearch_api_key="your-api-key",
            volc_websearch_result_count=5,
        ),
    ),
)

from volcengine_audio import EventReceive
from volcengine_audio import VolcengineTTSFunctions

# response is a placeholder for the raw bytes received from the service
event, session_id, payload = VolcengineTTSFunctions.extract_response_payload(response)
if event == EventReceive.SessionFailed:
    print(f"Session failed: {payload.get('error')}")
elif event == EventReceive.ConnectionFailed:
    print(f"Connection failed: {payload.get('error')}")
elif event == EventReceive.SERVER_PROCESSING_ERROR:
    print("Server processing error")

pytest tests/

This package uses Ruff for linting and formatting:
ruff check src/ tests/
ruff format src/ tests/

License: MIT