
Commit edada01

xitzhang and Xiting Zhang authored
[VoiceLive] Update for 2026-06-01-preview (#47089)
* Add VoiceLive beta API updates
* update cspell names
* update docs
* Update VoiceLive docs and tests
* Align VoiceLive beta defaults and fixes

---------

Co-authored-by: Xiting Zhang <xitzhang@microsoft.com>
1 parent ee74719 commit edada01

24 files changed

Lines changed: 1494 additions & 186 deletions

sdk/voicelive/azure-ai-voicelive/CHANGELOG.md

Lines changed: 35 additions & 0 deletions
@@ -1,5 +1,40 @@
 # Release History

+## 1.3.0b1 (Unreleased)
+
+### Features Added
+
+- **Azure Realtime Native Voice Support**: Added `AzureRealtimeNativeVoice` and
+  `AzureRealtimeNativeVoiceName`, and expanded `voice` fields to accept Azure realtime native voices.
+- **WebRTC Call Negotiation Support**: Added `ClientEventRtcCallSdpCreate`, `ServerEventRtcCallSdpCreated`,
+  `ServerEventRtcCallError`, and `RtcCallErrorDetails` for SDP-based WebRTC call setup.
+- **Input Text Streaming Support**: Added `ClientEventInputTextDelta` and `ClientEventInputTextDone`
+  for incrementally streaming text input into existing conversation items.
+- **Hosted Agent Invocation Input**: Added `invoke_input` to `ResponseCreateParams` and
+  `ServerEventResponseInvocationDelta` for hosted agent invocation passthrough data.
+- **Audio Playback Lifecycle Events**: Added `ServerEventOutputAudioBufferStarted` and
+  `ServerEventOutputAudioBufferStopped` to track model audio playback start and stop.
+- **Echo Cancellation Configuration**: Added `EchoCancellationReferenceSource` and new
+  `reference_source` / `channels` options on `AudioEchoCancellation` to support both the default
+  server loopback reference path and client-provided stereo echo reference input.
+- **Smart End-of-Turn Detection**: Added `SmartEndOfTurnDetection` as an audio-based end-of-turn
+  detection option.
+- **Parallel Tool Call Control**: Added `parallel_tool_calls` to session models so callers can
+  control whether tool calls may run in parallel.
+
+### Breaking Changes
+
+- **Image Input Field Rename**: Renamed `RequestImageContentPart.url` to `image_url`. Update
+  image input construction to use `image_url=` instead of `url=`.
+- **Default API Version Update**: Changed the SDK default API version from `2026-04-10` to
+  `2026-06-01-preview`. Pass `api_version="2026-04-10"` explicitly to keep the previous default
+  behavior.
+
+### Bug Fixes
+
+- **Deserialization Improvements**: Improved XML model deserialization and common scalar header
+  deserialization paths for better compatibility and lower overhead.
+
 ## 1.2.0 (2026-05-22)

 ### Features Added

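The echo-cancellation entry above mentions client-provided stereo echo reference input, and the README change in this commit specifies the layout: stereo PCM16 with the microphone on channel 0 and the echo reference signal on channel 1. The stdlib-only sketch below builds such a buffer from two mono PCM16 buffers. The function name `interleave_stereo_pcm16` is hypothetical, both inputs are assumed to share the same byte order and sample count, and sending the result to the service is not shown.

```python
import array


def interleave_stereo_pcm16(mic: bytes, reference: bytes) -> bytes:
    """Interleave two mono PCM16 buffers into one stereo PCM16 buffer.

    Channel 0 carries the microphone samples and channel 1 carries the
    echo-reference samples. Samples are copied, not interpreted, so the
    byte order of the inputs is preserved in the output.
    """
    if len(mic) != len(reference) or len(mic) % 2:
        raise ValueError("buffers must have equal length and whole 16-bit samples")
    # "h" is a 2-byte signed integer on mainstream platforms.
    left = array.array("h", mic)
    right = array.array("h", reference)
    # Zero-filled output with room for both channels.
    stereo = array.array("h", bytes(2 * len(mic)))
    stereo[0::2] = left   # channel 0: microphone
    stereo[1::2] = right  # channel 1: echo reference
    return stereo.tobytes()


mic = b"\x01\x00\x02\x00"  # two mono samples: 1, 2 (little-endian)
ref = b"\x03\x00\x04\x00"  # two mono samples: 3, 4 (little-endian)
print(interleave_stereo_pcm16(mic, ref).hex())  # 0100030002000400
```

Working on whole samples through `array` keeps each 2-byte sample intact; a byte-level loop would work too but is slower for realistic frame sizes.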
sdk/voicelive/azure-ai-voicelive/README.md

Lines changed: 41 additions & 20 deletions
@@ -5,7 +5,7 @@ This package provides a **real-time, speech-to-speech** client for Azure AI Voic
 It opens a WebSocket session to stream microphone audio to the service and receive
 typed server events (including audio) for responsive, interruptible conversations.

-> **Status:** General Availability (GA). This is a stable release suitable for production use.
+> **Status:** Preview (`1.3.0b1`). This beta release includes the latest SDK and sample updates and may change before the next stable release.

 > **Important:** As of version 1.0.0, this SDK is **async-only**. The synchronous API has been removed to focus exclusively on async patterns. All examples and samples use `async`/`await` syntax.

@@ -16,34 +16,35 @@ Getting started

 ### Prerequisites

-- **Python 3.9+**
+- **Python 3.10+**
 - An **Azure subscription**
 - A **VoiceLive** resource and endpoint
 - A working **microphone** and **speakers/headphones** if you run the voice samples

 ### Install

-Install the stable GA version:
+Install the latest preview version:

 ```bash
 # Base install (core client only)
-python -m pip install azure-ai-voicelive
+python -m pip install --pre azure-ai-voicelive

 # For asynchronous streaming (uses aiohttp)
-python -m pip install "azure-ai-voicelive[aiohttp]"
+python -m pip install --pre "azure-ai-voicelive[aiohttp]"

 # For voice samples (includes audio processing)
 # First install PyAudio dependencies for your platform:
 # Linux: sudo apt-get install -y portaudio19-dev libasound2-dev
 # macOS: brew install portaudio
-python -m pip install azure-ai-voicelive[aiohttp] pyaudio python-dotenv
+python -m pip install --pre "azure-ai-voicelive[aiohttp]" azure-identity pyaudio python-dotenv
 ```

 The SDK provides async-only WebSocket connections using `aiohttp` for optimal performance and reliability.

 ### Authenticate

-You can authenticate with an **API key** or an **Azure Active Directory (AAD) token**.
+You can authenticate with an **API key** or a Microsoft Entra ID token.
+The samples default to `DefaultAzureCredential`; for local development, `az login` is usually the simplest path.

 #### API Key Authentication (Quick Start)

@@ -66,7 +67,7 @@ async def main():
     async with connect(
         endpoint="your-endpoint",
         credential=AzureKeyCredential("your-api-key"),
-        model="gpt-4o-realtime-preview"
+        model="gpt-realtime"
     ) as connection:
         # Your async code here
         pass
@@ -76,7 +77,7 @@ asyncio.run(main())

 #### AAD Token Authentication

-For production applications, AAD authentication is recommended:
+For production applications, Entra ID authentication is recommended:

 ```python
 import asyncio
@@ -85,14 +86,17 @@ from azure.ai.voicelive import connect

 async def main():
     credential = DefaultAzureCredential()
-
-    async with connect(
-        endpoint="your-endpoint",
-        credential=credential,
-        model="gpt-4o-realtime-preview"
-    ) as connection:
-        # Your async code here
-        pass
+
+    try:
+        async with connect(
+            endpoint="your-endpoint",
+            credential=credential,
+            model="gpt-realtime"
+        ) as connection:
+            # Your async code here
+            pass
+    finally:
+        await credential.close()

 asyncio.run(main())
 ```
@@ -107,13 +111,16 @@ Key concepts
   - **SessionResource** – Update session parameters (voice, formats, VAD) with async methods
   - **RequestSession** – Strongly-typed session configuration
   - **ServerVad** – Configure voice activity detection
+  - **SmartEndOfTurnDetection** – Configure audio-based end-of-turn detection
   - **AzureStandardVoice** – Configure voice settings
+  - **parallel_tool_calls** – Control whether tool calls may run in parallel for a session
 - **Audio Handling**:
   - **InputAudioBufferResource** – Manage audio input to the service with async methods
   - **OutputAudioBufferResource** – Control audio output from the service with async methods
 - **Conversation Management**:
   - **ResponseResource** – Create or cancel model responses with async methods
   - **ConversationResource** – Manage conversation items with async methods
+  - **ClientEventInputTextDelta / ClientEventInputTextDone** – Stream text input incrementally into an item
 - **Error Handling**:
   - **ConnectionError** – Base exception for WebSocket connection errors
   - **ConnectionClosed** – Raised when WebSocket connection is closed
@@ -142,7 +149,7 @@ The Basic Voice Assistant sample demonstrates full-featured voice interaction wi
 python samples/basic_voice_assistant_async.py

 # With custom parameters
-python samples/basic_voice_assistant_async.py --model gpt-4o-realtime-preview --voice alloy --instructions "You're a helpful assistant"
+python samples/basic_voice_assistant_async.py --model gpt-realtime --voice alloy --instructions "You're a helpful assistant"
 ```

 ### Minimal example
@@ -152,12 +159,18 @@ import asyncio
 from azure.core.credentials import AzureKeyCredential
 from azure.ai.voicelive.aio import connect
 from azure.ai.voicelive.models import (
-    RequestSession, Modality, InputAudioFormat, OutputAudioFormat, ServerVad, ServerEventType
+    AudioEchoCancellation,
+    RequestSession,
+    Modality,
+    InputAudioFormat,
+    OutputAudioFormat,
+    ServerVad,
+    ServerEventType,
 )

 API_KEY = "your-api-key"
 ENDPOINT = "wss://your-endpoint.com/openai/realtime"
-MODEL = "gpt-4o-realtime-preview"
+MODEL = "gpt-realtime"

 async def main():
     async with connect(
@@ -170,6 +183,7 @@ async def main():
             instructions="You are a helpful assistant.",
             input_audio_format=InputAudioFormat.PCM16,
             output_audio_format=OutputAudioFormat.PCM16,
+            input_audio_echo_cancellation=AudioEchoCancellation(),
             turn_detection=ServerVad(
                 threshold=0.5,
                 prefix_padding_ms=300,
@@ -187,6 +201,13 @@ async def main():
 asyncio.run(main())
 ```

+`AudioEchoCancellation` now supports both the default server loopback reference path and a
+client-provided stereo echo reference. Use `reference_source="client"` with `channels=2` only when
+your application sends stereo PCM16 input with the microphone on channel 0 and the echo reference
+signal on channel 1.
+
+For image inputs, `RequestImageContentPart` uses the `image_url` field name.
+
 Available Voice Options
 -----------------------

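The README change above lists `ClientEventInputTextDelta` / `ClientEventInputTextDone` for streaming text input incrementally into an item. This commit does not show those events' fields, so the sketch below covers only the client-side preparation step: splitting a string into pieces that could each be carried by one delta event, with a single done event sent after the last piece. `chunk_text` and its `max_chars` parameter are hypothetical, and breaking after spaces is an arbitrary design choice.

```python
def chunk_text(text: str, max_chars: int = 40) -> list[str]:
    """Split text into pieces of at most max_chars characters.

    Breaks just after the last space inside each window when possible, so
    joining the pieces reproduces the original text exactly. A single word
    longer than max_chars is split mid-word.
    """
    if max_chars < 1:
        raise ValueError("max_chars must be positive")
    chunks: list[str] = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer to end the piece right after a space in the window.
            space = text.rfind(" ", start, end)
            if space >= start:
                end = space + 1
        chunks.append(text[start:end])
        start = end
    return chunks


# Each piece would become one delta event; a done event would follow.
print(chunk_text("Please summarize the attached meeting notes.", 16))
# ['Please ', 'summarize the ', 'attached ', 'meeting notes.']
```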
Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 {
-  "apiVersion": "2026-04-10",
+  "apiVersion": "2026-06-01-preview",
   "apiVersions": {
-    "VoiceLive": "2026-04-10"
+    "VoiceLive": "2026-06-01-preview"
   }
 }

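In the metadata file above, `apiVersion` and `apiVersions["VoiceLive"]` are bumped together from `2026-04-10` to `2026-06-01-preview`. A small stdlib check like the one below could flag a half-applied bump. The file's path is not shown on this page, so the sketch takes the JSON text directly; `check_api_versions` is a hypothetical helper, not part of the SDK tooling.

```python
import json


def check_api_versions(metadata_text: str, service: str = "VoiceLive") -> str:
    """Return the API version if the top-level and per-service values agree."""
    data = json.loads(metadata_text)
    top_level = data["apiVersion"]
    per_service = data["apiVersions"][service]
    if top_level != per_service:
        raise ValueError(
            f"apiVersion {top_level!r} != apiVersions[{service!r}] {per_service!r}"
        )
    return top_level


METADATA = """
{
  "apiVersion": "2026-06-01-preview",
  "apiVersions": {
    "VoiceLive": "2026-06-01-preview"
  }
}
"""
print(check_api_versions(METADATA))  # 2026-06-01-preview
```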
sdk/voicelive/azure-ai-voicelive/apiview-properties.json

Lines changed: 26 additions & 16 deletions
@@ -18,6 +18,7 @@
     "azure.ai.voicelive.models.AzureAvatarVoiceSyncVoice": "VoiceLive.AzureAvatarVoiceSyncVoice",
     "azure.ai.voicelive.models.AzureCustomVoice": "VoiceLive.AzureCustomVoice",
     "azure.ai.voicelive.models.AzurePersonalVoice": "VoiceLive.AzurePersonalVoice",
+    "azure.ai.voicelive.models.AzureRealtimeNativeVoice": "VoiceLive.AzureRealtimeNativeVoice",
     "azure.ai.voicelive.models.EouDetection": "VoiceLive.EouDetection",
     "azure.ai.voicelive.models.AzureSemanticDetection": "VoiceLive.AzureSemanticDetection",
     "azure.ai.voicelive.models.AzureSemanticDetectionEn": "VoiceLive.AzureSemanticDetectionEn",
@@ -45,6 +46,7 @@
     "azure.ai.voicelive.models.ClientEventOutputAudioBufferClear": "VoiceLive.ClientEventOutputAudioBufferClear",
     "azure.ai.voicelive.models.ClientEventResponseCancel": "VoiceLive.ClientEventResponseCancel",
     "azure.ai.voicelive.models.ClientEventResponseCreate": "VoiceLive.ClientEventResponseCreate",
+    "azure.ai.voicelive.models.ClientEventRtcCallSdpCreate": "VoiceLive.ClientEventRtcCallSdpCreate",
     "azure.ai.voicelive.models.ClientEventSessionAvatarConnect": "VoiceLive.ClientEventSessionAvatarConnect",
     "azure.ai.voicelive.models.ClientEventSessionUpdate": "VoiceLive.ClientEventSessionUpdate",
     "azure.ai.voicelive.models.ContentPart": "VoiceLive.ContentPart",
@@ -92,6 +94,7 @@
     "azure.ai.voicelive.models.ResponseSession": "VoiceLive.ResponseSession",
     "azure.ai.voicelive.models.ResponseTextContentPart": "VoiceLive.ResponseTextContentPart",
     "azure.ai.voicelive.models.ResponseWebSearchCallItem": "VoiceLive.ResponseWebSearchCallItem",
+    "azure.ai.voicelive.models.RtcCallErrorDetails": "VoiceLive.RtcCallErrorDetails",
     "azure.ai.voicelive.models.Scene": "VoiceLive.Scene",
     "azure.ai.voicelive.models.ServerEvent": "VoiceLive.ServerEvent",
     "azure.ai.voicelive.models.ServerEventConversationItemCreated": "VoiceLive.ServerEventConversationItemCreated",
@@ -111,6 +114,8 @@
     "azure.ai.voicelive.models.ServerEventMcpListToolsFailed": "VoiceLive.ServerEventMcpListToolsFailed",
     "azure.ai.voicelive.models.ServerEventMcpListToolsInProgress": "VoiceLive.ServerEventMcpListToolsInProgress",
     "azure.ai.voicelive.models.ServerEventOutputAudioBufferCleared": "VoiceLive.ServerEventOutputAudioBufferCleared",
+    "azure.ai.voicelive.models.ServerEventOutputAudioBufferStarted": "VoiceLive.ServerEventOutputAudioBufferStarted",
+    "azure.ai.voicelive.models.ServerEventOutputAudioBufferStopped": "VoiceLive.ServerEventOutputAudioBufferStopped",
     "azure.ai.voicelive.models.ServerEventResponseAnimationBlendshapeDelta": "VoiceLive.ServerEventResponseAnimationBlendshapeDelta",
     "azure.ai.voicelive.models.ServerEventResponseAnimationBlendshapeDone": "VoiceLive.ServerEventResponseAnimationBlendshapeDone",
     "azure.ai.voicelive.models.ServerEventResponseAnimationVisemeDelta": "VoiceLive.ServerEventResponseAnimationVisemeDelta",
@@ -131,6 +136,7 @@
     "azure.ai.voicelive.models.ServerEventResponseFileSearchCallSearching": "VoiceLive.ServerEventResponseFileSearchCallSearching",
     "azure.ai.voicelive.models.ServerEventResponseFunctionCallArgumentsDelta": "VoiceLive.ServerEventResponseFunctionCallArgumentsDelta",
     "azure.ai.voicelive.models.ServerEventResponseFunctionCallArgumentsDone": "VoiceLive.ServerEventResponseFunctionCallArgumentsDone",
+    "azure.ai.voicelive.models.ServerEventResponseInvocationDelta": "VoiceLive.ServerEventResponseInvocationDelta",
     "azure.ai.voicelive.models.ServerEventResponseMcpCallArgumentsDelta": "VoiceLive.ServerEventResponseMcpCallArgumentsDelta",
     "azure.ai.voicelive.models.ServerEventResponseMcpCallArgumentsDone": "VoiceLive.ServerEventResponseMcpCallArgumentsDone",
     "azure.ai.voicelive.models.ServerEventResponseMcpCallCompleted": "VoiceLive.ServerEventResponseMcpCallCompleted",
@@ -144,6 +150,8 @@
     "azure.ai.voicelive.models.ServerEventResponseWebSearchCallCompleted": "VoiceLive.ServerEventResponseWebSearchCallCompleted",
     "azure.ai.voicelive.models.ServerEventResponseWebSearchCallInProgress": "VoiceLive.ServerEventResponseWebSearchCallInProgress",
     "azure.ai.voicelive.models.ServerEventResponseWebSearchCallSearching": "VoiceLive.ServerEventResponseWebSearchCallSearching",
+    "azure.ai.voicelive.models.ServerEventRtcCallError": "VoiceLive.ServerEventRtcCallError",
+    "azure.ai.voicelive.models.ServerEventRtcCallSdpCreated": "VoiceLive.ServerEventRtcCallSdpCreated",
     "azure.ai.voicelive.models.ServerEventSessionAvatarConnecting": "VoiceLive.ServerEventSessionAvatarConnecting",
     "azure.ai.voicelive.models.ServerEventSessionAvatarSwitchToIdle": "VoiceLive.ServerEventSessionAvatarSwitchToIdle",
     "azure.ai.voicelive.models.ServerEventSessionAvatarSwitchToSpeaking": "VoiceLive.ServerEventSessionAvatarSwitchToSpeaking",
@@ -165,35 +173,37 @@
     "azure.ai.voicelive.models.VideoParams": "VoiceLive.VideoParams",
     "azure.ai.voicelive.models.VideoResolution": "VoiceLive.VideoResolution",
     "azure.ai.voicelive.models.VoiceLiveErrorDetails": "VoiceLive.VoiceLiveErrorDetails",
-    "azure.ai.voicelive.models.ClientEventType": "VoiceLive.ClientEventType",
-    "azure.ai.voicelive.models.ItemType": "VoiceLive.ItemType",
-    "azure.ai.voicelive.models.ItemParamStatus": "VoiceLive.ItemParamStatus",
-    "azure.ai.voicelive.models.MessageRole": "VoiceLive.MessageRole",
-    "azure.ai.voicelive.models.ContentPartType": "VoiceLive.ContentPartType",
-    "azure.ai.voicelive.models.Modality": "VoiceLive.Modality",
+    "azure.ai.voicelive.models.AnimationOutputType": "VoiceLive.AnimationOutputType",
     "azure.ai.voicelive.models.OpenAIVoiceName": "VoiceLive.OAIVoice",
     "azure.ai.voicelive.models.AzureVoiceType": "VoiceLive.AzureVoiceType",
     "azure.ai.voicelive.models.PersonalVoiceModels": "VoiceLive.PersonalVoiceModels",
-    "azure.ai.voicelive.models.OutputAudioFormat": "VoiceLive.OutputAudioFormat",
+    "azure.ai.voicelive.models.AzureRealtimeNativeVoiceName": "VoiceLive.AzureRealtimeNativeVoiceName",
+    "azure.ai.voicelive.models.EouThresholdLevel": "VoiceLive.EouThresholdLevel",
+    "azure.ai.voicelive.models.TurnDetectionType": "VoiceLive.TurnDetectionType",
+    "azure.ai.voicelive.models.EchoCancellationReferenceSource": "VoiceLive.EchoCancellationReferenceSource",
+    "azure.ai.voicelive.models.AvatarConfigTypes": "VoiceLive.AvatarConfigTypes",
+    "azure.ai.voicelive.models.PhotoAvatarBaseModes": "VoiceLive.PhotoAvatarBaseModes",
+    "azure.ai.voicelive.models.AvatarOutputProtocol": "VoiceLive.AvatarOutputProtocol",
     "azure.ai.voicelive.models.ToolType": "VoiceLive.ToolType",
     "azure.ai.voicelive.models.MCPApprovalType": "VoiceLive.MCPApprovalType",
-    "azure.ai.voicelive.models.ReasoningEffort": "VoiceLive.ReasoningEffort",
     "azure.ai.voicelive.models.InterimResponseConfigType": "VoiceLive.InterimResponseConfigType",
     "azure.ai.voicelive.models.InterimResponseTrigger": "VoiceLive.InterimResponseTrigger",
-    "azure.ai.voicelive.models.AnimationOutputType": "VoiceLive.AnimationOutputType",
+    "azure.ai.voicelive.models.Modality": "VoiceLive.Modality",
     "azure.ai.voicelive.models.InputAudioFormat": "VoiceLive.InputAudioFormat",
-    "azure.ai.voicelive.models.TurnDetectionType": "VoiceLive.TurnDetectionType",
-    "azure.ai.voicelive.models.EouThresholdLevel": "VoiceLive.EouThresholdLevel",
-    "azure.ai.voicelive.models.AvatarConfigTypes": "VoiceLive.AvatarConfigTypes",
-    "azure.ai.voicelive.models.PhotoAvatarBaseModes": "VoiceLive.PhotoAvatarBaseModes",
-    "azure.ai.voicelive.models.AvatarOutputProtocol": "VoiceLive.AvatarOutputProtocol",
+    "azure.ai.voicelive.models.OutputAudioFormat": "VoiceLive.OutputAudioFormat",
     "azure.ai.voicelive.models.AudioTimestampType": "VoiceLive.AudioTimestampType",
     "azure.ai.voicelive.models.ToolChoiceLiteral": "VoiceLive.ToolChoiceLiteral",
+    "azure.ai.voicelive.models.ReasoningEffort": "VoiceLive.ReasoningEffort",
     "azure.ai.voicelive.models.SessionIncludeOption": "VoiceLive.SessionIncludeOption",
+    "azure.ai.voicelive.models.ClientEventType": "VoiceLive.ClientEventType",
+    "azure.ai.voicelive.models.ItemType": "VoiceLive.ItemType",
+    "azure.ai.voicelive.models.ItemParamStatus": "VoiceLive.ItemParamStatus",
+    "azure.ai.voicelive.models.MessageRole": "VoiceLive.MessageRole",
+    "azure.ai.voicelive.models.ContentPartType": "VoiceLive.ContentPartType",
     "azure.ai.voicelive.models.ResponseStatus": "VoiceLive.ResponseStatus",
-    "azure.ai.voicelive.models.ResponseItemStatus": "VoiceLive.ResponseItemStatus",
     "azure.ai.voicelive.models.RequestImageContentPartDetail": "VoiceLive.RequestImageContentPartDetail",
+    "azure.ai.voicelive.models.ResponseItemStatus": "VoiceLive.ResponseItemStatus",
     "azure.ai.voicelive.models.ServerEventType": "VoiceLive.ServerEventType"
   },
-  "CrossLanguageVersion": "4f7c08a38aa5"
+  "CrossLanguageVersion": "d4391398f022"
 }

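Most `-`/`+` pairs in the `apiview-properties.json` hunks above are reorderings: the fifteen enum entries removed from one position (`ClientEventType`, `Modality`, `ReasoningEffort`, and so on) reappear unchanged elsewhere, so in the hunks shown the net change is ten new keys plus the `CrossLanguageVersion` bump. Comparing the two mappings as dictionaries rather than line by line makes that visible. The sketch below uses small excerpts as sample data; `diff_mapping` is a hypothetical helper, and in practice `old` and `new` would be loaded from the two versions of the file.

```python
def diff_mapping(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Compare two name -> cross-language ID mappings, ignoring entry order."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


# Excerpts of the mapping before and after the commit; entry order differs.
old = {
    "azure.ai.voicelive.models.Modality": "VoiceLive.Modality",
    "azure.ai.voicelive.models.ToolType": "VoiceLive.ToolType",
}
new = {
    "azure.ai.voicelive.models.ToolType": "VoiceLive.ToolType",
    "azure.ai.voicelive.models.EchoCancellationReferenceSource": "VoiceLive.EchoCancellationReferenceSource",
    "azure.ai.voicelive.models.Modality": "VoiceLive.Modality",
}
print(diff_mapping(old, new))
```

The moved `Modality` entry reports as neither added nor removed; only the genuinely new key appears under `added`.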
sdk/voicelive/azure-ai-voicelive/azure/ai/voicelive/_types.py

Lines changed: 4 additions & 2 deletions
@@ -10,6 +10,8 @@

 if TYPE_CHECKING:
     from . import models as _models
-Voice = Union[str, "_models.OpenAIVoiceName", "_models.OpenAIVoice", "_models.AzureVoice"]
-InterimResponseConfig = Union["_models.StaticInterimResponseConfig", "_models.LlmInterimResponseConfig"]
+Voice = Union[
+    str, "_models.OpenAIVoiceName", "_models.OpenAIVoice", "_models.AzureVoice", "_models.AzureRealtimeNativeVoice"
+]
 ToolChoice = Union[str, "_models.ToolChoiceLiteral", "_models.ToolChoiceSelection"]
+InterimResponseConfig = Union["_models.StaticInterimResponseConfig", "_models.LlmInterimResponseConfig"]

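The `_types.py` change widens the `Voice` alias so `voice` fields also accept `AzureRealtimeNativeVoice`. Because the alias names the models as strings (the `models` import sits under `TYPE_CHECKING`), Python's `typing` keeps those entries as forward references and does not need the model classes at import time. The stdlib sketch below reproduces the string-reference part of that pattern with a hypothetical stand-in class, `NativeVoiceStub` (not an SDK model; its `name` field is an assumption), and a hypothetical `describe_voice` helper showing that code consuming the alias must handle both the plain-string and object forms.

```python
from dataclasses import dataclass
from typing import Union, get_args


# Hypothetical stand-in for a voice model object; the real SDK models have
# their own fields and live in azure.ai.voicelive.models.
@dataclass
class NativeVoiceStub:
    name: str


# As in the generated _types.py, the model is referenced by a string, so the
# alias is valid at runtime and type checkers resolve the name later.
Voice = Union[str, "NativeVoiceStub"]


def describe_voice(voice: Voice) -> str:
    """Return a display label for a voice given as a name or as an object."""
    if isinstance(voice, str):
        return voice
    return voice.name


print(get_args(Voice))  # str plus a forward reference to NativeVoiceStub
print(describe_voice("alloy"), describe_voice(NativeVoiceStub("example-voice")))
```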