This document describes the MQTT + UDP hybrid protocol used between the device and the server, based on the current implementation: MQTT carries control messages, UDP carries real-time audio.
The protocol uses two channels:
- MQTT - control messages, state synchronization, JSON payloads.
- UDP - real-time audio, encrypted.
- Dual channel design - control is separated from data so audio has low latency.
- Encrypted transport - UDP audio is encrypted with AES-CTR.
- Sequence numbers - guard against replay and reordering.
- Automatic reconnect - MQTT reconnects on disconnect.
sequenceDiagram
participant Device as ESP32 device
participant MQTT as MQTT broker
participant UDP as UDP server
Note over Device, UDP: 1. Establish MQTT connection
Device->>MQTT: MQTT Connect
MQTT->>Device: Connected
Note over Device, UDP: 2. Request audio channel
Device->>MQTT: Hello message (type: "hello", transport: "udp")
MQTT->>Device: Hello response (UDP endpoint + encryption keys)
Note over Device, UDP: 3. Establish UDP connection
Device->>UDP: UDP Connect
UDP->>Device: Connected
Note over Device, UDP: 4. Audio streaming
loop Audio stream
Device->>UDP: Encrypted audio (Opus)
UDP->>Device: Encrypted audio (Opus)
end
Note over Device, UDP: 5. Control messages
par Control
Device->>MQTT: Listen / TTS / MCP messages
MQTT->>Device: STT / TTS / MCP / Alert responses
end
Note over Device, UDP: 6. Teardown
Device->>MQTT: Goodbye
Device->>UDP: Disconnect
The device connects to the broker using:
- Endpoint - broker host and port.
- Client ID - device identifier.
- Username / Password - credentials.
- Keep Alive - heartbeat interval (default 240 s).
{
"type": "hello",
"version": 3,
"transport": "udp",
"features": {
"mcp": true,
"aec": true
},
"audio_params": {
"format": "opus",
"sample_rate": 16000,
"channels": 1,
"frame_duration": 60
}
}features.mcp is always set; features.aec is set when CONFIG_USE_SERVER_AEC is enabled.
{
"type": "hello",
"transport": "udp",
"session_id": "xxx",
"audio_params": {
"format": "opus",
"sample_rate": 24000,
"channels": 1,
"frame_duration": 60
},
"udp": {
"server": "192.168.1.100",
"port": 8888,
"key": "0123456789ABCDEF0123456789ABCDEF",
"nonce": "0123456789ABCDEF0123456789ABCDEF"
}
}Field reference:
udp.server- UDP server address.udp.port- UDP server port.udp.key- AES key, hex-encoded.udp.nonce- AES nonce, hex-encoded.
-
Listen
{ "session_id": "xxx", "type": "listen", "state": "start", "mode": "manual" } -
Abort
{ "session_id": "xxx", "type": "abort", "reason": "wake_word_detected" } -
MCP
{ "session_id": "xxx", "type": "mcp", "payload": { "jsonrpc": "2.0", "id": 1, "result": {} } } -
Goodbye
{ "session_id": "xxx", "type": "goodbye" }
Semantics match the WebSocket protocol. Supported types:
- STT - speech recognition result.
- TTS - TTS lifecycle (
start,stop,sentence_start). - LLM - emotion update for the UI.
- MCP - IoT control.
- System - system control, e.g.
"command": "reboot". - Alert - show an alert on the UI; fields:
status,message,emotion. - Goodbye - server-initiated shutdown of the audio session. The device responds by closing the UDP channel without sending its own goodbye.
- Custom (optional, enabled via
CONFIG_RECEIVE_CUSTOM_MESSAGE).
Example alert:
{
"session_id": "xxx",
"type": "alert",
"status": "Warning",
"message": "Battery low",
"emotion": "sad"
}After the device receives the MQTT hello response, it:
- Parses the UDP host and port.
- Parses the AES key and nonce.
- Initializes the AES-CTR context.
- Opens the UDP socket.
|type 1B|flags 1B|payload_len 2B|ssrc 4B|timestamp 4B|sequence 4B|
|payload payload_len bytes|
Field reference:
type: packet type, always0x01.flags: flags, currently unused.payload_len: payload length (network byte order).ssrc: synchronization source identifier.timestamp: timestamp (network byte order).sequence: sequence number (network byte order).payload: encrypted Opus audio data.
Uses AES-CTR with:
- Key: 128-bit, provided by the server.
- Nonce: 128-bit, provided by the server.
- Counter: built from the timestamp and sequence number.
- Sender:
local_sequence_is incremented monotonically. - Receiver:
remote_sequence_validates continuity. - Anti-replay: packets with sequence numbers below the expected value are dropped.
- Tolerance: small gaps are logged as warnings but still accepted.
- Decryption failure - log an error and drop the packet.
- Sequence gap - log a warning, continue processing the packet.
- Malformed packet - log an error and drop.
stateDiagram
direction TB
[*] --> Disconnected
Disconnected --> MqttConnecting: StartMqttClient()
MqttConnecting --> MqttConnected: MQTT Connected
MqttConnecting --> Disconnected: Connect failed
MqttConnected --> RequestingChannel: OpenAudioChannel()
RequestingChannel --> ChannelOpened: Hello exchange success
RequestingChannel --> MqttConnected: Hello timeout / failed
ChannelOpened --> UdpConnected: UDP connect success
UdpConnected --> AudioStreaming: Start audio
AudioStreaming --> UdpConnected: Stop audio
UdpConnected --> ChannelOpened: UDP disconnect
ChannelOpened --> MqttConnected: CloseAudioChannel()
MqttConnected --> Disconnected: MQTT disconnect
The device determines whether the audio channel is available with:
bool IsAudioChannelOpened() const {
return udp_ != nullptr && !error_occurred_ && !IsTimeout();
}Read from storage:
endpoint- broker address.client_id- client identifier.username- user name.password- password.keepalive- keep-alive interval (default 240 s).publish_topic- publish topic.
- Format: Opus
- Sample rate: 16 kHz device / 24 kHz server
- Channels: 1 (mono)
- Frame duration: 60 ms
- Automatic retry on connect failure.
- Optional error reporting.
- Clean-up runs on disconnect.
- No automatic retry; depends on re-negotiation via MQTT.
- Status can be queried at any time.
The base Protocol class provides timeout detection:
- Default timeout: 120 s.
- Based on the time since the last incoming packet.
- After a timeout the channel is marked unavailable.
- MQTT: supports TLS/SSL (port 8883).
- UDP: AES-CTR on audio payloads.
- MQTT: user name / password.
- UDP: keys are distributed via the MQTT channel.
- Monotonically increasing sequence numbers.
- Stale packets are dropped.
- Timestamps are validated.
A mutex protects the UDP connection:
std::lock_guard<std::mutex> lock(channel_mutex_);- Network objects are created and destroyed dynamically.
- Audio packets are managed with smart pointers.
- Encryption contexts are released promptly.
- UDP connection reuse.
- Reasonable packet sizes.
- Sequence continuity checks.
| Feature | MQTT + UDP | WebSocket |
|---|---|---|
| Control channel | MQTT | WebSocket |
| Audio channel | UDP (encrypted) | WebSocket (binary) |
| Latency | Low (UDP) | Medium |
| Reliability | Medium | High |
| Complexity | High | Low |
| Encryption | AES-CTR | TLS |
| Firewall friendliness | Low | High |
- Ensure UDP ports are reachable.
- Configure firewall rules accordingly.
- Plan for NAT traversal if needed.
- MQTT broker configuration.
- UDP server deployment.
- Key management.
- Connection success rate.
- Audio transmission latency.
- Packet loss.
- Decryption failures.
The MQTT + UDP hybrid protocol achieves efficient audio communication through:
- Split architecture - separate control and data channels with clear responsibilities.
- Encryption - AES-CTR protects audio payloads.
- Sequence management - prevents replay and reordering.
- Automatic recovery - MQTT reconnects on failure.
- Performance - UDP keeps audio latency low.
The protocol is a good fit for low-latency voice interaction, at the cost of higher network complexity than pure WebSocket.