Add multimodal LLM support as an option in existing Assist voice pipelines #3608
Unanswered
Zaleo80
asked this question in
Voice assistants
Replies: 1 comment
This would be a really useful improvement for Assist pipelines. The current STT → conversation → TTS flow works well, but adding optional multimodal support would make Home Assistant much more flexible for newer AI voice models.
Describe the feature
Add optional multimodal LLM support to the existing Home Assistant Assist voice pipelines.
The goal is not to replace the current pipeline model, but to allow a pipeline to use a model that can handle audio input and/or audio output directly when the selected provider supports it.
Today, Assist pipelines are mainly built around separate stages: speech-to-text, intent/conversation, and text-to-speech. This works well for classic voice assistants, but it limits newer multimodal models that can understand spoken input and generate spoken output directly.
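To make the idea concrete, here is a minimal sketch of how a pipeline could decide which stages to run based on what the selected provider supports. Everything in it is hypothetical: `ProviderCapabilities`, `plan_stages`, and `pipeline_mode` are illustrative names, not part of Home Assistant's existing API, and the stage labels are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderCapabilities:
    """Hypothetical capability flags a conversation provider might report."""

    audio_input: bool = False   # model can consume raw speech directly
    audio_output: bool = False  # model can produce spoken audio directly


def plan_stages(caps: ProviderCapabilities) -> list[str]:
    """Return the stages a pipeline run would need for this provider.

    A stage is skipped only when the provider can handle that part itself.
    """
    stages = []
    if not caps.audio_input:
        stages.append("stt")
    stages.append("conversation")
    if not caps.audio_output:
        stages.append("tts")
    return stages


def pipeline_mode(caps: ProviderCapabilities) -> str:
    """Name the resulting pipeline flavor (assumed labels)."""
    if caps.audio_input and caps.audio_output:
        return "multimodal"
    if caps.audio_input or caps.audio_output:
        return "hybrid"
    return "classic"
```

With no capabilities set, the plan is the current flow (`stt`, `conversation`, `tts`), so existing pipelines would be unaffected. A provider that only accepts audio input would skip `stt` but still use the configured TTS voice.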
Example commands
For example:
In that case, Home Assistant should be able to execute the sleep mode action and let the multimodal model respond in a whispering or calm voice, instead of relying only on a fixed TTS voice configuration.
Another example is multilingual use:
The response comes back in the same language as the request (or even with the same accent).
Use cases
The important part is that this makes it possible to prompt the voice response itself, so the assistant can adapt how it speaks to the request.
A multimodal-capable pipeline could allow the model to detect the language of each request and respond in that language. As long as the model supports the language, this would not require separate static STT/TTS language settings or a different pipeline for every language.
Anything else?
There are broader discussions about AI-first voice assistants, multimodal live APIs, and achieving ultra-low latency, but this request is intentionally more focused.
This is specifically about adding multimodal support as an option inside existing Assist pipelines, so users can choose between classic, hybrid, and multimodal-capable pipelines.