Add multimodal LLM support as an option in existing Assist voice pipelines #3608

Zaleo80 · 2026-04-28T21:50:02Z

Describe the feature

Add optional multimodal LLM support to the existing Home Assistant Assist voice pipelines.

The goal is not to replace the current pipeline model, but to allow a pipeline to use a model that can handle audio input and/or audio output directly when the selected provider supports it.

Today, Assist pipelines are mainly built around separate stages: speech-to-text, intent/conversation, and text-to-speech. This works well for classic voice assistants, but it limits newer multimodal models that can understand spoken input and generate spoken output directly.
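To make the idea concrete, here is a rough sketch of how a pipeline could decide which classic stages to keep based on what the selected provider can do. Everything in it is hypothetical: `ProviderCapabilities`, `plan_stages`, and the stage names are illustrative only and are not existing Home Assistant APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderCapabilities:
    """Hypothetical capability flags a conversation provider might report."""

    audio_input: bool = False   # model can consume spoken audio directly
    audio_output: bool = False  # model can produce spoken audio directly


def plan_stages(caps: ProviderCapabilities) -> list[str]:
    """Return the stages a pipeline run would need for this provider.

    Assumption: a stage is skipped only when the provider handles that
    modality itself; otherwise the classic stage stays in place.
    """
    stages = []
    if not caps.audio_input:
        stages.append("stt")          # classic speech-to-text still needed
    stages.append("conversation")     # always present: the model/agent itself
    if not caps.audio_output:
        stages.append("tts")          # classic text-to-speech still needed
    return stages


classic = plan_stages(ProviderCapabilities())
audio_in_only = plan_stages(ProviderCapabilities(audio_input=True))
full_multimodal = plan_stages(ProviderCapabilities(audio_input=True, audio_output=True))
```

With these assumptions, a text-only provider keeps all three stages, a provider that only accepts audio input still gets a TTS stage, and a fully multimodal provider runs as a single conversation stage.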

Example commands

For example:

“Turn on sleep mode and respond in a whispering voice.”

In that case, Home Assistant should be able to execute the sleep mode action and let the multimodal model respond in a whispering or calm voice, instead of relying only on a fixed TTS voice configuration.

Another example is multilingual use:

A user speaks Dutch, the next user speaks English, and another user speaks German.

Each response should come in the language that user spoke (or even in the same accent).

Use cases

The important part is that this makes it possible to prompt the voice response itself, so the assistant can adapt how it speaks to the request.

A multimodal-capable pipeline could allow the model to detect the language per request and respond in the same language, without requiring separate static STT/TTS language settings or a different pipeline for every language, as long as the model supports the language.
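A minimal sketch of that per-request language choice, assuming the model reports the language it detected and the pipeline knows which languages the model supports. The function name, its parameters, and the language codes used here are assumptions for illustration, not part of any existing Assist API.

```python
from typing import Optional


def resolve_response_language(
    detected: Optional[str],
    supported: set[str],
    fallback: str,
) -> str:
    """Pick the language for one spoken response.

    detected:  language the model reports for this request (hypothetical field)
    supported: languages the selected model can speak
    fallback:  the pipeline's static language, used when detection fails
               or the detected language is not supported
    """
    if detected is not None and detected in supported:
        return detected
    return fallback


supported_languages = {"nl", "en", "de"}

# Three users in a row, as in the Dutch / English / German example above.
responses = [
    resolve_response_language(lang, supported_languages, fallback="en")
    for lang in ("nl", "en", "de")
]

# An unsupported or undetected language falls back to the static setting.
unsupported = resolve_response_language("fi", supported_languages, fallback="en")
undetected = resolve_response_language(None, supported_languages, fallback="en")
```

The fallback keeps the existing static language setting meaningful: it only applies when the model cannot handle the request's language.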

Anything else?

There are broader discussions about AI-first voice assistants, multimodal live APIs, and achieving ultra-low latency, but this request is intentionally more focused.

This is specifically about adding multimodal support as an option inside existing Assist pipelines, so users can choose between classic, hybrid, and multimodal-capable pipelines.

prashantweb3task-collab · 2026-05-06T05:14:44Z

This would be a really useful improvement for Assist pipelines. The current STT → conversation → TTS flow works well, but adding optional multimodal support would make Home Assistant much more flexible for newer AI voice models.
I especially like the multilingual and dynamic voice response examples. Being able to process spoken input directly and reply in the same language, tone, or style (like a whispering or calm voice) would create a much more natural assistant experience without managing multiple pipelines.
Keeping it optional inside existing pipelines also makes sense, since users could still choose classic pipelines where needed.
