Conversation

@richiejp
Collaborator

@richiejp richiejp commented Sep 10, 2025

Description

Add enough realtime API features to allow talking with an LLM using only audio.

Presently the realtime API only supports transcription, which is a minor use case for it. This PR should allow it to be used for a basic voice assistant.

This PR will ignore many of the options and edge cases. Instead it will, for example, rely on server-side VAD to commit conversation items.
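
As a rough illustration of what "rely on server-side VAD to commit conversation items" can mean, here is a minimal Go sketch. It is not the code in this PR: `Frame`, `TurnDetector`, and `SilenceFrames` are hypothetical names, and the speech/no-speech decision is passed in directly rather than coming from a real VAD model.

```go
package main

import "fmt"

// Frame is one chunk of input audio plus a speech/no-speech decision.
// In a real server the decision would come from a VAD model; here it is
// supplied directly so the sketch stays self-contained (assumption).
type Frame struct {
	Samples []int16
	Speech  bool
}

// TurnDetector buffers speech frames and reports a committed turn once
// enough consecutive silent frames follow speech.
type TurnDetector struct {
	SilenceFrames int // hypothetical setting: silent frames that end a turn
	buf           []int16
	silent        int
	speaking      bool
}

// Push feeds one frame. It returns the buffered audio and true when the
// turn should be committed as a conversation item.
func (d *TurnDetector) Push(f Frame) ([]int16, bool) {
	if f.Speech {
		d.speaking = true
		d.silent = 0
		d.buf = append(d.buf, f.Samples...)
		return nil, false
	}
	if !d.speaking {
		return nil, false // leading silence: nothing to commit
	}
	d.silent++
	if d.silent < d.SilenceFrames {
		return nil, false
	}
	committed := d.buf
	d.buf, d.silent, d.speaking = nil, 0, false
	return committed, true
}

func main() {
	d := &TurnDetector{SilenceFrames: 2}
	frames := []Frame{
		{Speech: false},
		{Samples: []int16{1, 2}, Speech: true},
		{Samples: []int16{3}, Speech: true},
		{Speech: false},
		{Speech: false}, // second silent frame ends the turn
	}
	for i, f := range frames {
		if audio, ok := d.Push(f); ok {
			fmt.Printf("frame %d: commit %d samples\n", i, len(audio))
		}
	}
}
```

The point of the design is that the client never has to send an explicit commit: the server decides when a turn has ended and creates the conversation item itself.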

Notes for Reviewers

  • Configure a model pipeline or use a multi-modal model.
  • Commit client audio to the conversation
  • Generate a text response (optional)
  • Generate an audio response
  • Interrupt generation on voice detection?
  • Implement message item retrieval so client can get the audio
  • Allow the user to configure a composite (pipeline) model, or can we use existing options, e.g. for selecting voice style?
  • Test and fix bugs in new code
  • Test for regressions in transcription mode

Fixes: #3714 (but we'll need follow-up issues)

Signed commits

  • Yes, I signed my commits.

@netlify

netlify bot commented Sep 10, 2025

Deploy Preview for localai ready!

Name Link
🔨 Latest commit 37606d4
🔍 Latest deploy log https://app.netlify.com/projects/localai/deploys/69611b60e2a0c7000827439a
😎 Deploy Preview https://deploy-preview-6245--localai.netlify.app

@mudler mudler added the roadmap label Sep 11, 2025
@richiejp
Collaborator Author

It's not clear to me if we have audio support in llama.cpp: ggml-org/llama.cpp#15194

@richiejp
Collaborator Author

ggml-org/llama.cpp#13759

@richiejp
Collaborator Author

ggml-org/llama.cpp#13784

@mudler
Owner

mudler commented Sep 21, 2025

My initial thought on this was to use the whisper backend to transcribe the audio coming out of VAD and give the text to a text-to-text backend; this way we can always come back to this later. There was also an interface created exactly for this, so a pipeline can be seen as a kind of "drag and drop" until omni models are really capable.

However, yes, audio input is actually supported by llama.cpp and our backends. Try qwen2-omni: you will be able to give it audio as input, but it isn't super accurate (transcribing works better for now).
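
The whisper-plus-text-backend pipeline described above can be sketched in Go roughly as follows. This is an illustration under assumptions, not LocalAI's actual interface: `Transcriber`, `TextGenerator`, `Synthesizer`, and `Pipeline` are hypothetical names, the text-to-speech stage is taken from the PR description's "Generate an audio response" item, and the stub stages only exist so the example runs.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical stage interfaces; these are not LocalAI's actual types.
type Transcriber interface {
	Transcribe(audio []int16) (string, error)
}

type TextGenerator interface {
	Generate(prompt string) (string, error)
}

type Synthesizer interface {
	Synthesize(text string) ([]int16, error)
}

// Pipeline chains speech-to-text, text-to-text, and text-to-speech.
// Any stage can be swapped without touching the others.
type Pipeline struct {
	STT Transcriber
	LLM TextGenerator
	TTS Synthesizer
}

// Respond turns one committed audio turn into an audio reply.
func (p Pipeline) Respond(audio []int16) ([]int16, error) {
	text, err := p.STT.Transcribe(audio)
	if err != nil {
		return nil, fmt.Errorf("transcribe: %w", err)
	}
	reply, err := p.LLM.Generate(text)
	if err != nil {
		return nil, fmt.Errorf("generate: %w", err)
	}
	out, err := p.TTS.Synthesize(reply)
	if err != nil {
		return nil, fmt.Errorf("synthesize: %w", err)
	}
	return out, nil
}

// Stub stages so the sketch runs without any model.
type fakeSTT struct{}

func (fakeSTT) Transcribe(audio []int16) (string, error) {
	return fmt.Sprintf("%d samples", len(audio)), nil
}

type fakeLLM struct{}

func (fakeLLM) Generate(prompt string) (string, error) {
	return strings.ToUpper(prompt), nil
}

type fakeTTS struct{}

func (fakeTTS) Synthesize(text string) ([]int16, error) {
	return make([]int16, len(text)), nil // one silent sample per character
}

func main() {
	p := Pipeline{STT: fakeSTT{}, LLM: fakeLLM{}, TTS: fakeTTS{}}
	out, err := p.Respond([]int16{1, 2, 3})
	if err != nil {
		panic(err)
	}
	fmt.Println(len(out), "output samples")
}
```

Keeping each stage behind its own interface is what makes the "drag and drop" idea work: an omni model could later replace several stages by implementing the same interfaces.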

@richiejp
Collaborator Author

OK, I tried Qwen 2 omni and had issues with accuracy and context length, which aren't a problem for a pipeline.

@richiejp
Collaborator Author

richiejp commented Jan 1, 2026

#7812

@richiejp
Collaborator Author

richiejp commented Jan 1, 2026

OpenAI made quite a few changes to the API, and it might have been better to handle them before this PR; there are also changes in flight to the Go realtime API library, AFAICT. I really want to get something working, so I am ignoring these changes for now and will address them afterwards.

@richiejp richiejp force-pushed the feat/realtime-audio-conv branch 2 times, most recently from 2271f01 to 915824d, January 7, 2026 13:50
@richiejp
Collaborator Author

richiejp commented Jan 7, 2026

And it works. There is a long list of issues; however, I have the full pipeline working.

@richiejp
Collaborator Author

richiejp commented Jan 7, 2026

To be clear, probably nobody will want to use this in its current state, but we could merge it for my own experimentation and so that I don't have to keep rebasing on master. Next, I need to update the API to the current OpenAI GA. @mudler

@richiejp richiejp marked this pull request as ready for review January 7, 2026 13:58
@richiejp richiejp force-pushed the feat/realtime-audio-conv branch from 915824d to 91c4e02, January 7, 2026 14:07
@richiejp richiejp enabled auto-merge (squash) January 7, 2026 14:07
@richiejp
Collaborator Author

richiejp commented Jan 7, 2026

Build error: "E: Failed to fetch http://security.ubuntu.com/ubuntu/pool/main/libc/libcaca/libcaca0_0.99.beta20-4ubuntu0.1_amd64.deb 404 Not Found [IP: 91.189.91.83 80]". Strange, I can download this file.

@richiejp richiejp force-pushed the feat/realtime-audio-conv branch from 91c4e02 to 37606d4, January 9, 2026 15:14

Development

Successfully merging this pull request may close these issues.

Add support for realtime API