feat: add remote chat transcription mode #62
IgorWarzocha wants to merge 2 commits into peteonrails:main
Conversation
- Enable multimodal chat completions for remote transcription with optional system prompts and a data URI audio fallback.
- Replace the internal chat payload test with a runnable script that exercises the chat completions endpoint using harvard.wav.
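For orientation, the request this mode sends can be sketched roughly as below. This is a hypothetical shape: the `audio_url`/data-URI field layout, the `gemini-2.5-flash` model name, and the file names are assumptions modeled on OpenAI-style multimodal chat APIs, not the PR's exact payload; the smoketest script is the authoritative reference.

```shell
#!/bin/sh
# Stand-in audio bytes so the sketch runs without harvard.wav;
# in real use, base64-encode an actual recording instead.
printf 'RIFF' > sample.wav
AUDIO_B64=$(base64 < sample.wav | tr -d '\n')

# Build an OpenAI-style chat completions payload with the audio
# embedded as a data URI (field names are assumptions, see above).
cat > payload.json <<EOF
{
  "model": "gemini-2.5-flash",
  "messages": [
    {"role": "system", "content": "Transcribe the audio verbatim. Return only the transcript."},
    {"role": "user", "content": [
      {"type": "text", "text": "Transcribe this recording."},
      {"type": "audio_url", "audio_url": {"url": "data:audio/wav;base64,${AUDIO_B64}"}}
    ]}
  ]
}
EOF

# Then POST it to an OpenAI-compatible endpoint, e.g.:
#   curl -sS -H "Authorization: Bearer $API_KEY" \
#        -H "Content-Type: application/json" \
#        -d @payload.json "$BASE_URL/v1/chat/completions"

# Show the embedded data URI that ended up in the payload.
grep -o 'data:audio/wav;base64,[A-Za-z0-9+/=]*' payload.json
```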
Thanks for putting this together, Igor. I appreciate the effort, especially the clean implementation and the smoketest script. I've been thinking through the use cases here, and I want to talk through something before we decide how to proceed. There's significant overlap between what chat mode offers and what voxtype's existing `post_process` hook already covers. For the pirate-speak example in your smoketest, users can do this today:

```toml
[output.post_process]
command = "ollama run llama3.2:1b 'Translate this to pirate-speech. Return ONLY the translated text:'"
```

Same for text cleanup, summarization, or extracting structured data: the pipeline becomes Whisper for transcription, then the `post_process` command for rewriting. Have a look at the Swedish Chef script in the examples:

```toml
[output.post_process]
command = "voxtype/examples/swedish-chef.sh"
timeout_ms = 1000
```

The existing approach has some advantages:

- Whisper is purpose-built for STT.
- Text-only LLM calls are faster and cheaper than multimodal ones.
- `post_process` works with any tool, not just OpenAI-compatible APIs.

That said, I don't want to dismiss this if there are use cases I'm not seeing. Is there something about sending raw audio to a multimodal LLM that `post_process` can't address? I can think of a few possibilities:
Were any of these part of your thinking, or is there another scenario you had in mind? Happy to keep discussing before we make a call on this.
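For concreteness, a `post_process` command is just an executable sitting in the transcription pipeline; I'm assuming here that it reads the transcript on stdin and prints the replacement text on stdout (check the docs for the exact contract). A minimal stand-in for `swedish-chef.sh`, with the LLM call swapped for a deterministic `sed` so the sketch runs offline:

```shell
#!/bin/sh
# Hypothetical post_process filter: transcript in on stdin, rewritten
# text out on stdout. A real script would pipe the text to an LLM, e.g.
#   ollama run llama3.2:1b "Rewrite this as the Swedish Chef: $(cat)"
# Here a sed substitution stands in so the example is self-contained.
swedish_chef() {
  sed -e 's/the/ze/g' -e 's/The/Ze/g'
}

echo "The chef stirs the soup" | swedish_chef
```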
@peteonrails Thanks for taking the time to reply, Pete. As I said, this is a very specific scenario I was working on for myself, and my bot went mad and submitted it upstream! :) Definitely not for everyone, but very powerful for people who are willing to configure it.

Long story short: I was trying to find a use for the 1,500 requests/day of Gemini 2.5 Flash. It can take audio, so I hooked it up via Voxtype, since I'm on Omarchy anyway and I like native tools. The difference is that if someone hooks this up to a model that can take audio directly, they don't need Whisper plus an LLM for post-processing. So even though you're using a cloud service, it's generally faster and doesn't use your local resources.

I understand that if someone wants privacy, they won't go "to the cloud". But if someone wants integrated audio transcription and would rather pay for the API directly than use an additional app, or than call an LLM endpoint twice (once for transcription, once for rewrite), sending a WAV file directly doesn't seem like a bad idea anymore. You basically brought up the same scenarios I thought about, especially given that I'm a bit bilingual at this point. To expand:
That setup is almost a competitor for commercial apps. That being said, I won't take any offense if you just park it indefinitely, hah.

PS: I also think it's been a while since we've had a new Whisper model, so supporting alternative voice-recognition solutions is never a bad idea!
Ugh. The model decided to push it here despite me being explicit about posting it to my origin. Anyway, have a look if this is of interest.