
feat: add remote chat transcription mode #62

Open
IgorWarzocha wants to merge 2 commits into peteonrails:main from IgorWarzocha:feat/remote-chat-mode

Conversation


IgorWarzocha (Contributor) commented Jan 10, 2026

Ugh. The model decided to push it here despite me being explicit about posting it to my origin. Anyway. Have a look if this is of interest.

Summary

  • add remote chat transcription mode with system prompts and audio data URI fallback
  • reuse existing WAV encoding to send audio in multimodal chat payloads
  • add a chat payload test using harvard.wav

Enable multimodal chat completions for remote transcription with optional system prompts and a data URI audio fallback.
Replace the internal chat payload test with a runnable script that exercises the chat completions endpoint using harvard.wav.
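
As a rough illustration of what such a chat payload might look like, here is a sketch that assembles a chat-completions request with the audio embedded as a base64 data URI. The field names (`audio_url`), the model id, and the placeholder file are assumptions for illustration only; the PR's actual schema and the real harvard.wav may differ.

```shell
# Hypothetical sketch: embed WAV audio as a data URI inside a
# multimodal chat-completions payload. Schema is assumed, not the PR's.
printf 'RIFF....WAVE' > sample.wav          # stand-in for harvard.wav
b64=$(base64 < sample.wav | tr -d '\n')     # one-line base64 of the audio
cat > payload.json <<EOF
{
  "model": "example-multimodal-model",
  "messages": [
    {"role": "system", "content": "Transcribe the audio verbatim."},
    {"role": "user", "content": [
      {"type": "text", "text": "Transcribe this recording."},
      {"type": "audio_url", "audio_url": {"url": "data:audio/wav;base64,${b64}"}}
    ]}
  ]
}
EOF
grep -o 'data:audio/wav;base64' payload.json
```

The payload could then be POSTed to any OpenAI-compatible chat completions endpoint; the data-URI form is the fallback for servers that do not accept a dedicated audio part.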

peteonrails commented Jan 13, 2026

Thanks for putting this together, Igor. I appreciate the effort, especially the clean implementation and the smoketest script.

I've been thinking through the use cases here, and I want to talk through something before we decide how to proceed. There's significant overlap between what chat mode offers and what voxtype's existing post_process hook already supports.

For the pirate-speak example in your smoketest, users can do this today:

[output.post_process]
command = "ollama run llama3.2:1b 'Translate this to pirate-speech. Return ONLY the translated text:'"

Same for text cleanup, summarization, or extracting structured data. The pipeline becomes:

Audio → Whisper → Text → post_process → LLM → Output
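
That stage can be sketched in isolation. Assuming, as the config examples suggest, that voxtype pipes the transcript to the command's stdin and takes its stdout as the final output, a post_process command is just a text filter:

```shell
# Assumed contract: transcript arrives on stdin, processed text
# leaves on stdout. Any filter (sed, an LLM CLI, a script) fits here.
transcript="hello there friend"              # stand-in for Whisper output
processed=$(printf '%s\n' "$transcript" | sed -e 's/hello/ahoy/' -e 's/friend/matey/')
echo "$processed"                            # prints: ahoy there matey
```

Anything that honors this stdin/stdout contract slots into the pipeline unchanged, which is what makes the hook tool-agnostic.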

Have a look at the Swedish Chef script in examples/swedish-chef.sh that does dialect transformation with pure sed, no LLM needed:

[output.post_process]
command = "voxtype/examples/swedish-chef.sh"
timeout_ms = 1000
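
The real rules live in examples/swedish-chef.sh; purely for illustration, here is a made-up miniature of the same idea:

```shell
# Illustrative only: a couple of mock-Swedish substitutions plus a
# sign-off, applied to a sample line. Not the real script's rules.
echo "the chicken is in the oven" |
  sed -e 's/the/zee/g' -e 's/o/u/g' -e 's/$/ Bork Bork Bork!/'
# prints: zee chicken is in zee uven Bork Bork Bork!
```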

The existing approach has some advantages: Whisper is purpose-built for STT, text-only LLM calls are faster and cheaper than multimodal, and post_process works with any tool (not just OpenAI-compatible APIs).

That said, I don't want to dismiss this if there are use cases I'm not seeing. Is there something about sending raw audio to a multimodal LLM that post_process can't address? I can think of a few possibilities:

  • Picking up tonal nuance that gets lost in transcription
  • Working around Whisper transcription errors by letting the LLM hear the original
  • Languages or accents where multimodal models outperform Whisper

Were any of these part of your thinking, or is there another scenario you had in mind? Happy to keep discussing before we make a call on this.


IgorWarzocha commented Jan 13, 2026

@peteonrails Thanks for taking the time to reply, Pete. As I said, this is a very specific scenario I was working on for myself, and my bot went mad and submitted it upstream! :) Definitely not for everyone, but very powerful for people who are willing to configure it.

Long story short: I was trying to find a use for 1500/day Gemini 2.5 Flash requests. It can take audio, so I hooked it up via Voxtype, since I'm on Omarchy anyway and I like native tools. The difference is that if someone hooks it up to a model that can take audio directly, they don't need Whisper plus an LLM for post-processing; so even though you're using a cloud service, it's generally faster and doesn't use your local resources.

I understand that if someone wants privacy, they won't go to the cloud. But if someone wants integrated audio transcription and would rather pay for the API directly than use an additional app, or than call an LLM endpoint twice (once for transcription, once for rewrite), then sending a WAV file directly doesn't seem like a bad idea anymore.

You basically brought up the same scenarios I thought about, especially given I'm a bit bye-lingual at this point. To expand:

  1. A big cloud model will always be better at translation/rewrites. That being said, a big, unquantised version of Whisper might be better at just transcription - unsure about that.
  2. I found that it's easier for the model to understand what is going on when you just send it audio. I can, for example, just say something along the lines of "(...) ugh, I'm missing a word - put the smart word for X (...)" or mix languages, and the model will understand what to do with it.
  3. Very nerdy and very power-user, but we're talking Omarchy. I'll just show you the benchmarks; it's up to you if you want to dig deeper into the config. TL;DR: I've got a setup that uses Gemini 2.5 Flash directly to perform more advanced dictation, like converting ramblings into coding prompts or corporate LinkedIn speech. I tested it and made it react to "natural language corrections". The test here was done via text, as part of prompt-engineering my setup, but it works the same with audio. I don't believe these complex setups are possible with separate transcription and text models: https://github.com/IgorWarzocha/voxtype-beast-mode/blob/master/benchmarks/ultimate_benchmark_gemini_flash_results.md

That setup is almost like a competitor for commercial apps. That being said, I will not take any offense if you just park it indefinitely, hah.

PS. I also think it's been a while since we've had a new Whisper model, so supporting alternative voice-recognition solutions is never a bad idea!

