
feat: add remote chat transcription mode #62

Open
IgorWarzocha wants to merge 2 commits into peteonrails:main from IgorWarzocha:feat/remote-chat-mode

Conversation


IgorWarzocha (Contributor) commented Jan 10, 2026

Ugh. The model decided to push it here despite me being explicit about posting it to my origin. Anyway. Have a look if this is of interest.

Summary

  • add remote chat transcription mode with system prompts and audio data URI fallback
  • reuse existing WAV encoding to send audio in multimodal chat payloads
  • add a chat payload test using harvard.wav

Enable multimodal chat completions for remote transcription with optional system prompts and a data URI audio fallback.
Replace the internal chat payload test with a runnable script that exercises the chat completions endpoint using harvard.wav.
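
As a rough illustration of what such a chat payload might look like, here is a sketch that assembles a chat-completions request with the audio embedded as a base64 data URI. The field names (`audio_url`), the model id, and the placeholder file are assumptions for illustration only; the PR's actual schema and the real harvard.wav may differ.

```shell
# Hypothetical sketch: embed WAV audio as a data URI inside a
# multimodal chat-completions payload. Schema is assumed, not the PR's.
printf 'RIFF....WAVE' > sample.wav          # stand-in for harvard.wav
b64=$(base64 < sample.wav | tr -d '\n')     # one-line base64 of the audio
cat > payload.json <<EOF
{
  "model": "example-multimodal-model",
  "messages": [
    {"role": "system", "content": "Transcribe the audio verbatim."},
    {"role": "user", "content": [
      {"type": "text", "text": "Transcribe this recording."},
      {"type": "audio_url", "audio_url": {"url": "data:audio/wav;base64,${b64}"}}
    ]}
  ]
}
EOF
grep -o 'data:audio/wav;base64' payload.json
```

The payload could then be POSTed to any OpenAI-compatible chat completions endpoint; the data-URI form is the fallback for servers that do not accept a dedicated audio part.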

peteonrails commented Jan 13, 2026

Thanks for putting this together, Igor. I appreciate the effort, especially the clean implementation and the smoketest script.

I've been thinking through the use cases here, and I want to talk through something before we decide how to proceed. There's significant overlap between what chat mode offers and what voxtype's existing post_process hook already supports.

For the pirate-speak example in your smoketest, users can do this today:

[output.post_process]
command = "ollama run llama3.2:1b 'Translate this to pirate-speech. Return ONLY the translated text:'"

Same for text cleanup, summarization, or extracting structured data. The pipeline becomes:

Audio → Whisper → Text → post_process → LLM → Output
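
That stage can be sketched in isolation. Assuming, as the config examples suggest, that voxtype pipes the transcript to the command's stdin and takes its stdout as the final output, a post_process command is just a text filter:

```shell
# Assumed contract: transcript arrives on stdin, processed text
# leaves on stdout. Any filter (sed, an LLM CLI, a script) fits here.
transcript="hello there friend"              # stand-in for Whisper output
processed=$(printf '%s\n' "$transcript" | sed -e 's/hello/ahoy/' -e 's/friend/matey/')
echo "$processed"                            # prints: ahoy there matey
```

Anything that honors this stdin/stdout contract slots into the pipeline unchanged, which is what makes the hook tool-agnostic.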

Have a look at the Swedish Chef script in examples/swedish-chef.sh that does dialect transformation with pure sed, no LLM needed:

[output.post_process]
command = "voxtype/examples/swedish-chef.sh"
timeout_ms = 1000
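
The real rules live in examples/swedish-chef.sh; purely for illustration, here is a made-up miniature of the same idea:

```shell
# Illustrative only: a couple of mock-Swedish substitutions plus a
# sign-off, applied to a sample line. Not the real script's rules.
echo "the chicken is in the oven" |
  sed -e 's/the/zee/g' -e 's/o/u/g' -e 's/$/ Bork Bork Bork!/'
# prints: zee chicken is in zee uven Bork Bork Bork!
```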

The existing approach has some advantages: Whisper is purpose-built for STT, text-only LLM calls are faster and cheaper than multimodal, and post_process works with any tool (not just OpenAI-compatible APIs).

That said, I don't want to dismiss this if there are use cases I'm not seeing. Is there something about sending raw audio to a multimodal LLM that post_process can't address? I can think of a few possibilities:

  • Picking up tonal nuance that gets lost in transcription
  • Working around Whisper transcription errors by letting the LLM hear the original
  • Languages or accents where multimodal models outperform Whisper

Were any of these part of your thinking, or is there another scenario you had in mind? Happy to keep discussing before we make a call on this.


IgorWarzocha commented Jan 13, 2026

@peteonrails Thanks for taking the time to reply, Pete. As I said, this is a very specific scenario I was working on for myself, and my bot went mad and submitted it upstream! :) Definitely not for everyone, but very powerful for people who are willing to configure it.

Long story short: I was trying to find a use for 1500/day Gemini 2.5 Flash requests. It can take audio, so I hooked it up via Voxtype, since I'm on Omarchy anyway and I like native tools. The difference is that if someone hooks it up to a model that can take audio directly, they don't need Whisper plus an LLM for post-processing; so even though you're using a cloud service, it's generally faster and doesn't use your local resources.

I understand that if someone wants privacy, they won't go to the cloud. But if someone wants integrated audio transcription and would rather pay for the API directly than use an additional app, or than call an LLM endpoint twice (once for transcription, once for rewrite), then sending a WAV file directly doesn't seem like a bad idea anymore.

You basically brought up the same scenarios I thought about, especially given I'm a bit bye-lingual at this point. To expand:

  1. A big cloud model will always be better at translation/rewrites. That being said, a big, unquantised version of Whisper might be better at just transcription - unsure about that.
  2. I found that it's easier for the model to understand what is going on when you just send it audio. I can, for example, just say something along the lines of "(...) ugh, I'm missing a word - put the smart word for X (...)" or mix languages, and the model will understand what to do with it.
  3. Very nerdy and very power-user, but we're talking Omarchy. I'll just show you the benchmarks; it's up to you if you want to dig deeper into the config. TL;DR: I've got a setup that uses Gemini 2.5 Flash directly to perform more advanced dictation, like converting ramblings into coding prompts or corporate LinkedIn speech. I tested it and made it react to "natural language corrections". The test here was done via text, as part of prompt-engineering my setup, but it works the same with audio. I don't believe these complex setups are possible with separate transcription and text models: https://github.com/IgorWarzocha/voxtype-beast-mode/blob/master/benchmarks/ultimate_benchmark_gemini_flash_results.md

That setup is almost like a competitor for commercial apps. That being said, I will not take any offense if you just park it indefinitely, hah.

PS. I also think it's been a while since we've had a new Whisper model, so supporting alternative voice-recognition solutions is never a bad idea!

