Using ANY compatible endpoint for transcription, not just whisper-likes #60

IgorWarzocha · 2026-01-10T11:28:24Z

I've been thinking... I am already wiring up gemini flash to rewrite/de-ramble my dictation.

I'm using a proxy that lets me use the CLI auth as a standard OpenAI-compatible endpoint. In theory, it's got a multimodal input that accepts audio. Would you be open to accepting a draft PR that could enable this natively?

My use case is obviously a bit of a hack, but in theory, it would enable using it with any models that accept audio. Send the recording snippet, add a system prompt to rewrite, boom.

"I have developed a script that leverages the Gemini 2.5 Flash model to generate various types of transcribed prompts. For instance, one mode is designed for business communication, translating informal input into professional speech. A similar application is available for coding-related prompts."

"Utilize the Gemini API to create a developer-friendly dictation tool. - Integrate with the Gemini API. - Send audio files to the API. - Incorporate a system prompt. - Develop a tool for developer-friendly dictation and rewrites."

But this is text for now. I could be just sending it audio, theoretically! :)

IgorWarzocha (Author) · 2026-01-10T12:42:44Z

What The Feature Does

Adds a remote “chat” mode that sends recorded audio to any OpenAI‑compatible chat endpoint that accepts multimodal audio content, and returns the model’s response as dictation output.
Supports a system prompt so the output can be transformed (translate, summarize, re‑style, etc.) instead of just raw transcription.
This uses the standard OpenAI chat schema with multimodal blocks.
Any endpoint that accepts audio directly can use input_audio.
The data‑URI image_url fallback works with proxies or providers that treat it as a generic multimodal container.

Verified to be working. Outputs amazing pirate speech by using a proper system prompt.

How It Works (Key Pieces)

WAV → base64

```rust
let audio_base64 = base64::engine::general_purpose::STANDARD.encode(wav_data);
```
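
To make the encoding step concrete, here is a small standard-library sketch of what standard base64 produces. It is not the `base64` crate and is not meant to replace it; `encode_base64` and `encoded_len` are hypothetical names used only for illustration.

```rust
/// Standard base64 alphabet from RFC 4648.
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Illustrative encoder: every 3 input bytes become 4 ASCII characters,
/// and a short final chunk is padded with '='.
fn encode_base64(input: &[u8]) -> String {
    let mut out = String::with_capacity(encoded_len(input.len()));
    for chunk in input.chunks(3) {
        // Pack up to three bytes into one 24-bit group (missing bytes count as 0).
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;

        // Split the group into four 6-bit indexes into the alphabet.
        out.push(ALPHABET[((n >> 18) & 63) as usize] as char);
        out.push(ALPHABET[((n >> 12) & 63) as usize] as char);
        out.push(if chunk.len() > 1 {
            ALPHABET[((n >> 6) & 63) as usize] as char
        } else {
            '='
        });
        out.push(if chunk.len() > 2 {
            ALPHABET[(n & 63) as usize] as char
        } else {
            '='
        });
    }
    out
}

/// Length of the encoded text: each started group of 3 bytes costs 4 characters.
fn encoded_len(raw_len: usize) -> usize {
    (raw_len + 2) / 3 * 4
}
```

Two practical consequences: the request body is roughly a third larger than the recording, and because a WAV file starts with the bytes `RIFF`, its base64 form always starts with `UklGR`, which makes for a cheap sanity check when debugging a payload.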
Build multimodal user message (text + audio)

```rust
let content = if use_data_uri {
    json!([
        { "type": "text", "text": "Process this audio." },
        { "type": "image_url", "image_url": {
            "url": format!("data:audio/wav;base64,{}", audio_base64)
        }}
    ])
} else {
    json!([
        { "type": "text", "text": "Process this audio." },
        { "type": "input_audio", "input_audio": {
            "data": audio_base64, "format": "wav"
        }}
    ])
};
```
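
One way the `use_data_uri` flag in the snippet above could be driven from configuration is sketched below. The `AudioBlockStyle` type and the accepted strings (`input_audio`, `data_uri`) are assumptions made for illustration, not an existing option of the project.

```rust
/// How the recorded audio is attached to the user message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AudioBlockStyle {
    /// An `input_audio` block, for endpoints that accept audio directly.
    InputAudio,
    /// An `image_url` block carrying a `data:audio/wav;base64,` URI.
    DataUri,
}

impl std::str::FromStr for AudioBlockStyle {
    type Err = String;

    // The accepted spellings are hypothetical config values.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "input_audio" => Ok(Self::InputAudio),
            "data_uri" => Ok(Self::DataUri),
            other => Err(format!("unknown audio block style: {:?}", other)),
        }
    }
}

impl AudioBlockStyle {
    /// Value to use for the `use_data_uri` flag when building the message.
    fn use_data_uri(self) -> bool {
        matches!(self, Self::DataUri)
    }
}
```

Parsing into an enum rather than a bare `bool` keeps room for further container styles if another provider turns out to need one, and it rejects typos in the config instead of silently falling back.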

Assemble messages with optional system prompt

```rust
if let Some(prompt) = system_prompt {
    messages.push(json!({ "role": "system", "content": prompt }));
}
messages.push(json!({ "role": "user", "content": content }));
```

Send /v1/chat/completions and extract output

```rust
let response: Value = ureq::post(&url)
    .set("Content-Type", "application/json")
    .send_json(payload)?
    .into_json()?;
let text = response["choices"][0]["message"]["content"]
    .as_str()
    .unwrap_or("")
    .trim()
    .to_string();
```
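
The snippet above takes `url` as given. Since the point is to work with arbitrary compatible endpoints and proxies, a small helper that normalizes whatever base the user configured may be useful. This is only a sketch: `chat_completions_url` is a hypothetical name, and it assumes the endpoint serves the standard `/v1/chat/completions` path unless a full URL is supplied.

```rust
/// Build the chat completions URL from a user-supplied endpoint.
/// Accepts a bare host, a base ending in `/v1`, or an already complete URL.
fn chat_completions_url(base: &str) -> String {
    // Ignore surrounding whitespace and any trailing slashes.
    let base = base.trim().trim_end_matches('/');
    if base.ends_with("/chat/completions") {
        // Full URL given: pass it through unchanged.
        base.to_string()
    } else if base.ends_with("/v1") {
        format!("{}/chat/completions", base)
    } else {
        format!("{}/v1/chat/completions", base)
    }
}
```

For example, `chat_completions_url("http://localhost:8080/v1/")` yields `http://localhost:8080/v1/chat/completions`. A provider with a different path prefix would need the full URL passed in, which the first branch leaves untouched.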

peteonrails (Maintainer) · 2026-01-28T15:50:57Z

@IgorWarzocha I'll be circling back to your Pull Request on this soon!
