Fix #1833: Improve transcription response format compatibility#2088
Fix #1833: Improve transcription response format compatibility#2088anditherobot wants to merge 5 commits into
Conversation
Make transcription responses closer to the OpenAI spec: errors stay JSON, json and verbose_json stay JSON, and text, srt, and vtt now return raw text bodies. Update docs and add response_format test coverage.
fl0rianr
left a comment
There was a problem hiding this comment.
Thanks, good work! The transcription behavior looks right.
But there is a blocker regarding the scope of the forward_multipart_request() change. This helper is also used by SD image edits/variations, where successful responses are expected to be JSON. With this PR, any successful non-JSON response from those backends would become {"text": ...} and be returned to clients as HTTP 200 instead of surfacing as an invalid backend response. Could we make the non-JSON-success fallback opt-in and enable it only for transcription requests with response_format set to text, srt, or vtt?
I’d also add a negative test for an unsupported response_format value so the new documented supported-values list is enforced deterministically.
|
@claude review |
|
Claude finished @ramkrishna2910's task in 2m 9s —— View job Code Review: Transcription response formats (#2088)
Nice fix — the user-facing behavior ( 🔴 Blocker: the
|
| Caller | Endpoint | Expected 200 body |
|---|---|---|
whisper_server.cpp |
— | does not use this helper (see below) |
fastflowlm_server.cpp:398 |
/v1/audio/transcriptions |
text or JSON ✅ intended |
sd_server.cpp:540 |
/v1/images/edits |
JSON ❌ side effect |
sd_server.cpp:573 |
/v1/images/edits (variations) |
JSON ❌ side effect |
For SD image edits/variations, a malformed non-JSON 200 previously surfaced as a backend error (the json::parse throw was caught at wrapped_server.cpp:221 → NetworkException). With this change it silently becomes {"text": "<garbage>"} returned as HTTP 200 — exactly the regression @fl0rianr flagged.
Crucially, the whisper.cpp backend doesn't go through this helper at all — WhisperServer does its own post_multipart + parse and already wraps non-JSON itself at whisper_server.cpp:557-563:
try {
return json::parse(res.body);
} catch (const json::parse_error&) {
return json{{"text", res.body}}; // already handled here
}So the only transcription backend that benefits from the forward_multipart_request() change is FastFlowLM, while the cost is changed semantics for SD.
Recommended fix (cleanest, matches existing convention): revert the wrapped_server.cpp change and instead do the try/parse/wrap locally in fastflowlm_server.cpp:398, mirroring exactly what whisper_server.cpp already does. This scopes the fallback to the one backend that needs it and leaves SD's strict-JSON contract intact — no opt-in flag needed.
// fastflowlm_server.cpp, replacing line 398
auto resp = forward_multipart_request("/v1/audio/transcriptions", fields);
// FLM returns raw text for text/srt/vtt; the helper already wrapped it as {"text": ...}
return resp;If you'd rather keep it in the shared helper, the alternative is an opt-in bool allow_raw_text_fallback = false param (default preserves SD behavior, transcription callers pass true) — that also satisfies the blocker, but duplicating the whisper pattern in FLM is more consistent with the codebase.
🟡 Negative test needs server-side validation to be deterministic
@fl0rianr also asked for a negative test for an unsupported response_format. Note that there is currently no validation of response_format in handle_audio_transcriptions — an unsupported value (e.g. "foo") is just forwarded to the backend, whose behavior (error vs. default-to-json) isn't guaranteed. So a negative test today wouldn't be deterministic.
To enforce the newly-documented supported-values list, add a check in server.cpp (around server.cpp:2182, before forwarding) that returns HTTP 400 with an invalid_request_error for any response_format not in {json, verbose_json, text, srt, vtt}. Then the negative test can assert 400.
🟢 Minor / nits
server.cpp:2188— guarding onresponse["text"].is_string()is good defensive code; iftextis somehow missing it correctly falls through to the JSON dump. 👍- The new early-
returninside theif (response.contains("error"))block (server.cpp:2174-2178) is a correct cleanup of the prior fall-through. 👍 - Content-Type
text/plainforsrt/vttmatches OpenAI's behavior. 👍 - Out of scope, but worth a glance:
sd_server.cpp:540and:573both forward to/v1/images/edits— if:573is the variations path, that may be a separate latent bug. Not part of this PR.
Test coverage
test_002b_transcription_response_formats is a solid positive test across all five formats. Once validation lands, please add the negative case @fl0rianr requested. Note this test requires a live whisper/FLM backend, so I couldn't execute it in this environment — the analysis above is from reading the code paths.
Overall: behavior is right, the docs update is accurate, but the shared-helper change should be narrowed to transcription before merge.
• fix/1833-audio-transcription-response-formats
Keep image responses strict so backend mistakes do not look like successful requests. Let audio transcription return plain text only when the user asks for text, srt, or vtt, and return a clear 400 error for unsupported response formats.
|
@fl0rianr @ramkrishna2910 thanks for the reviews. I updated the PR based on the feedback. The main issue was that the first version allowed plain text responses too broadly. That was okay for audio I changed it so:
In terms of files, the shared multipart helper is now strict by default, FastFlowLM and Whisper only allow plain The follow-up triggered by @ramkrishna2910 was useful too: it confirmed that the same helper function is used by Verification:
|
|
CI error is related: |
Summary
This PR fixes #1833 for the OpenAI-compatible transcription API.
Apps can request transcription results in different formats, including JSON, plain text, SRT, and VTT. Lemonade already accepted those requests, but plain text and subtitle formats were still wrapped in JSON instead of being returned directly. That could confuse apps expecting normal text or subtitle output.
This change makes
text,srt, andvttreturn raw text bodies.json,verbose_json, and errors still return JSON.Before, a
vttresponse looked like this:{ "text": "WEBVTT\n\n00:00:00.000 --> 00:00:04.000\nListen, read, repeat.\n" }After this change:
The same applies to
srt.Testing
I reviewed the full diff and kept the change scoped to transcription response formats.
I manually checked the endpoint with a WAV file using all supported response formats:
json,verbose_json,text,srt, andvtt. This confirmed the old behavior:text,srt, andvttwere returned as JSON wrappers.I ran
git diff --check. The commit also passed the repo pre-commit hooks, including Black.Documentation
Updated the OpenAI API docs to list the supported transcription response formats:
json,verbose_json,text,srt, andvtt.Fixes #1833