Fix #1833: Improve transcription response format compatibility by anditherobot · Pull Request #2088 · lemonade-sdk/lemonade

anditherobot · 2026-06-03T04:57:46Z

Summary

This PR fixes #1833 for the OpenAI-compatible transcription API.

Apps can request transcription results in different formats, including JSON, plain text, SRT, and VTT. Lemonade already accepted those requests, but plain text and subtitle formats were still wrapped in JSON instead of being returned directly. That could confuse apps expecting normal text or subtitle output.

This change makes text, srt, and vtt return raw text bodies. json, verbose_json, and errors still return JSON.

Before, a vtt response looked like this:

{
  "text": "WEBVTT\n\n00:00:00.000 --> 00:00:04.000\nListen, read, repeat.\n"
}

After this change:

WEBVTT

00:00:00.000 --> 00:00:04.000
Listen, read, repeat.

The same applies to srt.
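
As a rough illustration of the rule above, here is a minimal sketch of choosing between a raw body and a JSON body. It is not the code in this PR: `HttpPayload`, `wants_raw_text`, and `render_transcription` are hypothetical names, and using `text/plain` for all three raw formats is an assumption.

```cpp
#include <string>

// Hypothetical response holder; not a type from the lemonade codebase.
struct HttpPayload {
    std::string content_type;
    std::string body;
};

// True for the formats that are returned as raw text bodies.
bool wants_raw_text(const std::string& response_format) {
    return response_format == "text" || response_format == "srt" ||
           response_format == "vtt";
}

// Picks the body to send for a successful transcription.
// `text` is the transcript or subtitle document; `json_body` is the
// already-serialized JSON used for json and verbose_json.
HttpPayload render_transcription(const std::string& response_format,
                                 const std::string& text,
                                 const std::string& json_body) {
    if (wants_raw_text(response_format)) {
        return {"text/plain", text};  // assumed content type
    }
    return {"application/json", json_body};
}
```

With this shape, a `vtt` request gets the subtitle document as-is, while `json` and `verbose_json` keep the serialized JSON body.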

Testing

I reviewed the full diff and kept the change scoped to transcription response formats.

I manually checked the endpoint with a WAV file using all supported response formats: json, verbose_json, text, srt, and vtt. This confirmed the old behavior: text, srt, and vtt were returned as JSON wrappers.

I ran git diff --check. The commit also passed the repo pre-commit hooks, including Black.

Documentation

Updated the OpenAI API docs to list the supported transcription response formats: json, verbose_json, text, srt, and vtt.

Fixes #1833

Make transcription responses closer to the OpenAI spec: errors stay JSON, json and verbose_json stay JSON, and text, srt, and vtt now return raw text bodies. Update docs and add response_format test coverage.

fl0rianr

Thanks, good work! The transcription behavior looks right.
But there is a blocker regarding the scope of the forward_multipart_request() change. This helper is also used by SD image edits/variations, where successful responses are expected to be JSON. With this PR, any successful non-JSON response from those backends would become {"text": ...} and be returned to clients as HTTP 200 instead of surfacing as an invalid backend response. Could we make the non-JSON-success fallback opt-in and enable it only for transcription requests with response_format set to text, srt, or vtt?

I’d also add a negative test for an unsupported response_format value so the new documented supported-values list is enforced deterministically.

ramkrishna2910 · 2026-06-03T17:14:35Z

@claude review

github-actions · 2026-06-03T17:14:50Z

Claude finished @ramkrishna2910's task in 2m 9s

Code Review: Transcription response formats (#2088)

Read the PR diff and changed files
Trace forward_multipart_request() callers and backend transcription paths
Assess @fl0rianr's blocker and the requested negative test
Post review

Nice fix — the user-facing behavior (text/srt/vtt returning raw bodies, json/verbose_json staying JSON) is correct and well-tested. I agree with @fl0rianr's blocker, and tracing the code actually makes the case stronger. Details below.

🔴 Blocker: the `wrapped_server.cpp` change is broader than transcription — and not even needed for whisper

The change at src/cpp/server/wrapped_server.cpp:196-201 adds a non-JSON→{"text": ...} fallback to the shared forward_multipart_request() helper. Its callers are:

| Caller | Endpoint | Expected 200 body |
| --- | --- | --- |
| `whisper_server.cpp` | n/a | does not use this helper (see below) |
| `fastflowlm_server.cpp:398` | `/v1/audio/transcriptions` | text or JSON ✅ intended |
| `sd_server.cpp:540` | `/v1/images/edits` | JSON ❌ side effect |
| `sd_server.cpp:573` | `/v1/images/edits` (variations) | JSON ❌ side effect |

For SD image edits/variations, a malformed non-JSON 200 previously surfaced as a backend error (the json::parse throw was caught at wrapped_server.cpp:221 → NetworkException). With this change it silently becomes {"text": "<garbage>"} returned as HTTP 200 — exactly the regression @fl0rianr flagged.

Crucially, the whisper.cpp backend doesn't go through this helper at all — WhisperServer does its own post_multipart + parse and already wraps non-JSON itself at whisper_server.cpp:557-563:

try {
    return json::parse(res.body);
} catch (const json::parse_error&) {
    return json{{"text", res.body}};   // already handled here
}

So the only transcription backend that benefits from the forward_multipart_request() change is FastFlowLM, while the cost is changed semantics for SD.

Recommended fix (cleanest, matches existing convention): revert the wrapped_server.cpp change and instead do the try/parse/wrap locally in fastflowlm_server.cpp:398, mirroring exactly what whisper_server.cpp already does. This scopes the fallback to the one backend that needs it and leaves SD's strict-JSON contract intact — no opt-in flag needed.

// fastflowlm_server.cpp, replacing line 398
auto resp = forward_multipart_request("/v1/audio/transcriptions", fields);
// FLM returns raw text for text/srt/vtt; the helper already wrapped it as {"text": ...}
return resp;

If you'd rather keep it in the shared helper, the alternative is an opt-in bool allow_raw_text_fallback = false param (default preserves SD behavior, transcription callers pass true) — that also satisfies the blocker, but duplicating the whisper pattern in FLM is more consistent with the codebase.
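
For reference, the opt-in variant could look roughly like the sketch below. It is self-contained and only illustrative: `BackendReply` and `interpret_success_body` are hypothetical names, and `looks_like_json` is a crude stand-in for the real `json::parse` call, so this shows the control flow rather than the actual helper.

```cpp
#include <cctype>
#include <stdexcept>
#include <string>

// Hypothetical result type: records whether the body was treated as JSON.
struct BackendReply {
    bool is_json;
    std::string body;
};

// Crude stand-in for json::parse: only checks the first non-space character.
bool looks_like_json(const std::string& body) {
    for (unsigned char c : body) {
        if (std::isspace(c)) continue;
        return c == '{' || c == '[';
    }
    return false;  // empty or whitespace-only body
}

// Strict by default: a non-JSON success body is an error unless the caller
// explicitly opts in to the raw-text fallback.
BackendReply interpret_success_body(const std::string& body,
                                    bool allow_raw_text_fallback = false) {
    if (looks_like_json(body)) {
        return {true, body};
    }
    if (allow_raw_text_fallback) {
        return {false, body};  // caller would wrap this as {"text": body}
    }
    throw std::runtime_error("backend returned a non-JSON success body");
}
```

Image callers would keep the default and still see an error on a malformed 200, while a transcription caller would pass `true` only when `response_format` is text, srt, or vtt.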

🟡 Negative test needs server-side validation to be deterministic

@fl0rianr also asked for a negative test for an unsupported response_format. Note that there is currently no validation of response_format in handle_audio_transcriptions — an unsupported value (e.g. "foo") is just forwarded to the backend, whose behavior (error vs. default-to-json) isn't guaranteed. So a negative test today wouldn't be deterministic.

To enforce the newly-documented supported-values list, add a check in server.cpp (around server.cpp:2182, before forwarding) that returns HTTP 400 with an invalid_request_error for any response_format not in {json, verbose_json, text, srt, vtt}. Then the negative test can assert 400.
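
A minimal sketch of such a check, assuming C++17. `ApiError` and `validate_response_format` are hypothetical names rather than existing code in `server.cpp`, and the error message wording is made up for illustration.

```cpp
#include <array>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical error description; the real handler would build a JSON error.
struct ApiError {
    int status;
    std::string type;
    std::string message;
};

// Returns std::nullopt for a supported format, otherwise a 400-style error.
std::optional<ApiError> validate_response_format(std::string_view format) {
    static constexpr std::array<std::string_view, 5> kSupported = {
        "json", "verbose_json", "text", "srt", "vtt"};
    for (std::string_view supported : kSupported) {
        if (supported == format) {
            return std::nullopt;
        }
    }
    return ApiError{400, "invalid_request_error",
                    "Unsupported response_format: " + std::string(format)};
}
```

Running this before the request is forwarded makes the outcome independent of the backend, so a negative test can assert a 400 for a value such as `"foo"`.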

🟢 Minor / nits

server.cpp:2188 — guarding on response["text"].is_string() is good defensive code; if text is somehow missing it correctly falls through to the JSON dump. 👍
The new early-return inside the if (response.contains("error")) block (server.cpp:2174-2178) is a correct cleanup of the prior fall-through. 👍
Content-Type text/plain for srt/vtt matches OpenAI's behavior. 👍
Out of scope, but worth a glance: sd_server.cpp:540 and :573 both forward to /v1/images/edits — if :573 is the variations path, that may be a separate latent bug. Not part of this PR.

Test coverage

test_002b_transcription_response_formats is a solid positive test across all five formats. Once validation lands, please add the negative case @fl0rianr requested. Note this test requires a live whisper/FLM backend, so I couldn't execute it in this environment — the analysis above is from reading the code paths.

Overall: behavior is right, the docs update is accurate, but the shared-helper change should be narrowed to transcription before merge.
• fix/1833-audio-transcription-response-formats

Keep image responses strict so backend mistakes do not look like successful requests. Let audio transcription return plain text only when the user asks for text, srt, or vtt, and return a clear 400 error for unsupported response formats.

anditherobot · 2026-06-03T18:19:50Z

@fl0rianr @ramkrishna2910 thanks for the reviews. I updated the PR based on the feedback.

The main issue was that the first version allowed plain text responses too broadly. That was okay for audio transcription, but it could make image backend errors look like successful responses.

I changed it so:

Image requests still expect proper JSON responses.
Audio transcription can return plain text only when the user asks for text, srt, or vtt.
Unsupported transcription formats now get a clear 400 error instead of being passed through to the backend.
I added a test for an unsupported format.

In terms of files, the shared multipart helper is now strict by default, FastFlowLM and Whisper only allow plain text for transcription formats that expect it, and the public transcription endpoint now rejects unsupported formats early, before the request is forwarded to a backend.

The follow-up triggered by @ramkrishna2910 was useful too: it confirmed that the same helper function is used by both audio and image requests. I used the suggested opt-in approach, so plain text is now allowed only for audio transcription.

Verification:

Commit hooks passed.
Existing tests for the supported formats are still included.
Added the unsupported-format test.

fl0rianr · 2026-06-03T20:17:19Z

CI error is related:

FAIL: test_002b_transcription_response_formats (__main__.WhisperTests) (response_format='verbose_json')
Test audio transcription response_format handling.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\user\actions-runner\_work\lemonade\lemonade\test\server_whisper.py", line 246, in test_002b_transcription_response_formats
    self.assertIn("segments", result)
AssertionError: 'segments' not found in {'model': 'whisper-v3:turbo', 'text': ' And so, my fellow Americans, ask not what your country can do for you ask what you can do for your country.'}

----------------------------------------------------------------------
Ran 3 tests in 14.913s

FAILED (failures=1)

fix: improve transcription response format compatibility

97e126b

anditherobot mentioned this pull request Jun 3, 2026

Update OpenAI Compatible /v1/audio/transcriptions with more response formats #1833

Open

fl0rianr requested changes Jun 3, 2026

jeremyfowers requested a review from ramkrishna2910 June 3, 2026 13:40

Address PR lemonade-sdk#2088 review feedback

9b00041

Andi M added 3 commits June 3, 2026 19:20

Allow compact verbose transcription JSON

75787f3

Allow FastFlowLM subtitle plain text

adbef31

Merge remote-tracking branch 'origin/main' into fix/1833-audio-transc…

c294584

…ription-response-formats

github-actions Bot added the audio, bug, and enhancement labels Jun 6, 2026

anditherobot wants to merge 5 commits into lemonade-sdk:main from anditherobot:fix/1833-audio-transcription-response-formats