Fix #1833: Improve transcription response format compatibility #2088

Open
anditherobot wants to merge 5 commits into lemonade-sdk:main from anditherobot:fix/1833-audio-transcription-response-formats

anditherobot (Contributor) commented Jun 3, 2026

Summary

This PR fixes #1833 for the OpenAI-compatible transcription API.

Apps can request transcription results in different formats, including JSON, plain text, SRT, and VTT. Lemonade already accepted those requests, but plain text and subtitle formats were still wrapped in JSON instead of being returned directly. That could confuse apps expecting normal text or subtitle output.

This change makes text, srt, and vtt return raw text bodies. json, verbose_json, and errors still return JSON.

Before, a vtt response looked like this:

{
  "text": "WEBVTT\n\n00:00:00.000 --> 00:00:04.000\nListen, read, repeat.\n"
}

After this change:

WEBVTT

00:00:00.000 --> 00:00:04.000
Listen, read, repeat.

The same applies to srt.
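
From a client's point of view, the change means the response body has to be read differently depending on the requested format. A minimal sketch of that branching follows; `read_transcription` is a hypothetical client-side helper invented for illustration, not part of Lemonade, and it assumes JSON responses carry the transcript under a `text` key as in the example above.

```python
import json

# Formats whose successful responses are JSON documents; the rest are raw text.
JSON_FORMATS = {"json", "verbose_json"}
RAW_TEXT_FORMATS = {"text", "srt", "vtt"}


def read_transcription(response_format: str, body: str) -> str:
    """Return the transcript from a successful response body.

    Hypothetical helper for illustration only.
    """
    if response_format in JSON_FORMATS:
        # json / verbose_json: the transcript is assumed to live under "text".
        return json.loads(body)["text"]
    if response_format in RAW_TEXT_FORMATS:
        # text / srt / vtt: the body already is the transcript or subtitle file.
        return body
    raise ValueError(f"unsupported response_format: {response_format!r}")


# Sample bodies modeled on the before/after example in this description.
vtt_body = "WEBVTT\n\n00:00:00.000 --> 00:00:04.000\nListen, read, repeat.\n"
json_body = json.dumps({"text": "Listen, read, repeat."})
```

With this PR, a client requesting `vtt` can write the body straight to a `.vtt` file instead of unwrapping a JSON object first.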

Testing

I reviewed the full diff and kept the change scoped to transcription response formats.

I manually checked the endpoint with a WAV file using all supported response formats: json, verbose_json, text, srt, and vtt. This confirmed the old behavior: text, srt, and vtt were returned as JSON wrappers.

I ran git diff --check. The commit also passed the repo pre-commit hooks, including Black.

Documentation

Updated the OpenAI API docs to list the supported transcription response formats: json, verbose_json, text, srt, and vtt.

Fixes #1833

Make transcription responses closer to the OpenAI spec: errors stay JSON, json and verbose_json stay JSON, and text, srt, and vtt now return raw text bodies. Update docs and add response_format test coverage.

@fl0rianr (Collaborator) left a comment

Thanks, good work! The transcription behavior looks right.
But there is a blocker regarding the scope of the forward_multipart_request() change. This helper is also used by SD image edits/variations, where successful responses are expected to be JSON. With this PR, any successful non-JSON response from those backends would become {"text": ...} and be returned to clients as HTTP 200 instead of surfacing as an invalid backend response. Could we make the non-JSON-success fallback opt-in and enable it only for transcription requests with response_format set to text, srt, or vtt?

I’d also add a negative test for an unsupported response_format value so the new documented supported-values list is enforced deterministically.

@ramkrishna2910 (Contributor)

@claude review

github-actions Bot (Contributor) commented Jun 3, 2026

Claude finished @ramkrishna2910's task in 2m 9s


Code Review: Transcription response formats (#2088)

  • Read the PR diff and changed files
  • Trace forward_multipart_request() callers and backend transcription paths
  • Assess @fl0rianr's blocker and the requested negative test
  • Post review

Nice fix — the user-facing behavior (text/srt/vtt returning raw bodies, json/verbose_json staying JSON) is correct and well-tested. I agree with @fl0rianr's blocker, and tracing the code actually makes the case stronger. Details below.

🔴 Blocker: the wrapped_server.cpp change is broader than transcription — and not even needed for whisper

The change at src/cpp/server/wrapped_server.cpp:196-201 adds a non-JSON→{"text": ...} fallback to the shared forward_multipart_request() helper. Its callers are:

| Caller | Endpoint | Expected 200 body |
| --- | --- | --- |
| whisper_server.cpp | does not use this helper (see below) | |
| fastflowlm_server.cpp:398 | /v1/audio/transcriptions | text or JSON ✅ intended |
| sd_server.cpp:540 | /v1/images/edits | JSON ❌ side effect |
| sd_server.cpp:573 | /v1/images/edits (variations) | JSON ❌ side effect |

For SD image edits/variations, a malformed non-JSON 200 previously surfaced as a backend error (the json::parse throw was caught at wrapped_server.cpp:221 → NetworkException). With this change it silently becomes {"text": "<garbage>"} returned as HTTP 200, which is exactly the regression @fl0rianr flagged.

Crucially, the whisper.cpp backend doesn't go through this helper at all. WhisperServer does its own post_multipart + parse and already wraps non-JSON itself at whisper_server.cpp:557-563:

try {
    return json::parse(res.body);
} catch (const json::parse_error&) {
    return json{{"text", res.body}};   // already handled here
}

So the only transcription backend that benefits from the forward_multipart_request() change is FastFlowLM, while the cost is changed semantics for SD.

Recommended fix (cleanest, matches existing convention): revert the wrapped_server.cpp change and instead do the try/parse/wrap locally in fastflowlm_server.cpp:398, mirroring exactly what whisper_server.cpp already does. This scopes the fallback to the one backend that needs it and leaves SD's strict-JSON contract intact — no opt-in flag needed.

// fastflowlm_server.cpp, replacing line 398
auto resp = forward_multipart_request("/v1/audio/transcriptions", fields);
// FLM returns raw text for text/srt/vtt; the helper already wrapped it as {"text": ...}
return resp;

If you'd rather keep it in the shared helper, the alternative is an opt-in bool allow_raw_text_fallback = false param (default preserves SD behavior, transcription callers pass true) — that also satisfies the blocker, but duplicating the whisper pattern in FLM is more consistent with the codebase.
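
The opt-in alternative can be modeled in a few lines. This is a Python sketch of the idea only, not the C++ helper: `parse_backend_body` and `InvalidBackendResponse` are hypothetical names, and the real `forward_multipart_request()` signature is not shown in this thread.

```python
import json


class InvalidBackendResponse(Exception):
    """Stand-in for the error the real helper raises on a malformed body."""


def parse_backend_body(body: str, allow_raw_text_fallback: bool = False):
    """Parse a successful backend body, strict by default.

    Hypothetical model of the shared helper's parsing step.
    """
    try:
        return json.loads(body)
    except json.JSONDecodeError as exc:
        if allow_raw_text_fallback:
            # Only callers that opted in (transcription with text/srt/vtt)
            # get the {"text": ...} wrapper.
            return {"text": body}
        # Everyone else (e.g. image edits) keeps the strict-JSON contract.
        raise InvalidBackendResponse("backend returned a non-JSON body") from exc


def wants_raw_text(response_format: str) -> bool:
    """Transcription callers would pass this result as the opt-in flag."""
    return response_format in {"text", "srt", "vtt"}
```

Because the flag defaults to `False`, existing callers keep their current behavior without any change at the call site.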

🟡 Negative test needs server-side validation to be deterministic

@fl0rianr also asked for a negative test for an unsupported response_format. Note that there is currently no validation of response_format in handle_audio_transcriptions — an unsupported value (e.g. "foo") is just forwarded to the backend, whose behavior (error vs. default-to-json) isn't guaranteed. So a negative test today wouldn't be deterministic.

To enforce the newly-documented supported-values list, add a check in server.cpp (around server.cpp:2182, before forwarding) that returns HTTP 400 with an invalid_request_error for any response_format not in {json, verbose_json, text, srt, vtt}. Then the negative test can assert 400.
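
A rough sketch of that check, written in Python for brevity (the real one would live in the C++ handler). The function name and the exact error body layout are assumptions; only the 400 status, the `invalid_request_error` type, and the list of supported values come from this review.

```python
SUPPORTED_RESPONSE_FORMATS = ("json", "verbose_json", "text", "srt", "vtt")


def validate_response_format(value: str):
    """Return None if the format is supported, else a (status, body) pair.

    Hypothetical validator; the error body shape is an assumption modeled
    on OpenAI-style error objects.
    """
    if value in SUPPORTED_RESPONSE_FORMATS:
        return None
    message = (
        f"Unsupported response_format {value!r}. "
        f"Supported values: {', '.join(SUPPORTED_RESPONSE_FORMATS)}."
    )
    return 400, {"error": {"message": message, "type": "invalid_request_error"}}
```

Running the check before forwarding means the result no longer depends on how a particular backend treats unknown values.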

🟢 Minor / nits

  • server.cpp:2188 — guarding on response["text"].is_string() is good defensive code; if text is somehow missing it correctly falls through to the JSON dump. 👍
  • The new early-return inside the if (response.contains("error")) block (server.cpp:2174-2178) is a correct cleanup of the prior fall-through. 👍
  • Content-Type text/plain for srt/vtt matches OpenAI's behavior. 👍
  • Out of scope, but worth a glance: sd_server.cpp:540 and :573 both forward to /v1/images/edits — if :573 is the variations path, that may be a separate latent bug. Not part of this PR.

Test coverage

test_002b_transcription_response_formats is a solid positive test across all five formats. Once validation lands, please add the negative case @fl0rianr requested. Note this test requires a live whisper/FLM backend, so I couldn't execute it in this environment — the analysis above is from reading the code paths.
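
For reference, the negative case could look roughly like the sketch below. It uses a stub in place of the HTTP call so it runs without a backend; `fake_post_transcription` and the test class name are hypothetical, and a real test in server_whisper.py would post the audio file to the live endpoint instead.

```python
import unittest


def fake_post_transcription(response_format):
    """Stub standing in for a POST to /v1/audio/transcriptions.

    Returns (status_code, json_body). Hypothetical; mimics a server that
    validates response_format before forwarding.
    """
    supported = {"json", "verbose_json", "text", "srt", "vtt"}
    if response_format not in supported:
        return 400, {"error": {"type": "invalid_request_error"}}
    return 200, {"text": "ok"}


class UnsupportedFormatTest(unittest.TestCase):
    # Swap this for a real HTTP call when running against a live server.
    post = staticmethod(fake_post_transcription)

    def test_unsupported_response_format_is_rejected(self):
        for bad in ("foo", "xml"):
            # subTest reports each rejected value separately on failure.
            with self.subTest(response_format=bad):
                status, body = self.post(bad)
                self.assertEqual(status, 400)
                self.assertEqual(body["error"]["type"], "invalid_request_error")
```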

Overall: behavior is right, the docs update is accurate, but the shared-helper change should be narrowed to transcription before merge.

Keep image responses strict so backend mistakes do not look like successful requests. Let audio transcription return plain text only when the user asks for text, srt, or vtt, and return a clear 400 error for unsupported response formats.

@anditherobot (Contributor, Author)

@fl0rianr @ramkrishna2910 thanks for the reviews. I updated the PR based on the feedback.

The main issue was that the first version allowed plain text responses too broadly. That was okay for audio transcription, but it could make image backend errors look like successful responses.

I changed it so:

  • Image requests still expect proper JSON responses.
  • Audio transcription can return plain text only when the user asks for text, srt, or vtt.
  • Unsupported transcription formats now get a clear 400 error instead of being passed through to the backend.
  • I added a test for an unsupported format.

In terms of files, the shared multipart helper is now strict by default, FastFlowLM and Whisper allow plain text only for the transcription formats that expect it, and the public transcription endpoint now rejects unsupported formats early, which also saves a round trip to the backend.

The follow-up triggered by @ramkrishna2910 was useful too: it confirmed that the same helper function is used by both audio and image requests. I used the suggested opt-in approach, so plain text is now allowed only for audio transcription.

Verification:

  • Commit hooks passed.
  • Existing tests for the supported formats are still included.
  • Added the unsupported-format test.

fl0rianr (Collaborator) commented Jun 3, 2026

CI error is related:

FAIL: test_002b_transcription_response_formats (__main__.WhisperTests) (response_format='verbose_json')
Test audio transcription response_format handling.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\user\actions-runner\_work\lemonade\lemonade\test\server_whisper.py", line 246, in test_002b_transcription_response_formats
    self.assertIn("segments", result)
AssertionError: 'segments' not found in {'model': 'whisper-v3:turbo', 'text': ' And so, my fellow Americans, ask not what your country can do for you ask what you can do for your country.'}

----------------------------------------------------------------------
Ran 3 tests in 14.913s

FAILED (failures=1)

github-actions Bot added the audio, bug, and enhancement labels Jun 6, 2026
Development

Successfully merging this pull request may close these issues.

Update OpenAI Compatible /v1/audio/transcriptions with more response formats

3 participants