
asset_based: LLM structured output silently truncated by max_tokens, surfaces as cryptic JSON parse error #141

@jonathanzhan1975


Summary

When the asset_based pipeline generates a script for many assets (e.g. 9 PPT slides) using a Chinese LLM (DeepSeek), the structured-output JSON response is silently truncated by max_tokens, which then surfaces as an unhelpful Failed to parse LLM response as VideoScript: {...first 200 chars...} error. The user has no way to know the real cause is truncation.

Reproduction

  1. Open the Custom Media (自定义素材) pipeline.
  2. Upload 9 image assets with full Windows-style absolute paths (e.g. E:\works\study\Pixelle-Video\temp\assets_xxxxx\幻灯片N.PNG, ~80 chars each).
  3. Provide a Chinese article / intent and click Generate.
  4. The LLM is deepseek-v4-flash (or a similar model with a Chinese tokenizer).

Observed log:

DEBUG llm_service:_call_with_structured_output:247 - Structured output response length: 3028 chars
ERROR llm_service:__call__:202 - LLM call error (model=deepseek-v4-flash, ...): Failed to parse LLM response as VideoScript: {
  "scenes": [
    {
      "scene_number": 1,
      "asset_path": "E:\\works\\study\\Pixelle-Video\\temp\\assets_16cd17476857\\幻灯片1.PNG",
      "narrations": [
        "2026年高考强基计划报名已经结束,但真正的战役才刚刚开始。...
ValueError: Failed to parse LLM response as VideoScript: ...

The pipeline aborts before scene 1 is produced.

Root cause

Three independent issues compound here:

1. max_tokens=4000 is too tight for 9-scene Chinese structured output

pixelle_video/pipelines/asset_based.py L342-346:

script: VideoScript = await self.core.llm(
    prompt=prompt,
    response_type=VideoScript,
    temperature=0.8,
    max_tokens=4000
)

For 9 scenes, each carrying a full Windows asset path (~80 chars), 1-5 Chinese narrations of 30-100 chars each, and ~50 chars of JSON structure, the output easily exceeds 4000 tokens with DeepSeek's Chinese tokenizer (~1.0-1.5 tokens/char). The reproducer above returned 3028 chars, which at that ratio is roughly 3,000-4,500 tokens: close enough to the ceiling that the JSON was cut off mid-string.
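The arithmetic above can be reproduced with a rough back-of-the-envelope model. This is a sketch, not a tokenizer: the per-scene figures are the estimates quoted in this issue, and a single flat tokens-per-char ratio is a simplifying assumption (ASCII path characters usually tokenize more cheaply than Chinese text, and JSON escaping doubles every backslash in a Windows path).

```python
# Back-of-the-envelope size model for the asset_based structured output.
# All default figures are the estimates quoted in this issue (assumptions),
# and a single flat tokens-per-char ratio is a simplification, not a tokenizer.


def estimate_output_chars(
    num_scenes: int,
    path_chars: int = 80,
    narrations_per_scene: int = 5,
    chars_per_narration: int = 100,
    json_overhead_chars: int = 50,
) -> int:
    """Approximate length in chars of the JSON body the LLM has to emit."""
    per_scene = (
        path_chars
        + narrations_per_scene * chars_per_narration
        + json_overhead_chars
    )
    return num_scenes * per_scene


def estimate_output_tokens(chars: int, tokens_per_char: float = 1.5) -> int:
    """Convert a char count to tokens using a fixed tokens-per-char ratio."""
    return round(chars * tokens_per_char)


if __name__ == "__main__":
    # Upper end of the quoted ranges: 5 narrations x 100 chars per scene.
    worst_chars = estimate_output_chars(9)
    # Lower end: 1 narration x 30 chars per scene.
    best_chars = estimate_output_chars(9, narrations_per_scene=1, chars_per_narration=30)
    print("worst case:", worst_chars, "chars ->", estimate_output_tokens(worst_chars), "tokens")
    print("best case: ", best_chars, "chars ->", estimate_output_tokens(best_chars, 1.0), "tokens")
```

At the pessimistic end of the quoted ranges, 9 scenes come to 5670 chars and roughly 8500 tokens, so whichever new limit is chosen, it is worth leaving headroom above the typical case.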

2. _call_with_structured_output never checks finish_reason

pixelle_video/services/llm_service.py L238-247:

response = await client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": enhanced_prompt}],
    temperature=temperature,
    max_tokens=max_tokens,
    **kwargs
)
content = response.choices[0].message.content

logger.debug(f"Structured output response length: {len(content)} chars")

# Parse JSON from response content
return self._parse_response_as_model(content, response_type)

OpenAI-compatible APIs set response.choices[0].finish_reason to "length" when generation stopped because it hit max_tokens. This code never inspects that field, so a truncated body is passed straight to the JSON parser, which then fails for reasons that look unrelated.
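A stdlib-only illustration of that failure mode: a body cut off mid-string makes json.loads raise a generic syntax error whose message says nothing about max_tokens. The sample string is made up for illustration and is not taken from the log above.

```python
import json
from typing import Optional


def json_error(text: str) -> Optional[json.JSONDecodeError]:
    """Return the JSONDecodeError raised for `text`, or None if it parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return exc
    return None


# Illustrative body cut off mid-string, as happens when max_tokens is reached.
truncated = '{"scenes": [{"scene_number": 1, "narrations": ["The registration period has ended, but'

err = json_error(truncated)
print(err.msg)  # Unterminated string starting at
# err.pos points at the opening quote of the cut-off narration;
# nothing in the error hints at a token limit.
print(truncated[err.pos:err.pos + 10])
```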

3. _parse_response_as_model error message is misleading

pixelle_video/services/llm_service.py L292-320:

def _parse_response_as_model(self, content: str, response_type: Type[T]) -> T:
    # Try direct JSON parsing first
    try:
        data = json.loads(content)
        return response_type.model_validate(data)
    except json.JSONDecodeError:
        pass

    # Try extracting from markdown code block
    json_pattern = r'```(?:json)?\s*([\s\S]+?)\s*```'
    match = re.search(json_pattern, content, re.DOTALL)
    if match:
        try:
            data = json.loads(match.group(1))
            return response_type.model_validate(data)
        except json.JSONDecodeError:
            pass

    # Try to find any JSON object in the text
    brace_start = content.find('{')
    brace_end = content.rfind('}')
    if brace_start != -1 and brace_end > brace_start:
        try:
            json_str = content[brace_start:brace_end + 1]
            data = json.loads(json_str)
            return response_type.model_validate(data)
        except json.JSONDecodeError:
            pass

    raise ValueError(f"Failed to parse LLM response as {response_type.__name__}: {content[:200]}...")

Two further problems with this method:

a. The fallbacks only catch json.JSONDecodeError. If json.loads succeeds but response_type.model_validate(data) raises pydantic.ValidationError, that exception escapes the method immediately: any remaining fallbacks are skipped, the final ValueError is never raised, and the caller sees a different (less helpful) message that masks the real cause.

b. The final error message only shows content[:200]. For truncation bugs, the last 100 characters are far more diagnostic than the first 200, because that's where the truncation cliff lives. The user sees a clean-looking JSON header and is misled into thinking the LLM "returned a bad format" rather than "returned a complete-looking-but-truncated body".

Why this matters

A user uploading 9 reasonable assets (a normal use case for a slide-narration pipeline) currently gets:

  • A 30-second LLM call that consumes API quota
  • An abort with a cryptic "Failed to parse LLM response" message
  • No hint that the cure is to bump max_tokens or shorten the asset paths

The fix is short and doesn't require any new dependencies.

Suggested fix

A. Bump max_tokens for asset_based structured output

Single-line change in pixelle_video/pipelines/asset_based.py L346:

max_tokens=8000  # was 4000 — too tight for 9+ scenes with Chinese narrations

A more principled version would scale by asset count: max_tokens=max(4000, 1000 + 800 * len(assets)).
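The scaled formula can be packaged as a small helper. The floor, base and per-asset constants are the ones proposed in this issue; the optional `ceiling` parameter is an addition made here (an assumption, not part of the proposal), because providers cap output tokens per model and a request above the cap may be rejected. The real limit should come from the provider's documentation rather than a guess.

```python
from typing import Optional


def structured_output_max_tokens(
    num_assets: int,
    base: int = 1000,
    per_asset: int = 800,
    floor: int = 4000,
    ceiling: Optional[int] = None,
) -> int:
    """max(floor, base + per_asset * num_assets), optionally clamped to a model limit.

    `ceiling` is a hypothetical knob: pass the model's documented maximum
    output tokens if the budget should never exceed it.
    """
    if num_assets < 0:
        raise ValueError("num_assets must be non-negative")
    budget = max(floor, base + per_asset * num_assets)
    if ceiling is not None:
        budget = min(budget, ceiling)
    return budget


if __name__ == "__main__":
    for n in (1, 4, 9, 20):
        print(n, "assets ->", structured_output_max_tokens(n))
```

In asset_based.py the literal would then become something like max_tokens=structured_output_max_tokens(len(assets)), mirroring the len(assets) term in the formula above.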

B. Detect finish_reason == "length" and fail loudly

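In pixelle_video/services/llm_service.py::_call_with_structured_output, replace the lines from content = response.choices[0].message.content through the debug log (around L245-L247) with the following, so the check runs before the JSON parse: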

choice = response.choices[0]
content = choice.message.content
finish_reason = getattr(choice, "finish_reason", None)

logger.debug(f"Structured output response length: {len(content)} chars, finish_reason={finish_reason}")

if finish_reason == "length":
    raise ValueError(
        f"LLM response truncated by max_tokens={max_tokens} "
        f"(got {len(content)} chars, finish_reason='length'). "
        f"Increase max_tokens or reduce input size."
    )

This converts the silent truncation into an explicit, actionable error.
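A self-contained version of the same check that can be unit-tested without network access. Two assumptions go beyond the snippet in this fix: a dedicated LLMTruncatedError (hypothetical name) that subclasses ValueError, so existing `except ValueError` handlers keep working while callers can single out truncation; and an `or ""` guard in case a provider returns None content. The fake response mimics only the attributes the check reads (choices[0].message.content and choices[0].finish_reason).

```python
from types import SimpleNamespace
from typing import Any


class LLMTruncatedError(ValueError):
    """Hypothetical exception type: the provider stopped at max_tokens."""


def extract_structured_content(response: Any, max_tokens: int) -> str:
    """Return the first choice's text, failing loudly if it was cut off by max_tokens."""
    choice = response.choices[0]
    content = choice.message.content or ""  # guard: some providers may return None
    finish_reason = getattr(choice, "finish_reason", None)
    if finish_reason == "length":
        raise LLMTruncatedError(
            f"LLM response truncated by max_tokens={max_tokens} "
            f"(got {len(content)} chars, finish_reason='length'). "
            "Increase max_tokens or reduce input size."
        )
    return content


def fake_response(content: str, finish_reason: str) -> SimpleNamespace:
    """Stand-in for an OpenAI-compatible chat completion, for local testing only."""
    message = SimpleNamespace(content=content)
    choice = SimpleNamespace(message=message, finish_reason=finish_reason)
    return SimpleNamespace(choices=[choice])


if __name__ == "__main__":
    print(extract_structured_content(fake_response('{"scenes": []}', "stop"), 4000))
    try:
        extract_structured_content(fake_response('{"scenes": [{"scene', "length"), 4000)
    except LLMTruncatedError as exc:
        print("caught:", exc)
```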

C. Catch ValidationError and improve the parse-failure message

In _parse_response_as_model, also catch pydantic.ValidationError in the three fallbacks, and include the tail of the content in the final raise:

from pydantic import ValidationError

def _parse_response_as_model(self, content: str, response_type: Type[T]) -> T:
    parse_errors = []
    for attempt_name, candidate in self._json_candidates(content):
        try:
            data = json.loads(candidate)
            return response_type.model_validate(data)
        except (json.JSONDecodeError, ValidationError) as e:
            parse_errors.append(f"{attempt_name}: {type(e).__name__}: {e}")

    raise ValueError(
        f"Failed to parse LLM response as {response_type.__name__}.\n"
        f"  length: {len(content)} chars\n"
        f"  first 200: {content[:200]!r}\n"
        f"  last 100:  ...{content[-100:]!r}\n"
        f"  attempts: {' | '.join(parse_errors)}"
    )

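(Here _json_candidates is a small helper that yields (attempt_name, candidate) pairs for the three current strategies: the raw content, the body of a fenced markdown block, and the span from the first { to the last }.)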

Out of scope

  • Whether the prompt should ever produce paths in the output at all is a separate design question. Echoing 9 × 80-char Windows paths back into the response is wasteful regardless. A future improvement would be to have the LLM emit scene_index referencing an upstream asset list, instead of repeating the full path.
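A rough sketch of the scene_index idea, assuming a hypothetical IndexedScene shape with a 0-based scene_index into the asset list the pipeline already holds. A plain dataclass stands in for the real pydantic VideoScript model, and rejecting out-of-range indices is a design choice made here so that a hallucinated index fails loudly instead of silently picking the wrong slide.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class IndexedScene:
    """Hypothetical scene shape: the LLM returns an index instead of a full path."""
    scene_index: int
    narrations: List[str]


def resolve_asset_paths(scenes: Sequence[IndexedScene], asset_paths: Sequence[str]) -> List[str]:
    """Map each scene's 0-based scene_index back to the uploaded asset path."""
    resolved = []
    for scene in scenes:
        # Explicit bounds check: without it, -1 would silently pick the last asset.
        if not 0 <= scene.scene_index < len(asset_paths):
            raise ValueError(
                f"scene_index {scene.scene_index} out of range for {len(asset_paths)} assets"
            )
        resolved.append(asset_paths[scene.scene_index])
    return resolved


if __name__ == "__main__":
    paths = [r"C:\assets\slide1.png", r"C:\assets\slide2.png"]
    scenes = [IndexedScene(0, ["first narration"]), IndexedScene(1, ["second narration"])]
    print(resolve_asset_paths(scenes, paths))
```

The response then carries a small integer per scene instead of an ~80-char path (with every backslash escaped in JSON), and the full paths never have to round-trip through the model.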
