🐛 Regression in v1.2.4: Multimodal input_audio in HumanMessage is flattened into text #9811

@bing6

Checked other resources

  • This is a bug, not a usage question. For questions, please use the LangChain Forum (https://forum.langchain.com/).
  • I added a very descriptive title to this issue.
  • I searched the LangChain.js documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain.js rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

Reproduction

import { HumanMessage } from '@langchain/core/messages';

// `castAudioContent` is the reporter's audio source object;
// `data` is assumed to hold base64-encoded WAV audio.
const userInput = new HumanMessage({
  content: [
    { type: 'text', text: 'a' },
    {
      type: 'input_audio',
      input_audio: {
        data: castAudioContent.data, // base64
        format: 'wav',
      },
    },
    { type: 'text', text: 'b' },
    {
      type: 'input_audio',
      input_audio: {
        data: castAudioContent.data, // base64
        format: 'wav',
      },
    },
  ],
});
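
For context, a minimal sketch of how the flattening can be observed follows. The handler wiring is an illustration only; `model` stands in for whatever chat model instance the application uses and is not part of the original report.

import { BaseCallbackHandler } from '@langchain/core/callbacks/base';

// Logs whatever handleLLMStart receives as the prompt.
class PromptLogger extends BaseCallbackHandler {
  name = 'prompt-logger';

  handleLLMStart(_llm: unknown, prompts: string[]) {
    // Since v1.2.4 this logs the flattened text ("ab");
    // before v1.2.4 the structured content was still visible here.
    console.log(prompts);
  }
}

await model.invoke([userInput], { callbacks: [new PromptLogger()] });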

Error Message and Stack Trace (if applicable)

Behavior before v1.2.4

Observed in handleLLMStart:

human: [
  { type: 'text', text: 'a' },
  {
    type: 'input_audio',
    input_audio: {
      data: 'base64...',
      format: 'wav',
    },
  },
  { type: 'text', text: 'b' },
  {
    type: 'input_audio',
    input_audio: {
      data: 'base64...',
      format: 'wav',
    },
  },
]

The multimodal structure is preserved and audio input works as expected.


Behavior since v1.2.4

Observed in handleLLMStart:

human: "ab"

All input_audio segments are dropped, the content is flattened into
plain text, and audio input no longer works.

Description

Summary

Starting from LangChain v1.2.4, multimodal HumanMessage.content
(for example, mixing text and input_audio) is flattened into a plain
text string when observed in handleLLMStart.

As a result, input_audio segments are dropped entirely and audio input
no longer works.

This behavior is different from versions before v1.2.4, where the
original structured content array was preserved.

Expected Behavior

  • HumanMessage.content should preserve its original multimodal
    structure in handleLLMStart
  • Alternatively, a new hook or documented mechanism should be provided
    to access raw multimodal prompts (a possible interim workaround is
    sketched after this list)
  • If this change is intentional, an explicit migration strategy should
    be documented
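
As a possible interim workaround (untested against the affected versions, so treat this as an assumption), handleChatModelStart receives the structured BaseMessage objects rather than formatted string prompts, and may therefore still expose the original content array:

import { BaseCallbackHandler } from '@langchain/core/callbacks/base';
import type { BaseMessage } from '@langchain/core/messages';

// Sketch: inspect structured messages instead of flattened string prompts.
class MultimodalInspector extends BaseCallbackHandler {
  name = 'multimodal-inspector';

  handleChatModelStart(_llm: unknown, messages: BaseMessage[][]) {
    for (const batch of messages) {
      for (const message of batch) {
        // message.content should still be the structured array here,
        // assuming the flattening happens only on the string-prompt path.
        console.log(message.content);
      }
    }
  }
}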

Impact

  • Multimodal input using HumanMessage.content[] is broken in
    versions >= v1.2.4
  • handleLLMStart can no longer be used to inspect or intercept
    multimodal prompts
  • This appears to be a breaking change, but no clear migration path or
    changelog entry was found

Additional Notes

This behavior suggests a prompt normalization or serialization step
introduced in v1.2.4 that concatenates text segments and ignores
non-text segments such as input_audio.
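
For illustration only, a normalization step of roughly the following shape would produce the observed "ab". This is a hypothetical sketch, not the actual LangChain.js source:

type ContentPart =
  | { type: 'text'; text: string }
  | { type: 'input_audio'; input_audio: { data: string; format: string } };

// Hypothetical: concatenate text parts and silently drop everything else.
function flattenContent(parts: ContentPart[]): string {
  return parts
    .filter((part): part is Extract<ContentPart, { type: 'text' }> => part.type === 'text')
    .map((part) => part.text)
    .join('');
}

// For the reproduction above, this yields "ab": both input_audio parts vanish.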

If this is an intentional design change, clarification and documentation
would be appreciated.

System Info

Environment

  • LangChain version: >= v1.2.4
  • Runtime: Node.js
  • Use case: Audio / multimodal input via HumanMessage
