Description
Checked other resources
- This is a bug, not a usage question. For questions, please use the LangChain Forum (https://forum.langchain.com/).
- I added a very descriptive title to this issue.
- I searched the LangChain.js documentation with the integrated search.
- I used the GitHub search to find a similar question and didn't find it.
- I am sure that this is a bug in LangChain.js rather than my code.
- The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).
Example Code
Reproduction

```typescript
const userInput = new HumanMessage({
  content: [
    { type: 'text', text: 'a' },
    {
      type: 'input_audio',
      input_audio: {
        data: castAudioContent.data, // base64-encoded audio
        format: 'wav',
      },
    },
    { type: 'text', text: 'b' },
    {
      type: 'input_audio',
      input_audio: {
        data: castAudioContent.data, // base64-encoded audio
        format: 'wav',
      },
    },
  ],
});
```

Error Message and Stack Trace (if applicable)
Behavior before v1.2.4
Observed in handleLLMStart:
```
human: [
  { type: 'text', text: 'a' },
  {
    type: 'input_audio',
    input_audio: {
      data: 'base64...',
      format: 'wav',
    },
  },
  { type: 'text', text: 'b' },
  {
    type: 'input_audio',
    input_audio: {
      data: 'base64...',
      format: 'wav',
    },
  },
]
```

The multimodal structure is preserved, and audio input works as expected.
Behavior since v1.2.4
Observed in handleLLMStart:
```
human: "ab"
```

All input_audio segments are dropped, the content is flattened into a
plain text string, and audio input no longer works.
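The loss is easy to see from the consumer side. The sketch below uses plain local data shaped like the two observations above (the `Part` type and `countAudioParts` helper are illustrative stand-ins, not LangChain.js types): a handler that needs the audio must receive the structured form, and nothing in the flattened string lets it recover the `input_audio` segments.

```typescript
// Illustrative stand-ins for what handleLLMStart reported before and
// after v1.2.4 (shapes copied from the observations above; these types
// are local to this sketch, not LangChain's own).
type Part =
  | { type: "text"; text: string }
  | { type: "input_audio"; input_audio: { data: string; format: string } };

const before: Part[] = [
  { type: "text", text: "a" },
  { type: "input_audio", input_audio: { data: "base64...", format: "wav" } },
  { type: "text", text: "b" },
  { type: "input_audio", input_audio: { data: "base64...", format: "wav" } },
];
const after: Part[] | string = "ab"; // parts collapsed to a plain string

// Count the audio segments a callback could still act on:
const countAudioParts = (content: Part[] | string): number =>
  typeof content === "string"
    ? 0
    : content.filter((p) => p.type === "input_audio").length;

console.log(countAudioParts(before)); // 2
console.log(countAudioParts(after)); // 0 — the audio is unrecoverable
```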
Description
Summary
Starting from LangChain v1.2.4, multimodal HumanMessage.content
(for example, mixing text and input_audio) is flattened into a plain
text string when observed in handleLLMStart.
As a result, input_audio segments are dropped entirely and audio input
no longer works.
This behavior is different from versions before v1.2.4, where the
original structured content array was preserved.
Expected Behavior
- HumanMessage.content should preserve its original multimodal structure in handleLLMStart
- Alternatively, a new hook or documented mechanism should be provided to access raw multimodal prompts
- If this change is intentional, an explicit migration strategy should be documented
Impact
- Multimodal input using HumanMessage.content[] is broken in versions >= v1.2.4
- handleLLMStart can no longer be used to inspect or intercept multimodal prompts
- This appears to be a breaking change, but no clear migration path or changelog entry was found
Additional Notes
This behavior suggests a prompt normalization or serialization step
introduced in v1.2.4 that concatenates text segments and ignores
non-text segments such as input_audio.
If this is an intentional design change, clarification and documentation
would be appreciated.
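For concreteness, the suspected step can be sketched as follows. This is a hypothetical reconstruction consistent with the observed output, not actual LangChain.js source; the `ContentPart` type and `flattenContent` function are names invented for this sketch.

```typescript
// Hypothetical reconstruction of the suspected normalization step:
// text segments are concatenated and everything else is silently dropped.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "input_audio"; input_audio: { data: string; format: string } };

function flattenContent(parts: ContentPart[]): string {
  return parts
    .filter((p): p is Extract<ContentPart, { type: "text" }> => p.type === "text")
    .map((p) => p.text)
    .join("");
}

const observed = flattenContent([
  { type: "text", text: "a" },
  { type: "input_audio", input_audio: { data: "base64...", format: "wav" } },
  { type: "text", text: "b" },
  { type: "input_audio", input_audio: { data: "base64...", format: "wav" } },
]);

console.log(observed); // "ab" — matches the flattened output seen since v1.2.4
```

A step like this would explain both symptoms at once: the concatenated `"ab"` string and the complete disappearance of the `input_audio` segments.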
System Info
Environment
- LangChain version: >= v1.2.4
- Runtime: Node.js
- Use case: Audio / multimodal input via HumanMessage