
fix-openai-toolcall-after-thinking #20333 #20725

Open
martinalupini wants to merge 5 commits into run-llama:main from martinalupini:main

Conversation

@martinalupini

Description

Fixes #20333

This PR fixes an issue in OpenAIResponses where reasoning items were serialized
as ID references inside to_openai_responses_message_dict().

  • When store=False, reasoning items are not persisted server-side,
    causing subsequent tool calls referencing those IDs to fail with a 400 error.

  • When store=True, reasoning items were not structured according to the
    Responses API requirements, leading to validation errors.

The fix omits reasoning items when converting a ChatMessage to an OpenAI message dict,
preventing invalid ID references and allowing tool calls to work correctly
after a reasoning step. The motivation behind this choice is that reasoning items represent internal model artifacts and are
not part of the conversational history. They should not be propagated
across requests or re-injected into the input history.

Behavior Change

This PR updates the serialization logic so that reasoning blocks are
completely ignored during conversion to the OpenAI message dict.
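
As an illustration, here is a minimal sketch of the omission logic (item shapes are assumed for the example; this is not the actual llama-index code):

def blocks_to_input_items(blocks):
    # Drop reasoning items: their IDs may reference server-side state
    # that does not exist when store=False, causing a 400 on reuse.
    return [b for b in blocks if b.get("type") != "reasoning"]

history = [
    {"type": "reasoning", "id": "rs_123"},
    {"type": "function_call", "call_id": "call_abc", "name": "add", "arguments": "{}"},
]
print(blocks_to_input_items(history))  # reasoning item omitted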

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

  • Yes
  • No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes
  • No

Type of Change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Your pull-request will likely not be merged unless it is covered by some form of impactful unit testing.

  • I added new unit tests to cover this change
  • I believe this change is already covered by existing unit tests

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran uv run make format; uv run make lint to appease the lint gods

@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Feb 17, 2026
@AstraBert (Member) left a comment

I was taking a look at the OpenAI Responses API reference for creating a response, and a ResponseReasoningItem is supported as an input. I would prefer that we adapt the behavior to store=True/False and pass the reasoning item back as a ResponseInputItem when store=True, rather than dropping the reasoning :)

@martinalupini (Author)

Hi @AstraBert

Thanks for the suggestion.

After investigating further, the issue turns out to be related not only to store=True/False, but also to the sequence requirements imposed by the Responses API when reasoning and tool calls are combined.

Even when store=True, the problem remains because a reasoning item must be immediately followed by the assistant item it refers to. The API expects a structure like:

[
  { "type": "reasoning", ... },
  { "role": "assistant", "content": ... }
]

However, our current implementation, when both reasoning and tool calls are present, returns:

[
  { "type": "reasoning", ... },
  { "type": "function_call", ... }
]

The problem is that a function_call item is not considered a valid assistant message following a reasoning item. As a result, even with store=True, the API raises:

"Item 'rs_...' of type 'reasoning' was provided without its required following item."

So the root cause is the structure of the returned sequence, not just persistence.
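
To make the constraint concrete, here is a small self-contained check (my own illustration, not part of the PR) that flags the shape the API rejects:

def first_orphaned_reasoning(items):
    """Return the first reasoning item not immediately followed by an
    assistant item, i.e. the shape that triggers the 400 above."""
    for i, item in enumerate(items):
        if item.get("type") != "reasoning":
            continue
        nxt = items[i + 1] if i + 1 < len(items) else None
        if nxt is None or nxt.get("role") != "assistant":
            return item
    return None

bad = [{"type": "reasoning", "id": "rs_1"}, {"type": "function_call"}]
ok = [{"type": "reasoning", "id": "rs_1"}, {"role": "assistant", "content": "8"}]
assert first_orphaned_reasoning(bad) is not None
assert first_orphaned_reasoning(ok) is None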

A structurally valid alternative would be to return an assistant message that contains the tool calls, instead of returning standalone function_call items. For example, in the function to_openai_responses_message_dicts() from llama_index.llms.openai.utils:

elif tool_calls:
    # Wrap the tool calls in an assistant message so a preceding
    # reasoning item is immediately followed by a valid assistant item.
    assistant_message = {
        "role": "assistant",
        "content": None,
        "tool_calls": tool_calls,
    }
    if reasoning:
        return [*reasoning, assistant_message]
    return [assistant_message]

instead of the current:

elif tool_calls:
    return [*reasoning, *tool_calls]

But this seems more invasive than dropping the reasoning, which in my view adds nothing to the conversational history.

Let me know what you think about this alternative or if you have other ideas. Thanks again for your suggestion though! :)

@AstraBert (Member) commented Feb 19, 2026

The thing I am not so sure about is that reasoning items are useless to the conversation history. The OpenAI API reference (linked above) describes the ReasoningItem as: "A description of the chain of thought used by a reasoning model while generating a response. Be sure to include these items in your input to the Responses API for subsequent turns of a conversation if you are manually managing context". My feeling (I did the implementation of the reasoning-to-thinking-block conversion and vice versa in the first place) is that we are missing some critical pieces when we collect outputs from the Responses API, pieces that would solve this issue without having to drop the thinking entirely. If you want to take a look at that, feel free; otherwise I am more than happy to take this on :)

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Feb 19, 2026
@martinalupini (Author) commented Feb 19, 2026

Thank you again for your feedback.

Your point that we are missing some pieces when collecting outputs from the Responses API definitely makes sense, and I understand your perspective on not removing the reasoning.
Another aspect worth considering is the structural issue in how items are returned and in what sequence, as I mentioned in my previous message. What do you think about that?

In the meantime, I updated the code to propagate the store information and to explicitly omit reasoning items when store=False, avoiding the 400 error caused by non-persisted IDs. The store=True path is unchanged.
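
Roughly, the idea looks like this (a sketch with assumed names and item shapes, not the exact diff):

def blocks_to_input_items_with_store(blocks, store):
    items = []
    for block in blocks:
        if block.get("type") == "reasoning" and not store:
            # store=False: the reasoning ID was never persisted, so
            # re-sending it would fail with a 400 on the next request.
            continue
        items.append(block)
    return items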

Regarding the case where both reasoning and tool calls are present, I found an example in the official documentation:

[
  {
    "id": "rs_6890ed2b6374819dbbff5353e6664ef103f4db9848be4829",
    "type": "reasoning",
    "content": [],
    "summary": []
  },
  {
    "id": "ctc_6890ed2f32e8819daa62bef772b8c15503f4db9848be4829",
    "type": "custom_tool_call",
    "status": "completed",
    "call_id": "call_pmlLjmvG33KJdyVdC4MVdk5N",
    "input": "4 + 4",
    "name": "math_exp"
  }
]
