117 changes: 117 additions & 0 deletions PR_DESCRIPTION.md
@@ -0,0 +1,117 @@
## Summary

Migrates native OpenAI API calls from the `chat.completions.create` endpoint to the newer `responses.create` endpoint, as recommended by OpenAI for new integrations.

Fixes #11624

## Changes

### Core Changes
1. Updated OpenAI provider in `llm.py` to use `client.responses.create()`
2. Added `extract_responses_api_reasoning()` helper to parse reasoning output (handles both string and array summary formats)
3. Added `extract_responses_api_tool_calls()` helper to parse function calls
4. Added error handling for API and timeout errors (matching the Anthropic provider pattern)
5. Extract system messages to `instructions` parameter (Responses API requirement)

### Parameter Mapping (Chat Completions → Responses API)
1. `messages` → `input` (non-system messages only)
2. System messages → `instructions` parameter
3. `max_completion_tokens` → `max_output_tokens`
4. `response_format={...}` → `text={"format":{...}}`
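
As a rough sketch, the request-side mapping looks like this (mirroring the `responses_params` dict built in `llm.py`; `build_and_send` is a hypothetical helper and the literal token limit is illustrative):

```python
# Illustrative sketch of the Chat Completions -> Responses API request mapping.
import openai


async def build_and_send(client: openai.AsyncOpenAI, model: str, prompt: list[dict]):
    # System messages move into `instructions`; everything else becomes `input`.
    system_messages = [m["content"] for m in prompt if m["role"] == "system"]
    input_messages = [m for m in prompt if m["role"] != "system"]

    params: dict = {
        "model": model,
        "input": input_messages,    # was: messages=prompt
        "max_output_tokens": 1024,  # was: max_completion_tokens=1024
    }
    if system_messages:
        params["instructions"] = " ".join(system_messages)
    # was: response_format={"type": "json_object"}
    params["text"] = {"format": {"type": "json_object"}}

    return await client.responses.create(**params)
```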

### Response Parsing (Chat Completions → Responses API)
1. `choices[0].message.content` → `output_text`
2. `usage.prompt_tokens` → `usage.input_tokens`
3. `usage.completion_tokens` → `usage.output_tokens`
4. `choices[0].message.tool_calls` → `output` items with `type="function_call"`
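
And a hedged sketch of the corresponding response-side reads (the fields match what the new helpers in `llm.py` look at; `summarize_result` is a hypothetical name):

```python
def summarize_result(response) -> dict:
    # Plain text now comes from output_text instead of choices[0].message.content.
    text = response.output_text or ""

    # Tool calls are output items with type == "function_call".
    tool_calls = [
        {
            "id": getattr(item, "call_id", getattr(item, "id", "")),
            "name": item.name,
            "arguments": item.arguments,
        }
        for item in (response.output or [])
        if getattr(item, "type", None) == "function_call"
    ]

    # Usage fields are renamed: prompt/completion -> input/output.
    prompt_tokens = response.usage.input_tokens if response.usage else 0
    completion_tokens = response.usage.output_tokens if response.usage else 0

    return {
        "response": text,
        "tool_calls": tool_calls,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
    }
```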

## Compatibility

### SDK Version
1. **Required:** openai >= 1.66.0 (Responses API added in [v1.66.0](https://github.com/openai/openai-python/releases/tag/v1.66.0))
2. **AutoGPT uses:** ^1.97.1 (COMPATIBLE)

### API Compatibility
1. `llm_call()` function signature - UNCHANGED
2. `LLMResponse` class structure - UNCHANGED
3. Return type and fields - UNCHANGED

### Provider Impact
1. `openai` - YES, modified (Native OpenAI - uses Responses API)
2. `anthropic` - NO (Different SDK entirely)
3. `groq` - NO (Third-party API, Chat Completions compatible)
4. `open_router` - NO (Third-party API, Chat Completions compatible)
5. `llama_api` - NO (Third-party API, Chat Completions compatible)
6. `ollama` - NO (Uses ollama SDK)
7. `aiml_api` - NO (Third-party API, Chat Completions compatible)
8. `v0` - NO (Third-party API, Chat Completions compatible)

### Dependent Blocks Verified
1. `smart_decision_maker.py` (Line 508) - Uses: response, tool_calls, prompt_tokens, completion_tokens, reasoning - COMPATIBLE
2. `ai_condition.py` (Line 113) - Uses: response, prompt_tokens, completion_tokens, prompt - COMPATIBLE
3. `perplexity.py` - Does not use llm_call (uses different API) - NOT AFFECTED

### Streaming Service
`backend/server/v2/chat/service.py` is NOT affected - it uses OpenRouter by default, which requires the Chat Completions API format.

## Testing

### Test File Updates
1. Updated `test_llm.py` mocks to use `output_text` instead of `choices[0].message.content`
2. Updated mocks to use `output` array for tool calls
3. Updated mocks to use `usage.input_tokens` / `usage.output_tokens`

### Verification Performed
1. SDK version compatibility verified (1.97.1 > 1.66.0)
2. Function signature unchanged
3. LLMResponse class unchanged
4. All 7 other providers unchanged
5. Dependent blocks use only public API
6. Streaming service unaffected (uses OpenRouter)
7. Error handling matches Anthropic provider pattern
8. Tool call extraction handles `call_id` with fallback to `id`
9. Reasoning extraction handles both string and array `summary` formats

### Recommended Manual Testing
1. Test with GPT-4o model using native OpenAI API
2. Test with tool/function calling enabled
3. Test with JSON mode (`force_json_output=True`)
4. Verify token counting works correctly
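
A rough outline of such a manual check is below; everything beyond `credentials`, `llm_model`, `prompt`, `force_json_output`, and `max_tokens` is an assumption about `llm_call`'s signature (including the `LlmModel.GPT4O` member name and the remaining parameters having defaults), so adjust to the real API before running:

```python
# Hypothetical manual smoke test; assumed names are flagged in the comments.
import asyncio

import backend.blocks.llm as llm


async def manual_check() -> None:
    response = await llm.llm_call(
        credentials=llm.TEST_CREDENTIALS,   # swap in real OpenAI credentials
        llm_model=llm.LlmModel.GPT4O,       # assumed enum member name
        prompt=[
            {"role": "system", "content": "Reply in JSON."},
            {"role": "user", "content": 'Return {"ok": true}.'},
        ],
        force_json_output=True,             # exercises text={"format": {...}}
        max_tokens=256,
    )
    # Token counts should now be populated from usage.input_tokens / usage.output_tokens.
    print(response.response, response.prompt_tokens, response.completion_tokens)


asyncio.run(manual_check())
```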

## Files Modified

### 1. `autogpt_platform/backend/backend/blocks/llm.py`
1. Added `extract_responses_api_reasoning()` helper
2. Added `extract_responses_api_tool_calls()` helper
3. Updated OpenAI provider section to use `responses.create`
4. Added error handling with try/except
5. Extract system messages to `instructions` parameter

### 2. `autogpt_platform/backend/backend/blocks/test/test_llm.py`
1. Updated mocks for Responses API format

## References

1. [OpenAI Responses API Docs](https://platform.openai.com/docs/api-reference/responses)
2. [OpenAI Function Calling Docs](https://platform.openai.com/docs/guides/function-calling)
3. [OpenAI Reasoning Docs](https://platform.openai.com/docs/guides/reasoning)
4. [Simon Willison's Comparison](https://simonwillison.net/2025/Mar/11/responses-vs-chat-completions/)
5. [OpenAI Python SDK v1.66.0 Release](https://github.com/openai/openai-python/releases/tag/v1.66.0)

## Checklist

### Changes
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Updated unit test mocks to use Responses API format
- [x] Verified function signature unchanged
- [x] Verified LLMResponse class unchanged
- [x] Verified dependent blocks compatible
- [x] Verified other providers unchanged

### Code Quality
- [x] My code follows the project's style guidelines
- [x] I have commented my code where necessary
- [x] My changes generate no new warnings
- [x] I have added error handling matching existing patterns
126 changes: 105 additions & 21 deletions autogpt_platform/backend/backend/blocks/llm.py
@@ -362,8 +362,7 @@ def convert_openai_tool_fmt_to_anthropic(


def extract_openai_reasoning(response) -> str | None:
"""Extract reasoning from OpenAI-compatible response if available."""
"""Note: This will likely not working since the reasoning is not present in another Response API"""
"""Extract reasoning from OpenAI Chat Completions response if available."""
reasoning = None
choice = response.choices[0]
if hasattr(choice, "reasoning") and getattr(choice, "reasoning", None):
@@ -378,7 +377,7 @@ def extract_openai_reasoning(response) -> str | None:


def extract_openai_tool_calls(response) -> list[ToolContentBlock] | None:
"""Extract tool calls from OpenAI-compatible response."""
"""Extract tool calls from OpenAI Chat Completions response."""
if response.choices[0].message.tool_calls:
return [
ToolContentBlock(
@@ -394,9 +393,47 @@ def extract_openai_tool_calls(response) -> list[ToolContentBlock] | None:
return None


def extract_responses_api_reasoning(response) -> str | None:
"""Extract reasoning from OpenAI Responses API response if available.

The summary field can be either a string or an array of summary items,
so we handle both cases appropriately.
"""
# The Responses API stores reasoning in output items with type "reasoning"
if hasattr(response, "output") and response.output:
for item in response.output:
if hasattr(item, "type") and item.type == "reasoning":
if hasattr(item, "summary") and item.summary:
# Handle both string and array summary formats
if isinstance(item.summary, list):
# Join array items into a single string
return " ".join(str(s) for s in item.summary if s)
return str(item.summary)
return None


def extract_responses_api_tool_calls(response) -> list[ToolContentBlock] | None:
"""Extract tool calls from OpenAI Responses API response."""
tool_calls = []
if hasattr(response, "output") and response.output:
for item in response.output:
if hasattr(item, "type") and item.type == "function_call":
tool_calls.append(
ToolContentBlock(
id=getattr(item, "call_id", getattr(item, "id", "")),
type="function",
function=ToolCall(
name=item.name,
arguments=item.arguments,
),
)
)
return tool_calls if tool_calls else None


def get_parallel_tool_calls_param(
llm_model: LlmModel, parallel_tool_calls: bool | None
):
) -> bool | openai.NotGiven:
"""Get the appropriate parallel_tool_calls parameter for OpenAI-compatible APIs."""
if llm_model.startswith("o") or parallel_tool_calls is None:
return openai.NOT_GIVEN
@@ -454,34 +491,81 @@ async def llm_call(
if provider == "openai":

🟠 HIGH - God Function with 329 lines handling 8 different LLM providers
Agent: architecture

Category: quality

Description:
The llm_call function is a monolithic function with completely separate logic paths for 8 LLM providers. Each provider has its own client instantiation, request building, response parsing, error handling, and token counting. This violates the Single Responsibility Principle.

Suggestion:
Refactor using Strategy pattern: create an abstract LLMProvider base class with provider-specific implementations. Use a factory to instantiate the correct provider based on model metadata.

Why this matters: Framework coupling makes code harder to test and migrate.

Confidence: 75%
Rule: py_separate_business_logic_from_framework
Review ID: 856e49d7-82a0-43b7-8634-e05aaca4b5a6
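
A rough sketch of the suggested direction, assuming nothing about the final design (class, method, and registry names here are hypothetical):

```python
# Hypothetical Strategy-pattern sketch for the refactor suggested above.
from abc import ABC, abstractmethod
from typing import Any


class LLMProvider(ABC):
    """One strategy per provider; llm_call would delegate to these."""

    @abstractmethod
    async def call(self, prompt: list[dict], **kwargs: Any) -> Any:
        ...


class OpenAIResponsesProvider(LLMProvider):
    async def call(self, prompt: list[dict], **kwargs: Any) -> Any:
        # build responses_params, call client.responses.create, parse output
        ...


class AnthropicProvider(LLMProvider):
    async def call(self, prompt: list[dict], **kwargs: Any) -> Any:
        ...


PROVIDERS: dict[str, LLMProvider] = {
    "openai": OpenAIResponsesProvider(),
    "anthropic": AnthropicProvider(),
}


async def llm_call_refactored(provider: str, prompt: list[dict], **kwargs: Any) -> Any:
    # Factory lookup replaces the long if/elif provider chain.
    return await PROVIDERS[provider].call(prompt, **kwargs)
```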

tools_param = tools if tools else openai.NOT_GIVEN
oai_client = openai.AsyncOpenAI(api_key=credentials.api_key.get_secret_value())
response_format = None

parallel_tool_calls = get_parallel_tool_calls_param(
parallel_tool_calls_param = get_parallel_tool_calls_param(
llm_model, parallel_tool_calls
)

# Extract system messages for instructions parameter
system_messages = [p["content"] for p in prompt if p["role"] == "system"]
instructions = " ".join(system_messages) if system_messages else None

# Filter out system messages for input (Responses API expects them in instructions)
input_messages = [p for p in prompt if p["role"] != "system"]
Comment on lines +500 to +504

🟡 MEDIUM - Repeated list comprehension filters for system messages
Agent: performance

Category: performance

Description:
Lines 500 and 504 both iterate through the prompt list separately with list comprehensions.

Suggestion:
Combine into single loop to extract both system_messages and input_messages in one pass.

Confidence: 65%
Rule: perf_quadratic_loops
Review ID: 856e49d7-82a0-43b7-8634-e05aaca4b5a6
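
Sketched as a standalone helper (the `split_prompt` name is hypothetical; the variable names mirror the diff):

```python
def split_prompt(prompt: list[dict]) -> tuple[str | None, list[dict]]:
    """Single pass over `prompt` instead of two list comprehensions."""
    system_messages: list[str] = []
    input_messages: list[dict] = []
    for p in prompt:
        if p["role"] == "system":
            system_messages.append(p["content"])
        else:
            input_messages.append(p)
    instructions = " ".join(system_messages) if system_messages else None
    return instructions, input_messages
```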


# Build Responses API parameters
responses_params: dict = {
"model": llm_model.value,
"input": input_messages,
"max_output_tokens": max_tokens,
}

if instructions:
responses_params["instructions"] = instructions

if tools_param is not openai.NOT_GIVEN:
responses_params["tools"] = tools_param
if parallel_tool_calls_param is not openai.NOT_GIVEN:
responses_params["parallel_tool_calls"] = parallel_tool_calls_param

if force_json_output:
response_format = {"type": "json_object"}
responses_params["text"] = {"format": {"type": "json_object"}}

response = await oai_client.chat.completions.create(
model=llm_model.value,
messages=prompt, # type: ignore
response_format=response_format, # type: ignore
max_completion_tokens=max_tokens,
tools=tools_param, # type: ignore
parallel_tool_calls=parallel_tool_calls,
)
try:
response = await oai_client.responses.create(
**responses_params, timeout=600
)
except openai.APIError as e:
error_message = (
f"OpenAI Responses API error for model {llm_model.value}: {str(e)}"
)
logger.error(error_message)
raise ValueError(error_message) from e
except TimeoutError as e:
error_message = f"OpenAI Responses API timeout for model {llm_model.value}"
logger.error(error_message)
raise ValueError(error_message) from e

tool_calls = extract_openai_tool_calls(response)
reasoning = extract_openai_reasoning(response)
tool_calls = extract_responses_api_tool_calls(response)
reasoning = extract_responses_api_reasoning(response)

# Build a message dict for raw_response that matches the expected format
# for conversation history (role, content, and optionally tool_calls)
raw_response_dict: dict = {
"role": "assistant",
"content": response.output_text or "",
}
# Add tool_calls in OpenAI format if present
if tool_calls:
raw_response_dict["tool_calls"] = [
{
"id": tc.id,
"type": "function",
"function": {
"name": tc.function.name,
"arguments": tc.function.arguments,
},
}
for tc in tool_calls
]

return LLMResponse(
raw_response=response.choices[0].message,
raw_response=raw_response_dict,
prompt=prompt,
response=response.choices[0].message.content or "",
response=response.output_text or "",
tool_calls=tool_calls,
prompt_tokens=response.usage.prompt_tokens if response.usage else 0,
completion_tokens=response.usage.completion_tokens if response.usage else 0,
prompt_tokens=response.usage.input_tokens if response.usage else 0,
completion_tokens=response.usage.output_tokens if response.usage else 0,
reasoning=reasoning,
)
elif provider == "anthropic":
46 changes: 12 additions & 34 deletions autogpt_platform/backend/backend/blocks/test/test_llm.py
@@ -13,18 +13,17 @@ async def test_llm_call_returns_token_counts(self):
"""Test that llm_call returns proper token counts in LLMResponse."""
import backend.blocks.llm as llm

# Mock the OpenAI client
# Mock the OpenAI Responses API response
mock_response = MagicMock()
mock_response.choices = [
MagicMock(message=MagicMock(content="Test response", tool_calls=None))
]
mock_response.usage = MagicMock(prompt_tokens=10, completion_tokens=20)
mock_response.output_text = "Test response"
mock_response.output = [] # No tool calls
mock_response.usage = MagicMock(input_tokens=10, output_tokens=20)

# Test with mocked OpenAI response
# Test with mocked OpenAI Responses API
with patch("openai.AsyncOpenAI") as mock_openai:
mock_client = AsyncMock()
mock_openai.return_value = mock_client
mock_client.chat.completions.create = AsyncMock(return_value=mock_response)
mock_client.responses.create = AsyncMock(return_value=mock_response)

response = await llm.llm_call(
credentials=llm.TEST_CREDENTIALS,
@@ -41,8 +40,6 @@ async def test_llm_call_returns_token_counts(self):
@pytest.mark.asyncio
async def test_ai_structured_response_block_tracks_stats(self):
"""Test that AIStructuredResponseGeneratorBlock correctly tracks stats."""
from unittest.mock import patch

import backend.blocks.llm as llm

block = llm.AIStructuredResponseGeneratorBlock()
@@ -255,13 +252,11 @@ async def mock_llm_call(input_data, credentials):
@pytest.mark.asyncio
async def test_ai_text_summarizer_real_llm_call_stats(self):
"""Test AITextSummarizer with real LLM call mocking to verify llm_call_count."""
from unittest.mock import AsyncMock, MagicMock, patch

import backend.blocks.llm as llm

🟡 MEDIUM - Misleading test name suggests live network calls but uses mocks
Agent: microservices

Category: quality

Description:
The test is named 'test_ai_text_summarizer_real_llm_call_stats' which suggests it makes real LLM calls. However, it uses mocks throughout. This naming creates ambiguity about test isolation.

Suggestion:
Rename to 'test_ai_text_summarizer_mocked_llm_call_stats' or 'test_ai_text_summarizer_llm_call_stats' to accurately reflect it uses mocks.

Why this matters: Live I/O introduces slowness, nondeterminism, and external failures unrelated to the code.

Confidence: 65%
Rule: gen_no_live_io_in_unit_tests
Review ID: 856e49d7-82a0-43b7-8634-e05aaca4b5a6

block = llm.AITextSummarizerBlock()

# Mock the actual LLM call instead of the llm_call method
# Mock the actual LLM call using Responses API format
call_count = 0

async def mock_create(*args, **kwargs):
@@ -271,30 +266,17 @@ async def mock_create(*args, **kwargs):
mock_response = MagicMock()
# Return different responses for chunk summary vs final summary
if call_count == 1:
mock_response.choices = [
MagicMock(
message=MagicMock(
content='<json_output id="test123456">{"summary": "Test chunk summary"}</json_output>',
tool_calls=None,
)
)
]
mock_response.output_text = '<json_output id="test123456">{"summary": "Test chunk summary"}</json_output>'
else:
mock_response.choices = [
MagicMock(
message=MagicMock(
content='<json_output id="test123456">{"final_summary": "Test final summary"}</json_output>',
tool_calls=None,
)
)
]
mock_response.usage = MagicMock(prompt_tokens=50, completion_tokens=30)
mock_response.output_text = '<json_output id="test123456">{"final_summary": "Test final summary"}</json_output>'
mock_response.output = [] # No tool calls
mock_response.usage = MagicMock(input_tokens=50, output_tokens=30)
return mock_response

with patch("openai.AsyncOpenAI") as mock_openai:
mock_client = AsyncMock()
mock_openai.return_value = mock_client
mock_client.chat.completions.create = mock_create
mock_client.responses.create = mock_create

🟠 HIGH - Inconsistent mock setup for async HTTP calls
Agent: microservices

Category: bug

Description:
In test_ai_text_summarizer_real_llm_call_stats, the async function mock_create is assigned directly to mock_client.responses.create without wrapping in AsyncMock. This is inconsistent with line 26 which uses AsyncMock(return_value=mock_response).

Suggestion:
Change line 283 from 'mock_client.responses.create = mock_create' to 'mock_client.responses.create = AsyncMock(side_effect=mock_create)' for consistent mocking behavior and proper call tracking.

Why this matters: Live I/O introduces slowness, nondeterminism, and external failures unrelated to the code.

Confidence: 70%
Rule: gen_no_live_io_in_unit_tests
Review ID: 856e49d7-82a0-43b7-8634-e05aaca4b5a6
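
A self-contained sketch of the suggested wrapping (the surrounding test structure is simplified for illustration only):

```python
# Minimal illustration of AsyncMock(side_effect=...) wrapping an async factory.
import asyncio
from unittest.mock import AsyncMock, MagicMock


async def demo() -> None:
    async def mock_create(*args, **kwargs):
        return MagicMock(
            output_text="ok",
            output=[],
            usage=MagicMock(input_tokens=1, output_tokens=1),
        )

    mock_client = MagicMock()
    # Wrapping keeps call tracking (await_count, assert_awaited) consistent
    # with the AsyncMock(return_value=...) style used in the other test.
    mock_client.responses.create = AsyncMock(side_effect=mock_create)

    response = await mock_client.responses.create()
    assert response.output_text == "ok"
    assert mock_client.responses.create.await_count == 1


asyncio.run(demo())
```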


# Test with very short text (should only need 1 chunk + 1 final summary)
input_data = llm.AITextSummarizerBlock.Input(

🟡 MEDIUM - Debug print statements in test code
Agent: python

Category: quality

Description:
Debug print() statements left in test function. Should be removed or replaced with logging.

Suggestion:
Replace with logger.debug() calls or use pytest's caplog fixture, or remove if no longer needed.

Confidence: 85%
Rule: python_print_debug
Review ID: 856e49d7-82a0-43b7-8634-e05aaca4b5a6

@@ -312,10 +294,6 @@ async def mock_create(*args, **kwargs):
):
outputs[output_name] = output_data

print(f"Actual calls made: {call_count}")
print(f"Block stats: {block.execution_stats}")
print(f"LLM call count: {block.execution_stats.llm_call_count}")

# Should have made 2 calls: 1 for chunk summary + 1 for final summary
assert block.execution_stats.llm_call_count >= 1
assert block.execution_stats.input_token_count > 0