18 changes: 10 additions & 8 deletions src/api/models/bedrock.py
@@ -774,8 +774,14 @@ def _parse_request(self, chat_request: ChatRequest) -> dict:
         system_prompts = self._parse_system_prompts(chat_request)
 
         # Base inference parameters.
+        # Prefer max_completion_tokens (OpenAI newer field) over max_tokens (legacy).
+        effective_max_tokens = (
+            chat_request.max_completion_tokens
+            if chat_request.max_completion_tokens is not None
+            else chat_request.max_tokens
+        )
         inference_config = {
-            "maxTokens": chat_request.max_tokens,
+            "maxTokens": effective_max_tokens,
         }
Comment on lines +777 to 785

The refactoring to consolidate effective_max_tokens is a nice improvement. A couple of things to note:

1. Behavior change: falsy check → is not None check

The old code in the Claude reasoning block used if chat_request.max_completion_tokens (falsy — treats 0 as False). The new code uses is not None (treats 0 as a valid value). This is more correct, but it's a subtle behavior change that could surface as a regression if any client sends max_completion_tokens: 0.
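The distinction can be illustrated with a standalone sketch (hypothetical helper names, not project code):

```python
# Standalone illustration of the falsy-vs-None distinction described above.
def pick_old(max_completion_tokens, max_tokens):
    # Old Claude-block logic: a truthiness check silently drops a legitimate 0.
    return max_completion_tokens if max_completion_tokens else max_tokens

def pick_new(max_completion_tokens, max_tokens):
    # New logic: only None falls through to max_tokens.
    return max_completion_tokens if max_completion_tokens is not None else max_tokens

print(pick_old(0, 2048))   # 2048 -- the explicit 0 is silently ignored
print(pick_new(0, 2048))   # 0 -- the explicit 0 wins
print(pick_new(None, 2048))  # 2048 -- both behave the same when the field is absent
```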

2. effective_max_tokens can be None

If max_tokens defaults to None (as suggested above) and the client doesn't send max_completion_tokens either, effective_max_tokens will be None. In that case, we should omit maxTokens from inference_config entirely so Bedrock uses the model default, rather than sending maxTokens: None which would cause a ParamValidationError:

effective_max_tokens = (
    chat_request.max_completion_tokens
    if chat_request.max_completion_tokens is not None
    else chat_request.max_tokens
)
inference_config = {}
if effective_max_tokens is not None:
    inference_config["maxTokens"] = effective_max_tokens
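The suggested fallback can be sanity-checked with a standalone sketch (hypothetical function name; the real logic lives inside _parse_request):

```python
# Hypothetical standalone version of the suggested fallback logic.
def build_inference_config(max_completion_tokens, max_tokens):
    effective = (
        max_completion_tokens
        if max_completion_tokens is not None
        else max_tokens
    )
    config = {}
    # Omit maxTokens entirely when neither field was sent,
    # so Bedrock applies the model's own default.
    if effective is not None:
        config["maxTokens"] = effective
    return config

print(build_inference_config(None, None))   # {}
print(build_inference_config(None, 2048))   # {'maxTokens': 2048}
print(build_inference_config(1024, 2048))   # {'maxTokens': 1024}
```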


# Only include optional parameters when specified
@@ -818,15 +824,11 @@ def _parse_request(self, chat_request: ChatRequest) -> dict:
 
         if "anthropic.claude" in model_lower:
             # Claude format: reasoning_config = object with budget_tokens
-            max_tokens = (
-                chat_request.max_completion_tokens
-                if chat_request.max_completion_tokens
-                else chat_request.max_tokens
-            )
+            # effective_max_tokens already prefers max_completion_tokens over max_tokens
             budget_tokens = self._calc_budget_tokens(
-                max_tokens, chat_request.reasoning_effort
+                effective_max_tokens, chat_request.reasoning_effort
             )
-            inference_config["maxTokens"] = max_tokens
+            inference_config["maxTokens"] = effective_max_tokens
             # unset topP - Not supported
Comment on lines 830 to 832

Minor: inference_config["maxTokens"] = effective_max_tokens on line 832 is now redundant — it was already set to the same value on line 785. In the old code this re-assignment was needed because a different local max_tokens was computed here, but after your refactoring both use effective_max_tokens. Could be removed for clarity.

inference_config.pop("topP", None)

4 changes: 2 additions & 2 deletions src/api/schema.py
@@ -3,7 +3,7 @@
 
 from pydantic import BaseModel, Field
 
-from api.setting import DEFAULT_MODEL
+from api.setting import DEFAULT_MAX_TOKENS, DEFAULT_MODEL
 
 
 class Model(BaseModel):
@@ -106,7 +106,7 @@ class ChatRequest(BaseModel):
     temperature: float | None = Field(default=None, le=2.0, ge=0.0)
     top_p: float | None = Field(default=None, le=1.0, ge=0.0)
     user: str | None = None # Not used
-    max_tokens: int | None = 2048
+    max_tokens: int | None = DEFAULT_MAX_TOKENS

Related to the comment on setting.py — since maxTokens is optional in the Bedrock Converse API (defaults to model max), and max_tokens is also optional in the OpenAI API, the default here should arguably be None rather than DEFAULT_MAX_TOKENS (2048).

Also, other numeric fields in this model use Field() with validation constraints (e.g., temperature, top_p). Consider adding the same for consistency:

max_tokens: int | None = Field(default=None, ge=1)
max_completion_tokens: int | None = Field(default=None, ge=1)

This would give users clear Pydantic validation errors for invalid values (e.g., max_tokens: 0 or max_tokens: -1) instead of opaque Bedrock API errors.

max_completion_tokens: int | None = None
reasoning_effort: Literal["low", "medium", "high"] | None = None
n: int | None = 1 # Not used
1 change: 1 addition & 0 deletions src/api/setting.py
@@ -11,6 +11,7 @@
 
 DEBUG = os.environ.get("DEBUG", "false").lower() != "false"
 AWS_REGION = os.environ.get("AWS_REGION", "us-west-2")
+DEFAULT_MAX_TOKENS = int(os.environ.get("DEFAULT_MAX_TOKENS", "2048"))

A couple of concerns here:

1. Should the default really be 2048?

Per the Bedrock InferenceConfiguration docs, maxTokens is optional — when omitted, it defaults to the maximum allowed value for the model. The OpenAI API also treats max_tokens as optional (defaults to model max).

Hardcoding 2048 artificially caps every model's output. For example, Claude Sonnet 4's max output is 8192 tokens and Opus 4 supports up to 32768. Users who don't explicitly set max_tokens would get truncated responses without realizing why.

A better default would be None, which lets Bedrock use the model's native max — consistent with both the OpenAI API and the Bedrock API:

# Let Bedrock use the model's native max output tokens when not specified
max_tokens: int | None = None

Then in _parse_request, only include maxTokens in inference_config when it's not None.

2. Unguarded int() conversion

If a user sets DEFAULT_MAX_TOKENS=abc or DEFAULT_MAX_TOKENS= (empty string), the application will crash at import time with an unhelpful ValueError. This is the only int() cast in the settings module. Consider wrapping with try/except and validating >= 1.
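One way to guard the conversion, sketched as a hypothetical helper (the name and None-means-unset convention are assumptions, not project code):

```python
import os

# Hypothetical guarded replacement for the bare int() cast in setting.py.
def read_default_max_tokens(environ=os.environ):
    raw = environ.get("DEFAULT_MAX_TOKENS", "").strip()
    if not raw:
        # Unset or empty: return None so callers can fall back to the model default.
        return None
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(
            f"DEFAULT_MAX_TOKENS must be an integer >= 1, got {raw!r}"
        ) from None
    if value < 1:
        raise ValueError(f"DEFAULT_MAX_TOKENS must be >= 1, got {value}")
    return value

print(read_default_max_tokens({"DEFAULT_MAX_TOKENS": "2048"}))  # 2048
print(read_default_max_tokens({}))                              # None
```

This fails at startup with an actionable message instead of an opaque traceback from an import-time `int()` call.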

DEFAULT_MODEL = os.environ.get("DEFAULT_MODEL", "anthropic.claude-3-sonnet-20240229-v1:0")
DEFAULT_EMBEDDING_MODEL = os.environ.get("DEFAULT_EMBEDDING_MODEL", "cohere.embed-multilingual-v3")
ENABLE_CROSS_REGION_INFERENCE = os.environ.get("ENABLE_CROSS_REGION_INFERENCE", "true").lower() != "false"