Conversation
Add `DEFAULT_MAX_TOKENS` environment variable (default: 2048, preserving existing behaviour) so operators can tune the fallback `max_tokens` without code changes.

Changes:
- `setting.py`: Add `DEFAULT_MAX_TOKENS` env var
- `schema.py`: Import `DEFAULT_MAX_TOKENS`; use it as the `ChatRequest.max_tokens` default
- `bedrock.py`: Compute `effective_max_tokens`, preferring `max_completion_tokens` (the newer OpenAI field) over `max_tokens` (legacy), and use it consistently in `inference_config` and the reasoning `budget_tokens` calculation

The `effective_max_tokens` logic ensures that clients sending `max_completion_tokens` (the newer OpenAI field) are handled correctly, while the env var gives operators control over the default when neither field is specified by the client.
The comment already explains the intent (prefer `max_completion_tokens` over `max_tokens`). Naming a specific client adds no value and will become stale.
**zxkane** left a comment:
Thanks for the PR! The refactoring to consolidate effective_max_tokens and eliminate duplicate logic is a nice improvement.
The main concern is around the default value for max_tokens. Per the Bedrock InferenceConfiguration docs, maxTokens is optional and defaults to the model's maximum when omitted. The OpenAI API also treats max_tokens as optional. Hardcoding 2048 artificially caps every model's output — for example, Claude Sonnet 4 supports 8192 and Opus 4 supports 32768 output tokens.
Rather than making the hardcoded 2048 configurable, the better fix would be to change the default to None and only include maxTokens in the Bedrock request when explicitly set by the client. This aligns with both the OpenAI API and Bedrock API behavior.
See inline comments for details.
```diff
  DEBUG = os.environ.get("DEBUG", "false").lower() != "false"
  AWS_REGION = os.environ.get("AWS_REGION", "us-west-2")
+ DEFAULT_MAX_TOKENS = int(os.environ.get("DEFAULT_MAX_TOKENS", "2048"))
```
A couple of concerns here:
1. Should the default really be 2048?
Per the Bedrock InferenceConfiguration docs, maxTokens is optional — when omitted, it defaults to the maximum allowed value for the model. The OpenAI API also treats max_tokens as optional (defaults to model max).
Hardcoding 2048 artificially caps every model's output. For example, Claude Sonnet 4's max output is 8192 tokens and Opus 4 supports up to 32768. Users who don't explicitly set max_tokens would get truncated responses without realizing why.
A better default would be None, which lets Bedrock use the model's native max — consistent with both the OpenAI API and the Bedrock API:
```python
# Let Bedrock use the model's native max output tokens when not specified
max_tokens: int | None = None
```

Then in `_parse_request`, only include `maxTokens` in `inference_config` when it's not `None`.
2. Unguarded int() conversion
If a user sets DEFAULT_MAX_TOKENS=abc or DEFAULT_MAX_TOKENS= (empty string), the application will crash at import time with an unhelpful ValueError. This is the only int() cast in the settings module. Consider wrapping with try/except and validating >= 1.
```diff
  top_p: float | None = Field(default=None, le=1.0, ge=0.0)
  user: str | None = None  # Not used
- max_tokens: int | None = 2048
+ max_tokens: int | None = DEFAULT_MAX_TOKENS
```
Related to the comment on setting.py — since maxTokens is optional in the Bedrock Converse API (defaults to model max), and max_tokens is also optional in the OpenAI API, the default here should arguably be None rather than DEFAULT_MAX_TOKENS (2048).
Also, other numeric fields in this model use Field() with validation constraints (e.g., temperature, top_p). Consider adding the same for consistency:
```python
max_tokens: int | None = Field(default=None, ge=1)
max_completion_tokens: int | None = Field(default=None, ge=1)
```

This would give users clear Pydantic validation errors for invalid values (e.g., `max_tokens: 0` or `max_tokens: -1`) instead of opaque Bedrock API errors.
```diff
+ # Prefer max_completion_tokens (OpenAI newer field) over max_tokens (legacy).
+ effective_max_tokens = (
+     chat_request.max_completion_tokens
+     if chat_request.max_completion_tokens is not None
+     else chat_request.max_tokens
+ )
  inference_config = {
-     "maxTokens": chat_request.max_tokens,
+     "maxTokens": effective_max_tokens,
  }
```
The refactoring to consolidate effective_max_tokens is a nice improvement. A couple of things to note:
1. Behavior change: falsy check → is not None check
The old code in the Claude reasoning block used if chat_request.max_completion_tokens (falsy — treats 0 as False). The new code uses is not None (treats 0 as a valid value). This is more correct, but it's a subtle behavior change that could surface as a regression if any client sends max_completion_tokens: 0.
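The difference is easy to see with hypothetical standalone values standing in for the request fields:

```python
# Hypothetical client input: max_completion_tokens explicitly set to 0,
# with a non-None max_tokens fallback also present.
max_completion_tokens = 0
max_tokens = 2048

# Old behavior: a plain truthiness check treats 0 as "unset",
# so the max_tokens fallback wins.
old_effective = max_completion_tokens if max_completion_tokens else max_tokens

# New behavior: `is not None` treats 0 as an explicit value
# and passes it through (ideally to be rejected by validation).
new_effective = (
    max_completion_tokens
    if max_completion_tokens is not None
    else max_tokens
)

print((old_effective, new_effective))  # (2048, 0)
```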
2. effective_max_tokens can be None
If max_tokens defaults to None (as suggested above) and the client doesn't send max_completion_tokens either, effective_max_tokens will be None. In that case, we should omit maxTokens from inference_config entirely so Bedrock uses the model default, rather than sending maxTokens: None which would cause a ParamValidationError:
```python
effective_max_tokens = (
    chat_request.max_completion_tokens
    if chat_request.max_completion_tokens is not None
    else chat_request.max_tokens
)
inference_config = {}
if effective_max_tokens is not None:
    inference_config["maxTokens"] = effective_max_tokens
```

```diff
  )
- inference_config["maxTokens"] = max_tokens
+ inference_config["maxTokens"] = effective_max_tokens
  # unset topP - Not supported
```
Minor: inference_config["maxTokens"] = effective_max_tokens on line 832 is now redundant — it was already set to the same value on line 785. In the old code this re-assignment was needed because a different local max_tokens was computed here, but after your refactoring both use effective_max_tokens. Could be removed for clarity.
*Issue #227*

Description of changes:

- Add `DEFAULT_MAX_TOKENS` environment variable in `setting.py` (default: `2048` to preserve existing behaviour) and use it as the `max_tokens` default
- Consolidate `effective_max_tokens` in `_parse_request`, which prefers `max_completion_tokens` over `max_tokens`, eliminating duplicate logic

Files: `src/api/setting.py`, `src/api/schema.py`, `src/api/models/bedrock.py`

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.