Feature/configurable max tokens#228

Open
dcox761 wants to merge 2 commits into aws-samples:main from bluecrystalsolutions:feature/configurable-max-tokens

Conversation

@dcox761 dcox761 commented Mar 3, 2026

*Issue #227*

Description of changes:

  1. Add DEFAULT_MAX_TOKENS environment variable in setting.py (default: 2048 to preserve existing behaviour)
  2. Use it as the schema default for max_tokens
  3. Compute effective_max_tokens in _parse_request, preferring max_completion_tokens over max_tokens and eliminating duplicate logic:
```python
effective_max_tokens = (
    chat_request.max_completion_tokens
    if chat_request.max_completion_tokens is not None
    else chat_request.max_tokens
)
inference_config = {"maxTokens": effective_max_tokens}
```

```shell
# Example: raise default for reasoning models
export DEFAULT_MAX_TOKENS=16384
```

Files: src/api/setting.py, src/api/schema.py, src/api/models/bedrock.py
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

bcsdcox added 2 commits March 3, 2026 14:10
… var

Add DEFAULT_MAX_TOKENS environment variable (default: 2048, preserving
existing behaviour) so operators can tune the fallback max_tokens without
code changes.

Changes:
- setting.py: Add DEFAULT_MAX_TOKENS env var
- schema.py: Import DEFAULT_MAX_TOKENS; use it as ChatRequest.max_tokens default
- bedrock.py: Compute effective_max_tokens preferring max_completion_tokens
  (OpenAI newer field) over max_tokens (legacy), and use it consistently
  in inference_config and reasoning budget_tokens calculation

The effective_max_tokens logic ensures that clients sending
max_completion_tokens (the newer OpenAI field) are handled correctly,
while the env var gives operators control over the default when neither
field is specified by the client.
The comment already explains the intent (prefer max_completion_tokens
over max_tokens). Naming a specific client adds no value and will
become stale.
Member

@zxkane zxkane left a comment

Thanks for the PR! The refactoring to consolidate effective_max_tokens and eliminate duplicate logic is a nice improvement.

The main concern is around the default value for max_tokens. Per the Bedrock InferenceConfiguration docs, maxTokens is optional and defaults to the model's maximum when omitted. The OpenAI API also treats max_tokens as optional. Hardcoding 2048 artificially caps every model's output — for example, Claude Sonnet 4 supports 8192 and Opus 4 supports 32768 output tokens.

Rather than making the hardcoded 2048 configurable, the better fix would be to change the default to None and only include maxTokens in the Bedrock request when explicitly set by the client. This aligns with both the OpenAI API and Bedrock API behavior.

See inline comments for details.


```python
DEBUG = os.environ.get("DEBUG", "false").lower() != "false"
AWS_REGION = os.environ.get("AWS_REGION", "us-west-2")
DEFAULT_MAX_TOKENS = int(os.environ.get("DEFAULT_MAX_TOKENS", "2048"))
```
Member

A couple of concerns here:

1. Should the default really be 2048?

Per the Bedrock InferenceConfiguration docs, maxTokens is optional — when omitted, it defaults to the maximum allowed value for the model. The OpenAI API also treats max_tokens as optional (defaults to model max).

Hardcoding 2048 artificially caps every model's output. For example, Claude Sonnet 4's max output is 8192 tokens and Opus 4 supports up to 32768. Users who don't explicitly set max_tokens would get truncated responses without realizing why.

A better default would be None, which lets Bedrock use the model's native max — consistent with both the OpenAI API and the Bedrock API:

```python
# Let Bedrock use the model's native max output tokens when not specified
max_tokens: int | None = None
```

Then in _parse_request, only include maxTokens in inference_config when it's not None.

2. Unguarded int() conversion

If a user sets DEFAULT_MAX_TOKENS=abc or DEFAULT_MAX_TOKENS= (empty string), the application will crash at import time with an unhelpful ValueError. This is the only int() cast in the settings module. Consider wrapping with try/except and validating >= 1.

```diff
 top_p: float | None = Field(default=None, le=1.0, ge=0.0)
 user: str | None = None  # Not used
-max_tokens: int | None = 2048
+max_tokens: int | None = DEFAULT_MAX_TOKENS
```
Member

Related to the comment on setting.py — since maxTokens is optional in the Bedrock Converse API (defaults to model max), and max_tokens is also optional in the OpenAI API, the default here should arguably be None rather than DEFAULT_MAX_TOKENS (2048).

Also, other numeric fields in this model use Field() with validation constraints (e.g., temperature, top_p). Consider adding the same for consistency:

```python
max_tokens: int | None = Field(default=None, ge=1)
max_completion_tokens: int | None = Field(default=None, ge=1)
```

This would give users clear Pydantic validation errors for invalid values (e.g., max_tokens: 0 or max_tokens: -1) instead of opaque Bedrock API errors.

Comment on lines +777 to 785
```diff
 # Prefer max_completion_tokens (OpenAI newer field) over max_tokens (legacy).
 effective_max_tokens = (
     chat_request.max_completion_tokens
     if chat_request.max_completion_tokens is not None
     else chat_request.max_tokens
 )
 inference_config = {
-    "maxTokens": chat_request.max_tokens,
+    "maxTokens": effective_max_tokens,
 }
```
Member

The refactoring to consolidate effective_max_tokens is a nice improvement. A couple of things to note:

1. Behavior change: falsy check → is not None check

The old code in the Claude reasoning block used if chat_request.max_completion_tokens (falsy — treats 0 as False). The new code uses is not None (treats 0 as a valid value). This is more correct, but it's a subtle behavior change that could surface as a regression if any client sends max_completion_tokens: 0.
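A tiny sketch of the difference (the `old_pick`/`new_pick` helpers are illustrative only):

```python
# Old Claude-reasoning-block style: falsy check, so 0 falls through to max_tokens.
def old_pick(max_completion_tokens, max_tokens):
    return max_completion_tokens if max_completion_tokens else max_tokens

# New consolidated logic: only None falls through, so 0 is kept as an explicit value.
def new_pick(max_completion_tokens, max_tokens):
    return max_completion_tokens if max_completion_tokens is not None else max_tokens

assert old_pick(0, 2048) == 2048  # 0 silently ignored
assert new_pick(0, 2048) == 0     # 0 now forwarded to Bedrock
```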

2. effective_max_tokens can be None

If max_tokens defaults to None (as suggested above) and the client doesn't send max_completion_tokens either, effective_max_tokens will be None. In that case, we should omit maxTokens from inference_config entirely so Bedrock uses the model default, rather than sending maxTokens: None which would cause a ParamValidationError:

```python
effective_max_tokens = (
    chat_request.max_completion_tokens
    if chat_request.max_completion_tokens is not None
    else chat_request.max_tokens
)
inference_config = {}
if effective_max_tokens is not None:
    inference_config["maxTokens"] = effective_max_tokens
```

Comment on lines 830 to 832
```diff
 )
-inference_config["maxTokens"] = max_tokens
+inference_config["maxTokens"] = effective_max_tokens
 # unset topP - Not supported
```
Member

Minor: inference_config["maxTokens"] = effective_max_tokens on line 832 is now redundant — it was already set to the same value on line 785. In the old code this re-assignment was needed because a different local max_tokens was computed here, but after your refactoring both use effective_max_tokens. Could be removed for clarity.
