# Feature/configurable max tokens #228
Base branch: `main`. Changes from all commits.
Changes to `_parse_request`:

```diff
@@ -774,8 +774,14 @@ def _parse_request(self, chat_request: ChatRequest) -> dict:
         system_prompts = self._parse_system_prompts(chat_request)

         # Base inference parameters.
+        # Prefer max_completion_tokens (OpenAI newer field) over max_tokens (legacy).
+        effective_max_tokens = (
+            chat_request.max_completion_tokens
+            if chat_request.max_completion_tokens is not None
+            else chat_request.max_tokens
+        )
         inference_config = {
-            "maxTokens": chat_request.max_tokens,
+            "maxTokens": effective_max_tokens,
         }

         # Only include optional parameters when specified
```
```diff
@@ -818,15 +824,11 @@ def _parse_request(self, chat_request: ChatRequest) -> dict:
         if "anthropic.claude" in model_lower:
             # Claude format: reasoning_config = object with budget_tokens
-            max_tokens = (
-                chat_request.max_completion_tokens
-                if chat_request.max_completion_tokens
-                else chat_request.max_tokens
-            )
+            # effective_max_tokens already prefers max_completion_tokens over max_tokens
             budget_tokens = self._calc_budget_tokens(
-                max_tokens, chat_request.reasoning_effort
+                effective_max_tokens, chat_request.reasoning_effort
             )
-            inference_config["maxTokens"] = max_tokens
+            inference_config["maxTokens"] = effective_max_tokens
             # unset topP - Not supported
             inference_config.pop("topP", None)
```

> **Member** commented on lines 830 to 832:
>
> Minor: …
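The diff passes `effective_max_tokens` into `_calc_budget_tokens` together with `reasoning_effort`, but the body of that helper is not shown here. The sketch below is a hypothetical illustration of an effort-to-budget mapping; the function name mirrors the diff, but the fractions are assumptions, not the gateway's actual values.

```python
def calc_budget_tokens(max_tokens: int, reasoning_effort: str) -> int:
    """Hypothetical sketch: allocate a fraction of the output token budget
    to Claude's extended thinking, scaled by the requested effort level.
    The fractions below are illustrative assumptions only."""
    fractions = {"low": 0.25, "medium": 0.5, "high": 0.75}
    return int(max_tokens * fractions[reasoning_effort])
```

Under these assumed fractions, `calc_budget_tokens(4096, "medium")` would return `2048`.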
Changes to the request schema:

```diff
@@ -3,7 +3,7 @@
 from pydantic import BaseModel, Field

-from api.setting import DEFAULT_MODEL
+from api.setting import DEFAULT_MAX_TOKENS, DEFAULT_MODEL


 class Model(BaseModel):
@@ -106,7 +106,7 @@ class ChatRequest(BaseModel):
     temperature: float | None = Field(default=None, le=2.0, ge=0.0)
     top_p: float | None = Field(default=None, le=1.0, ge=0.0)
     user: str | None = None  # Not used
-    max_tokens: int | None = 2048
+    max_tokens: int | None = DEFAULT_MAX_TOKENS
     max_completion_tokens: int | None = None
     reasoning_effort: Literal["low", "medium", "high"] | None = None
     n: int | None = 1  # Not used
```

> **Member** commented on the `max_tokens` change:
>
> Related to the comment on … Also, other numeric fields in this model use `Field` constraints; these two could as well:
>
> ```python
> max_tokens: int | None = Field(default=None, ge=1)
> max_completion_tokens: int | None = Field(default=None, ge=1)
> ```
>
> This would give users clear Pydantic validation errors for invalid values (e.g., …).
Changes to the settings module:

```diff
@@ -11,6 +11,7 @@
 DEBUG = os.environ.get("DEBUG", "false").lower() != "false"
 AWS_REGION = os.environ.get("AWS_REGION", "us-west-2")
+DEFAULT_MAX_TOKENS = int(os.environ.get("DEFAULT_MAX_TOKENS", "2048"))
 DEFAULT_MODEL = os.environ.get("DEFAULT_MODEL", "anthropic.claude-3-sonnet-20240229-v1:0")
 DEFAULT_EMBEDDING_MODEL = os.environ.get("DEFAULT_EMBEDDING_MODEL", "cohere.embed-multilingual-v3")
 ENABLE_CROSS_REGION_INFERENCE = os.environ.get("ENABLE_CROSS_REGION_INFERENCE", "true").lower() != "false"
```

> **Member** commented on the `DEFAULT_MAX_TOKENS` line:
>
> A couple of concerns here:
>
> 1. Should the default really be 2048? Per the Bedrock `InferenceConfiguration` docs, … Hardcoding … A better default would be:
>
>    ```python
>    # Let Bedrock use the model's native max output tokens when not specified
>    max_tokens: int | None = None
>    ```
>
>    Then in …
>
> 2. Unguarded … If a user sets …
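The reviewer's second concern appears to be the unguarded `int(...)` cast on the environment variable, which would crash at import time on a non-numeric value. One way to address it is a guarded parse along these lines; this is a sketch, and the helper name is an assumption, not the PR's code.

```python
import os


def read_default_max_tokens(env=None):
    """Guarded parse of DEFAULT_MAX_TOKENS: unset or empty means None
    (defer to the model's native default); a non-integer or a value
    below 1 fails loudly with an actionable message."""
    env = os.environ if env is None else env
    raw = env.get("DEFAULT_MAX_TOKENS")
    if raw is None or raw.strip() == "":
        return None
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"DEFAULT_MAX_TOKENS must be an integer, got {raw!r}")
    if value < 1:
        raise ValueError(f"DEFAULT_MAX_TOKENS must be >= 1, got {value}")
    return value
```

This keeps the configurable-default behavior of the PR while turning a bad value into a clear startup error instead of an unhandled `ValueError` traceback.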
> **Member** commented:
>
> The refactoring to consolidate `effective_max_tokens` is a nice improvement. A couple of things to note:
>
> 1. **Behavior change: falsy check → `is not None` check.** The old code in the Claude reasoning block used `if chat_request.max_completion_tokens` (a falsy check, which treats `0` as False). The new code uses `is not None` (which treats `0` as a valid value). This is more correct, but it's a subtle behavior change that could surface as a regression if any client sends `max_completion_tokens: 0`.
>
> 2. **`effective_max_tokens` can be `None`.** If `max_tokens` defaults to `None` (as suggested above) and the client doesn't send `max_completion_tokens` either, `effective_max_tokens` will be `None`. In that case, we should omit `maxTokens` from `inference_config` entirely so Bedrock uses the model default, rather than sending `maxTokens: None`, which would cause a `ParamValidationError`.