[APIServer] Dynamic default values for workers and max-concurrency based on platform #7497
Conversation
… platform

Change workers and max-concurrency defaults from hardcoded values (1 and 512) to None, then resolve them dynamically:
- NVIDIA GPU (CUDA): workers = ceil(max_num_seqs / 64), max_concurrency = workers * 512
- Other platforms: workers = 1, max_concurrency = workers * 512

Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/b62082bc-b3bd-4495-ba1b-2d4cbfb8cf24
Address code review feedback:
- Add safe fallback for max_num_seqs using getattr with default value 8
- Add log output for resolved max_concurrency value

Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/b62082bc-b3bd-4495-ba1b-2d4cbfb8cf24
Thanks for your contribution!
PaddlePaddle-bot
left a comment
🤖 AI Code Review
2026-04-20 11:18 CST
📋 Review Summary
PR overview: dynamically compute the default values of --workers and --max-concurrency based on platform; on NVIDIA GPUs, auto-scale the worker count with max_num_seqs
Scope of change: entrypoints/openai/ (api_server.py, utils.py)
Impact tag: APIServer
Issues

| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | utils.py:398 | Magic numbers 64/512 lack explanation; consider extracting them as named constants |
| 🟡 Suggestion | utils.py:400 | The max_concurrency computation is duplicated in both branches and can be simplified |
| ❓ Question | utils.py:394 | Semantic difference between `or 8` and `if is None` |
Overall Assessment
The overall implementation logic is clear, and the design of never overriding values the user passed explicitly is sound. The main suggestions are to extract the magic numbers into named constants and to simplify the duplicated branch logic, improving maintainability.
```python
if current_platform.is_cuda():
    if args.workers is None:
        args.workers = math.ceil(max_num_seqs / 64)
```
🟡 Suggestion: the magic numbers 64 and 512 lack explanation
The values 64 and 512 here are key tuning parameters, but there is no comment explaining why they were chosen. Consider extracting them as named constants with comments documenting the rationale, to ease future maintenance and tuning:
```python
# Maximum number of sequences handled per worker
# (empirical value, balances worker count against per-worker load)
_SEQS_PER_WORKER = 64
# Concurrent-connection cap per worker
_CONCURRENCY_PER_WORKER = 512
```

```python
if args.workers is None:
    args.workers = math.ceil(max_num_seqs / 64)
if args.max_concurrency is None:
    args.max_concurrency = args.workers * 512
```
🟡 Suggestion: the max_concurrency computation is identical in both branches and can be simplified
The max_concurrency computation (workers * 512) is the same in the CUDA and non-CUDA branches; it can be hoisted out after the if/else to reduce duplication:
```python
if current_platform.is_cuda():
    if args.workers is None:
        args.workers = math.ceil(max_num_seqs / 64)
else:
    if args.workers is None:
        args.workers = 1
if args.max_concurrency is None:
    args.max_concurrency = args.workers * 512
```

```python
from fastdeploy.platforms import current_platform

max_num_seqs = getattr(args, "max_num_seqs", None) or 8
```
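To see how the scaling formula behaves, here is a standalone sketch (the constant names are illustrative, not from the FastDeploy source) that tabulates workers and max_concurrency for a few max_num_seqs values:

```python
import math

SEQS_PER_WORKER = 64          # illustrative name for the PR's hardcoded 64
CONCURRENCY_PER_WORKER = 512  # illustrative name for the PR's hardcoded 512

for max_num_seqs in (8, 64, 65, 256):
    workers = math.ceil(max_num_seqs / SEQS_PER_WORKER)
    max_concurrency = workers * CONCURRENCY_PER_WORKER
    print(f"max_num_seqs={max_num_seqs}: workers={workers}, max_concurrency={max_concurrency}")
```

Because `ceil` rounds up, 65 sequences already get 2 workers; any remainder beyond a full multiple of 64 allocates one extra worker.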
❓ Question: intent of using `or 8` rather than `if is None`
getattr(args, "max_num_seqs", None) or 8 also falls back to 8 when max_num_seqs is 0 or another falsy value. Although max_num_seqs=0 is unlikely in practice, if the intent is only to handle None, an explicit `if ... is None` check is clearer:
```python
max_num_seqs = getattr(args, "max_num_seqs", None)
if max_num_seqs is None:
    max_num_seqs = 8
```
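The semantic difference the reviewer flags can be demonstrated with a minimal sketch (the `SimpleNamespace` args object and the contrived `max_num_seqs=0` are hypothetical, chosen only to expose the falsy-value pitfall):

```python
from types import SimpleNamespace

# Hypothetical args object; max_num_seqs=0 is contrived to expose the difference.
args = SimpleNamespace(max_num_seqs=0)

# `or` treats every falsy value (0, "", None) as missing and falls back to 8:
via_or = getattr(args, "max_num_seqs", None) or 8

# An explicit None check preserves a legitimate 0:
via_is_none = getattr(args, "max_num_seqs", None)
if via_is_none is None:
    via_is_none = 8

print(via_or, via_is_none)  # 8 0
```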
Motivation
`--workers` and `--max-concurrency` were hardcoded to `1` and `512` respectively. On NVIDIA GPUs with large `max_num_seqs`, a single worker becomes a bottleneck. Defaults should scale with the workload on capable hardware.

Modifications
`fastdeploy/entrypoints/openai/utils.py`:
- Change the `--workers` and `--max-concurrency` defaults to `None`
- Add `resolve_workers_and_concurrency(args)` that resolves defaults based on platform:
  - NVIDIA GPU (CUDA): `workers = ceil(max_num_seqs / 64)`, `max_concurrency = workers * 512`
  - Other platforms: `workers = 1`, `max_concurrency = workers * 512`
- Log the resolved values (when a `None` default is resolved)

`fastdeploy/entrypoints/openai/api_server.py`:
- Call `resolve_workers_and_concurrency(args)` after `parse_args()`, before any usage
- Add the resolved `max_concurrency` to startup logging

Usage or Command
Behavior is automatic. Explicitly passing `--workers` or `--max-concurrency` still overrides the computed defaults.

Accuracy Tests
No model output changes — this only affects API server worker/concurrency configuration.
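Putting the pieces from the Modifications section together, the resolution flow can be sketched roughly as follows. This is a standalone approximation: the constant names and the `is_cuda` parameter (standing in for FastDeploy's `current_platform.is_cuda()`) are assumptions, not the actual source; the fallback value 8 comes from the PR description.

```python
import math
from types import SimpleNamespace

# Assumed names mirroring the PR's hardcoded 64 and 512.
_SEQS_PER_WORKER = 64
_CONCURRENCY_PER_WORKER = 512

def resolve_workers_and_concurrency(args, is_cuda):
    """Fill in workers/max_concurrency only when the user left them as None.

    `is_cuda` replaces current_platform.is_cuda() so this sketch runs
    without the FastDeploy library or GPU hardware.
    """
    max_num_seqs = getattr(args, "max_num_seqs", None)
    if max_num_seqs is None:
        max_num_seqs = 8  # safe fallback, per the review feedback
    if args.workers is None:
        args.workers = math.ceil(max_num_seqs / _SEQS_PER_WORKER) if is_cuda else 1
    if args.max_concurrency is None:
        args.max_concurrency = args.workers * _CONCURRENCY_PER_WORKER
    return args

# Auto-resolution on a CUDA-like platform:
auto = resolve_workers_and_concurrency(
    SimpleNamespace(workers=None, max_concurrency=None, max_num_seqs=256), is_cuda=True
)
# Explicit values are never overwritten:
manual = resolve_workers_and_concurrency(
    SimpleNamespace(workers=2, max_concurrency=None, max_num_seqs=256), is_cuda=True
)
print(auto.workers, auto.max_concurrency)      # 4 2048
print(manual.workers, manual.max_concurrency)  # 2 1024
```

Note how the explicit `workers=2` survives resolution untouched, while its `max_concurrency` is still derived from it; this matches the PR's "explicit flags override computed defaults" behavior.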
Checklist
- Run `pre-commit` before commit.
- No dedicated unit tests: the resolution depends on `current_platform.is_cuda()`, which requires GPU hardware. The resolution function is straightforward and exercised on every server startup.
- If this PR targets the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.