[APIServer] Dynamic default values for workers and max-concurrency based on platform#7497

Draft
Copilot wants to merge 2 commits into develop from copilot/update-default-values-workers-max-concurrency

Conversation

Contributor

Copilot AI commented Apr 20, 2026

Motivation

--workers and --max-concurrency were hardcoded to 1 and 512 respectively. On NVIDIA GPUs with large max_num_seqs, a single worker becomes a bottleneck. Defaults should scale with the workload on capable hardware.

Modifications

  • fastdeploy/entrypoints/openai/utils.py:
    • Changed --workers and --max-concurrency defaults to None
    • Added resolve_workers_and_concurrency(args) that resolves defaults based on platform:
      • NVIDIA GPU (CUDA): workers = ceil(max_num_seqs / 64), max_concurrency = workers * 512
      • Other platforms: workers = 1, max_concurrency = workers * 512
    • User-provided values are never overridden (only None is resolved)
  • fastdeploy/entrypoints/openai/api_server.py:
    • Calls resolve_workers_and_concurrency(args) after parse_args(), before any usage
    • Added max_concurrency to startup logging
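
Based on the description above, a minimal self-contained sketch of the resolution logic might look like the following. Note this is an illustration, not the PR's actual code: the constant names and the is_cuda parameter are assumptions introduced here (the real function takes only args and calls current_platform.is_cuda() internally), while the ceil(max_num_seqs / 64) formula, the workers * 512 product, and the max_num_seqs fallback of 8 come from the PR description and commit notes:

```python
import math

# Assumed tuning constants, taken from the formulas in the PR description
_SEQS_PER_WORKER = 64
_CONCURRENCY_PER_WORKER = 512


def resolve_workers_and_concurrency(args, is_cuda):
    """Fill in defaults for args.workers / args.max_concurrency when None.

    Sketch only: `is_cuda` stands in for current_platform.is_cuda(), which
    needs the real FastDeploy package. User-provided values are left intact.
    """
    max_num_seqs = getattr(args, "max_num_seqs", None) or 8  # fallback from commit notes
    if args.workers is None:
        args.workers = math.ceil(max_num_seqs / _SEQS_PER_WORKER) if is_cuda else 1
    if args.max_concurrency is None:
        args.max_concurrency = args.workers * _CONCURRENCY_PER_WORKER
    return args
```

For example, with max_num_seqs=256 on CUDA this resolves to 4 workers and a max concurrency of 2048, while an explicitly passed --workers value is never recomputed.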

Usage or Command

Behavior is automatic. Explicitly passing --workers or --max-concurrency still overrides the computed defaults:

# Auto-resolved based on platform and max_num_seqs
fastdeploy serve --model my_model

# Explicit override still works
fastdeploy serve --model my_model --workers 4 --max-concurrency 2048
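
As a sanity check on the CUDA formula (assuming a hypothetical max_num_seqs of 256), the auto-resolved defaults work out to the same values as the explicit override in the second command:

```python
import math

max_num_seqs = 256  # hypothetical configuration value
workers = math.ceil(max_num_seqs / 64)  # -> 4
max_concurrency = workers * 512         # -> 2048
print(workers, max_concurrency)
```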

Accuracy Tests

No model output changes — this only affects API server worker/concurrency configuration.

Checklist

  • Add at least a tag in the PR title.
  • Format your code, run pre-commit before commit.
  • Add unit tests. No unit tests added — logic depends on current_platform.is_cuda() which requires GPU hardware. The resolution function is straightforward and exercised on every server startup.
  • Provide accuracy results. N/A — no model output changes.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI added 2 commits April 20, 2026 02:47
… platform

Change workers and max-concurrency defaults from hardcoded values (1 and 512)
to None, then resolve them dynamically:
- NVIDIA GPU (CUDA): workers = ceil(max_num_seqs / 64), max_concurrency = workers * 512
- Other platforms: workers = 1, max_concurrency = workers * 512

Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/b62082bc-b3bd-4495-ba1b-2d4cbfb8cf24
Address code review feedback:
- Add safe fallback for max_num_seqs using getattr with default value 8
- Add log output for resolved max_concurrency value

Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/b62082bc-b3bd-4495-ba1b-2d4cbfb8cf24
Copilot AI review requested due to automatic review settings April 20, 2026 03:04
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@paddle-bot

paddle-bot bot commented Apr 20, 2026

Thanks for your contribution!

@paddle-bot paddle-bot bot added the contributor External developers label Apr 20, 2026

@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 AI Code Review | 2026-04-20 11:18 CST

📋 Review Summary

PR overview: compute the default values of --workers and --max-concurrency dynamically based on the platform; on NVIDIA GPUs the worker count auto-scales with max_num_seqs
Scope of change: entrypoints/openai/ (api_server.py, utils.py)
Impact tag: APIServer

Issues

Level | File | Summary
🟡 Suggestion | utils.py:398 | The magic numbers 64/512 are unexplained; extract them into named constants
🟡 Suggestion | utils.py:400 | The max_concurrency computation is duplicated in both branches and can be simplified
❓ Question | utils.py:394 | Semantic difference between "or 8" and an explicit "if ... is None" check

Overall assessment

The implementation logic is clear overall, and the design of never overriding explicitly user-provided values is sound. The main suggestions are to extract the magic numbers into named constants and to simplify the duplicated branch logic, which would improve maintainability.


if current_platform.is_cuda():
    if args.workers is None:
        args.workers = math.ceil(max_num_seqs / 64)


🟡 Suggestion: the magic numbers 64 and 512 are unexplained

The values 64 and 512 are key tuning parameters here, but there is no comment explaining why they were chosen. Consider extracting them into named constants with comments on the rationale, to make future maintenance and tuning easier:

# Maximum number of sequences per worker (empirical value balancing worker count against load)
_SEQS_PER_WORKER = 64
# Concurrent-connection cap per worker
_CONCURRENCY_PER_WORKER = 512

if args.workers is None:
    args.workers = math.ceil(max_num_seqs / 64)
if args.max_concurrency is None:
    args.max_concurrency = args.workers * 512


🟡 Suggestion: the max_concurrency computation is identical in both branches and can be simplified

The max_concurrency computation (workers * 512) is the same in the CUDA and non-CUDA branches; it can be hoisted out after the if/else to reduce duplication:

if current_platform.is_cuda():
    if args.workers is None:
        args.workers = math.ceil(max_num_seqs / 64)
else:
    if args.workers is None:
        args.workers = 1

if args.max_concurrency is None:
    args.max_concurrency = args.workers * 512

"""
from fastdeploy.platforms import current_platform

max_num_seqs = getattr(args, "max_num_seqs", None) or 8


❓ Question: intent of using "or 8" rather than an "is None" check

getattr(args, "max_num_seqs", None) or 8 also falls back to 8 when max_num_seqs is 0 or another falsy value. Although max_num_seqs=0 is unlikely in practice, if the intent is only to handle None, an explicit "if ... is None" check is clearer:

max_num_seqs = getattr(args, "max_num_seqs", None)
if max_num_seqs is None:
    max_num_seqs = 8
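
The semantic difference is easy to demonstrate with two small stand-in helpers (hypothetical names, written here only to contrast the two fallback styles):

```python
def fallback_or(value):
    # 'or' fallback: 0, "", and None all collapse to the default 8
    return value or 8


def fallback_is_none(value):
    # explicit check: only None becomes 8; falsy values like 0 pass through
    return 8 if value is None else value


print(fallback_or(0))        # 8  -- falsy value silently replaced
print(fallback_is_none(0))   # 0  -- falsy value preserved
```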
