Summary
find_possible_tp() in src/planner/capacity_planner.py:795 returns all integer divisors of num_attention_heads as valid tensor parallelism values. For example, Qwen2.5-14B has 40 attention heads, so the function returns {1, 2, 4, 5, 8, 10, 20, 40}.
However, vLLM only supports TP values that are powers of 2 (1, 2, 4, 8). This means TP=5, TP=10, etc. are returned as valid but would fail at deployment time.
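To illustrate the gap, here is a minimal sketch of the divisor logic and the proposed power-of-2 filter. The actual find_possible_tp() implementation is not reproduced here; both functions below are hypothetical reconstructions for demonstration only.

```python
def find_possible_tp(num_attention_heads: int) -> set[int]:
    # Assumed current behavior: every integer divisor of the
    # attention head count is treated as a valid TP size.
    return {d for d in range(1, num_attention_heads + 1)
            if num_attention_heads % d == 0}

def find_possible_tp_pow2(num_attention_heads: int) -> set[int]:
    # Possible fix: keep only divisors that are powers of 2,
    # using the bit trick d & (d - 1) == 0 (true only for powers of 2).
    return {d for d in find_possible_tp(num_attention_heads)
            if d & (d - 1) == 0}

# Qwen2.5-14B has 40 attention heads.
print(sorted(find_possible_tp(40)))       # [1, 2, 4, 5, 8, 10, 20, 40]
print(sorted(find_possible_tp_pow2(40)))  # [1, 2, 4, 8]
```

Under this sketch, TP=5 and TP=10 fall out of the candidate set, which would have prevented the "5x H200" recommendation described below.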
Observed behavior
A recommendation was generated with "5x H200" (TP=5) for Qwen2.5-14B-Instruct, which is not a valid vLLM configuration.
Questions
- Should find_possible_tp() filter to powers of 2 only?
- Are there other inference frameworks where non-power-of-2 TP is valid?
- If filtering to powers of 2, should this be done in find_possible_tp() itself or at the call sites?