Commit 31bb049

[Model] Support Step-3.5-Flash
1 parent ba75777 commit 31bb049

5 files changed

Lines changed: 265 additions & 10 deletions

docs/source/tutorials/index.md

Lines changed: 1 addition & 0 deletions
@@ -11,4 +11,5 @@ multi_xpu_GLM-4.5
 multi_xpu_Qwen3-Coder-480B-A35B(W8A8)
 multi_xpu_DeepSeek-V3.2-Exp-w8a8
 multi_xpu_GLM-5-W8A8-INT8
+multi_xpu_Step-3.5-Flash
 :::
Lines changed: 240 additions & 0 deletions
@@ -0,0 +1,240 @@
# Multi XPU (Step-3.5-Flash)

## Run vllm-kunlun 0.15.1-dev on Multi XPU

Set up the environment using a container:

```bash
#!/bin/bash
# rundocker.sh
# Usage: bash rundocker.sh <container-name>
XPU_NUM=8
DOCKER_DEVICE_CONFIG=""
if [ $XPU_NUM -gt 0 ]; then
    # Expose every XPU device node, plus the control device, to the container.
    for idx in $(seq 0 $((XPU_NUM-1))); do
        DOCKER_DEVICE_CONFIG="${DOCKER_DEVICE_CONFIG} --device=/dev/xpu${idx}:/dev/xpu${idx}"
    done
    DOCKER_DEVICE_CONFIG="${DOCKER_DEVICE_CONFIG} --device=/dev/xpuctrl:/dev/xpuctrl"
fi

# Placeholder: replace with the image you want to run.
export build_image="xxxxxxxxxxxxxxxxx"

docker run -itd ${DOCKER_DEVICE_CONFIG} \
    --net=host \
    --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    --tmpfs /dev/shm:rw,nosuid,nodev,exec,size=32g \
    -v /home/users/vllm-kunlun:/home/vllm-kunlun \
    -v /usr/local/bin/xpu-smi:/usr/local/bin/xpu-smi \
    --name "$1" \
    -w /workspace \
    "$build_image" /bin/bash
```
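
To make the device loop in `rundocker.sh` easier to check, here is a minimal Python sketch of the same flag construction. The helper name `build_device_flags` is hypothetical and not part of vllm-kunlun; the device paths are the ones used in the script above.

```python
def build_device_flags(xpu_num: int) -> list[str]:
    """Return the docker --device flags that the shell loop above builds."""
    if xpu_num <= 0:
        # The shell script adds no device flags at all in this case.
        return []
    flags = [f"--device=/dev/xpu{idx}:/dev/xpu{idx}" for idx in range(xpu_num)]
    # The control device is added once, after the per-XPU devices.
    flags.append("--device=/dev/xpuctrl:/dev/xpuctrl")
    return flags


if __name__ == "__main__":
    print(" ".join(build_device_flags(8)))
```

With `XPU_NUM=8` this yields nine flags: eight XPU device nodes and one control device.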

### Offline Inference on Multi XPU

Run the offline inference script in a container:

```python
# Set these environment variables in the shell before running the script:
# unset XPU_DUMMY_EVENT
# export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
# export XFT_USE_FAST_SWIGLU=1  # use the fast SwiGLU implementation
# export XPU_USE_FAST_SWIGLU=1  # use the fast SwiGLU implementation in the MoE operator
# export XMLIR_CUDNN_ENABLED=1
# export XPU_USE_DEFAULT_CTX=1
# export XMLIR_FORCE_USE_XPU_GRAPH=1
# export XPU_USE_MOE_SORTED_THRES=128
# export VLLM_HOST_IP=127.0.0.1
# export XMLIR_ENABLE_MOCK_TORCH_COMPILE=false
# export VLLM_USE_V1=1
# export USE_ORI_ROPE=1
# export KUNLUN_DISABLE_SMALL_MOE=1  # Step-3.5-Flash temporary fix
#
# Then run: python /workspace/offline.py

from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/Step-3.5-Flash",
    tensor_parallel_size=8,
    dtype="bfloat16",
    max_model_len=32768,
    gpu_memory_utilization=0.9,
    trust_remote_code=True,
    distributed_executor_backend="mp",
    block_size=128,
    max_num_seqs=128,
    max_num_batched_tokens=32768,
    enable_prefix_caching=False,
    enable_chunked_prefill=False,
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    top_k=10,
    max_tokens=512,
    stop=["<|end|>", "</s>"],
)

# The prompt is in Chinese: "Hello, please introduce yourself".
prompt = """
<|user|>
你好,请介绍一下你自己
<|assistant|>
"""

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
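
If exporting the variables in the shell is inconvenient, they could also be set from Python before `vllm` is imported. The sketch below is an assumption-laden illustration: `KUNLUN_ENV` and `apply_kunlun_env` are hypothetical names, the values are copied from the commented `export` lines above, and it assumes none of the variables is consumed before the call.

```python
import os

# Names and values copied from the commented export lines in the script above.
KUNLUN_ENV = {
    "CUDA_VISIBLE_DEVICES": "0,1,2,3,4,5,6,7",
    "XFT_USE_FAST_SWIGLU": "1",
    "XPU_USE_FAST_SWIGLU": "1",
    "XMLIR_CUDNN_ENABLED": "1",
    "XPU_USE_DEFAULT_CTX": "1",
    "XMLIR_FORCE_USE_XPU_GRAPH": "1",
    "XPU_USE_MOE_SORTED_THRES": "128",
    "VLLM_HOST_IP": "127.0.0.1",
    "XMLIR_ENABLE_MOCK_TORCH_COMPILE": "false",
    "VLLM_USE_V1": "1",
    "USE_ORI_ROPE": "1",
    "KUNLUN_DISABLE_SMALL_MOE": "1",  # Step-3.5-Flash temporary fix
}


def apply_kunlun_env(environ=os.environ) -> None:
    """Apply the settings to `environ` (hypothetical helper)."""
    environ.pop("XPU_DUMMY_EVENT", None)  # mirrors `unset XPU_DUMMY_EVENT`
    environ.update(KUNLUN_ENV)
```

Call `apply_kunlun_env()` at the very top of `offline.py`, before `from vllm import LLM, SamplingParams`. At least `KUNLUN_DISABLE_SMALL_MOE` is read once at module import time in this commit, so setting it later would have no effect.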

If the script runs successfully, you will see output like the following (the model replies in Chinese, introducing itself as Step, developed by StepFun):

```bash
==================================================
你好!我是 **Step**,由 **阶跃星辰(StepFun)** 开发的多模态大语言模型。
我具备自然语言理解与生成、图像分析、视觉推理、数理逻辑、知识问答等多种能力。不仅能理解和处理文字信息,还能结合图片进行多模态推理与分析。

我的核心原则是:诚实可靠、有用友善、尊重隐私、促进积极交流、保持客观中立、拒绝有害内容。
简单来说,我的目标是为你提供准确、有帮助、温暖的智能支持。

如果你愿意,可以告诉我你的兴趣或需求,我会尽力帮你实现目标 😊
你想先了解我在哪些方面能帮到你吗?
==================================================
```

### Online Serving on Multi XPU

Start the vLLM server across multiple XPUs:

```bash
unset XPU_DUMMY_EVENT
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export XFT_USE_FAST_SWIGLU=1  # use the fast SwiGLU implementation
export XPU_USE_FAST_SWIGLU=1  # use the fast SwiGLU implementation in the MoE operator
export XMLIR_CUDNN_ENABLED=1
export XPU_USE_DEFAULT_CTX=1
export XMLIR_FORCE_USE_XPU_GRAPH=1
export XPU_USE_MOE_SORTED_THRES=128
export VLLM_HOST_IP=127.0.0.1
export XMLIR_ENABLE_MOCK_TORCH_COMPILE=false
export VLLM_USE_V1=1
export USE_ORI_ROPE=1
export KUNLUN_DISABLE_SMALL_MOE=1  # Step-3.5-Flash temporary fix

python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8356 \
    --model /models/Step-3.5-Flash \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --max-model-len 32768 \
    --tensor-parallel-size 8 \
    --dtype bfloat16 \
    --max-num-seqs 128 \
    --max-num-batched-tokens 32768 \
    --block-size 128 \
    --no-enable-prefix-caching \
    --no-enable-chunked-prefill \
    --distributed-executor-backend mp \
    --served-model-name Step-3.5-Flash \
    --reasoning-parser step3p5 \
    --enable-auto-tool-choice \
    --tool-call-parser step3p5
```

If the service starts successfully, you will see log lines like these:

```bash
(APIServer pid=133800) INFO: Started server process [133800]
(APIServer pid=133800) INFO: Waiting for application startup.
(APIServer pid=133800) INFO: Application startup complete.
```
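
Loading the model can take a while, so a client script may want to wait for the server instead of watching the log by hand. The following is a minimal sketch under stated assumptions: `wait_until_ready` and `server_is_healthy` are hypothetical helpers, and the `/health` path is assumed to be served by the OpenAI-compatible server on the port configured above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def server_is_healthy(url: str = "http://127.0.0.1:8356/health") -> bool:
    """Return True if the (assumed) health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: not ready yet.
        return False


def wait_until_ready(
    check: Callable[[], bool],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `check` until it returns True or `timeout_s` has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A typical call would be `wait_until_ready(server_is_healthy)` before sending the first request. The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays.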

Once the server is running, you can query the model with a prompt (here, a Chinese request for a brief self-introduction):

```bash
curl http://127.0.0.1:8356/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Step-3.5-Flash",
        "messages": [
            {"role": "user", "content": "你好,简单介绍一下你自己"}
        ],
        "max_tokens": 200,
        "temperature": 0.7
    }'
```

Or use a Python script:

```python
import requests

URL = "http://127.0.0.1:8356/v1/chat/completions"

payload = {
    "model": "Step-3.5-Flash",
    "messages": [
        # The prompt is in Chinese: "Hello, please introduce yourself".
        {"role": "user", "content": "你好,请介绍一下你自己"}
    ],
    "max_tokens": 500,
    "top_p": 0.8,
    "top_k": 10,
    "temperature": 0.7,
    # The model's responses are occasionally inaccurate at present;
    # adjusting the sampling parameters below may help.
    # "presence_penalty": 0.3,
    # "repetition_penalty": 1.05,
}

headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer EMPTY",
}

resp = requests.post(URL, headers=headers, json=payload)
resp.raise_for_status()  # fail early on a non-2xx response
data = resp.json()

choice = data["choices"][0]
answer = choice["message"]["content"]

print("\n===== ANSWER =====\n")
print(answer)
```

If the query succeeds, the client prints output like the following:

```bash
# curl
{"id":"chatcmpl-93112d4d8e047a9c","object":"chat.completion","created":1776166074,"model":"Step-3.5-Flash","choices":[{"index":0,"message":{"role":"assistant","content":"你好!我是 **Step**,由 **阶跃星辰(StepFun)** 开发的多\n","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":"好的,用户让我简单介绍一下自己。首先我得明确身份,我是Step,由阶跃星辰(StepFun)开发。用户可能刚接触我,需要基础信息,比如功能、特点以及使用原则。\n\n然后考虑用户的需求场景,可能是第一次使用AI助手,或者想比较不同的AI。需要突出我的多模态能力,比如处理文字和图片,还有逻辑推理、知识问答这些核心功能。同时要强调中文","reasoning_content":"好的,用户让我简单介绍一下自己。首先我得明确身份,我是Step,由阶跃星辰(StepFun)开发。用户可能刚接触我,需要基础信息,比如功能、特点以及使用原则。\n\n然后考虑用户的需求场景,可能是第一次使用AI助手,或者想比较不同的AI。需要突出我的多模态能力,比如处理文字和图片,还有逻辑推理、知识问答这些核心功能。同时要强调中文"},"logprobs":null,"finish_reason":"stop","stop_reason":1,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":17,"total_tokens":131,"completion_tokens":114,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

# python script
===== ANSWER =====

你好!我是 **Step**,由 **阶跃星辰(StepFun)** 开发的大语言模型。

我具备以下主要能力和特点:
- 🧠 **自然语言理解与生成**:能够流畅地进行多轮对话、写作、总结、翻译等;
- 👁️ **多模态推理**:不仅能处理文字,还能理解和分析图片内容,进行视觉推理;
- 📚 **知识问答与逻辑推理**:擅长基于事实回答问题,并解决数学、逻辑类任务;
- 💡 **创意表达**:可辅助创作故事、诗歌、策划方案等富有创意的内容;
- 🌍 **多语言支持**:能用多种语言与用户交流;
- 🤝 **安全可靠**:遵循诚实、友善、尊重隐私的原则,保持客观中立。

我目前是 **完全免费使用** 的,不收集或存储你的个人隐私信息。你可以随时向我提问、讨论、创作或探索各种主题~

你想先了解我在哪方面最擅长吗?
```
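
With `--reasoning-parser step3p5` enabled, the JSON response shown above carries the model's reasoning separately from the final answer (the `reasoning` and `reasoning_content` fields next to `content`). A minimal sketch for separating the two, assuming the response has the shape shown above; `split_reasoning` is a hypothetical helper name:

```python
def split_reasoning(response: dict) -> tuple[str, str]:
    """Return (reasoning, answer) from a chat-completion response dict."""
    message = response["choices"][0]["message"]
    # Both field names appear in the sample response; prefer reasoning_content.
    reasoning = message.get("reasoning_content") or message.get("reasoning") or ""
    answer = message.get("content") or ""
    return reasoning, answer
```

In the Python client above this would be called as `reasoning, answer = split_reasoning(resp.json())`. Note that a small `max_tokens` budget can be largely consumed by the reasoning, leaving a truncated `content`.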

Logs of the vLLM server:

```bash
(APIServer pid=182858) INFO 04-14 19:45:26 [loggers.py:257] Engine 000: Avg prompt throughput: 1.7 tokens/s, Avg generation throughput: 19.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=182858) INFO: 127.0.0.1:12670 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=182858) INFO 04-14 19:45:36 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 24.2 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=182858) INFO 04-14 19:45:46 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
```

vllm_kunlun/ops/_kunlun_ops.py

Lines changed: 4 additions & 2 deletions
@@ -18,13 +18,15 @@
 """kunlun custom op entry"""

 from typing import Optional
-
+import os
 import cocopod  # noqa
 import torch
 import xspeedgate_ops  # noqa
 from vllm.logger import init_logger
 from vllm.v1.worker.workspace import current_workspace_manager

+DISABLE_SMALL_MOE = os.environ.get("KUNLUN_DISABLE_SMALL_MOE", "0") == "1"
+
 logger = init_logger(__name__)

 try:
@@ -475,7 +477,7 @@ def fused_moe(
         # attn_metadata = attn_metadata[prefix]

         # if attn_metadata is None or attn_metadata.num_prefills > 0 or :
-        if M * moe_top_k < 400:
+        if M * moe_top_k < 400 and not DISABLE_SMALL_MOE:
             sorted_tokens_idx, sorted_tokens_num_lod, moe_expand = (
                 torch.ops.xspeedgate_ops.moe_pre_small(
                     topk_ids, global_num_experts, False, False, hidden_states
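
The gating logic in this hunk can be read in isolation as the sketch below. The function names and the constant name are hypothetical; the threshold `400`, the variable name, and the `"1"` comparison come from the diff. It assumes `M` is the number of tokens in the batch.

```python
import os

SMALL_MOE_THRESHOLD = 400  # literal taken from the diff above


def small_moe_disabled(environ=os.environ) -> bool:
    """True only when KUNLUN_DISABLE_SMALL_MOE is exactly the string "1"."""
    return environ.get("KUNLUN_DISABLE_SMALL_MOE", "0") == "1"


def use_small_moe_path(num_tokens: int, moe_top_k: int, disabled: bool) -> bool:
    """Mirror of `M * moe_top_k < 400 and not DISABLE_SMALL_MOE`."""
    return num_tokens * moe_top_k < SMALL_MOE_THRESHOLD and not disabled
```

Two details follow from the diff: values such as `true` or `yes` do not disable the small path, and because `DISABLE_SMALL_MOE` is evaluated once at import time, the variable has to be exported before the module is loaded.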

vllm_kunlun/ops/fused_moe/layer.py

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -38,15 +38,24 @@ def apply_monolithic(
3838
layer,
3939
x: torch.Tensor,
4040
router_logits: torch.Tensor,
41+
routed_scaling_factor: float = 1.0,
4142
) -> torch.Tensor | tuple[torch.Tensor, torch.Tensor]:
4243
"""
4344
Monolithic mode entry point.
4445
When is_monolithic=True, FusedMoE.forward_impl calls this method
4546
directly with (layer, hidden_states, router_logits), bypassing
4647
the default routing logic.
48+
49+
Note: upstream forward_impl does not pass routed_scaling_factor,
50+
so we read it from the layer attribute (consistent with the CPU
51+
monolithic path in upstream vLLM).
4752
"""
4853
from vllm_kunlun.ops._kunlun_ops import KunlunOps as ops
4954

55+
scaling_factor = getattr(
56+
layer, "routed_scaling_factor", routed_scaling_factor
57+
)
58+
5059
if self.moe.use_ep:
5160
return ops.fused_moe_ep(
5261
x,
@@ -78,4 +87,5 @@ def apply_monolithic(
7887
e_score_correction_bias=layer.e_score_correction_bias,
7988
w1_bias=getattr(layer, "w13_bias", None),
8089
w2_bias=getattr(layer, "w2_bias", None),
90+
router_scaling_factor=scaling_factor,
8191
)
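
The attribute lookup added here follows a simple precedence rule: the layer attribute wins, and the keyword argument is only a fallback. A small sketch, using `SimpleNamespace` as a stand-in for the real layer object; `resolve_scaling_factor` is a hypothetical name:

```python
from types import SimpleNamespace


def resolve_scaling_factor(layer, default: float = 1.0):
    """Same lookup as the diff: prefer the layer attribute, else the default."""
    return getattr(layer, "routed_scaling_factor", default)


layer_with_factor = SimpleNamespace(routed_scaling_factor=2.5)
layer_without_factor = SimpleNamespace()

print(resolve_scaling_factor(layer_with_factor))
print(resolve_scaling_factor(layer_without_factor))
```

One consequence of using `getattr` with a default: if the attribute exists but is `None`, `None` is what gets passed on, since the fallback applies only when the attribute is missing.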

vllm_kunlun/v1/attention/backends/kunlun_attn.py

Lines changed: 10 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -14,6 +14,8 @@
1414
# limitations under the License.
1515
# This file is a part of the vllm-kunlun project.
1616
#
17+
import copy
18+
import inspect
1719
from dataclasses import dataclass
1820
from itertools import accumulate
1921
from typing import (
@@ -31,7 +33,9 @@
3133
import kunlun_ops
3234
import numpy as np
3335
import torch
36+
3437
from vllm.config import VllmConfig
38+
from vllm.utils.math_utils import cdiv
3539
from vllm.v1.attention.backend import (
3640
AttentionBackend,
3741
AttentionCGSupport,
@@ -41,21 +45,19 @@
4145
AttentionType,
4246
CommonAttentionMetadata,
4347
)
48+
from vllm.v1.attention.backends.fa_utils import get_flash_attn_version
4449
from vllm.v1.attention.backends.utils import split_decodes_and_prefills
50+
from vllm.v1.kv_cache_interface import AttentionSpec
4551

46-
from vllm_kunlun.ops.paged_attn import PagedAttention, PagedAttentionMetadata
52+
from vllm_kunlun.ops.paged_attn import (
53+
PagedAttention,
54+
PagedAttentionMetadata,
55+
)
4756

4857
if TYPE_CHECKING:
4958
from vllm.v1.core.sched.output import SchedulerOutput
5059
from vllm.v1.worker.gpu_input_batch import InputBatch
5160

52-
import inspect
53-
54-
from vllm.utils.math_utils import cdiv
55-
from vllm.v1.attention.backends.fa_utils import get_flash_attn_version
56-
from vllm.v1.kv_cache_interface import AttentionSpec
57-
58-
5961
class KunlunAttentionBackend(AttentionBackend):
6062
"""KunlunAttentionBackend"""
6163
