GLM4.7-30B inference speed issue, and it tends to produce very long responses #131


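Thanks to the team for providing such an excellent model and resources! I ran into a problem while trying to apply it. Could you please take a look? Thank you!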

### System Info

4 × H20-156G GPUs
Using vllm-0.14.0rc2.dev173+g13f6630a9-cp38-abi3-manylinux_2_31_x86_64 + transformer5.00dev + flash-attention2.8.3


### Who can help?

_No response_

### Information

- [ ] The official example scripts
- [ ] My own modified scripts

### Reproduction

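Inference was run on 4 × H20-156G GPUs with the same configuration and the same inputs and outputs. The speed comparison is as follows:
1. GLM4.7-30B-FLASH [max 15k+ tokens]: average 264233 tokens / 157.79s = 1674.58 tokens/s
2. GLM4.7-30B-FLASH [max 2500 tokens]: 191839 tokens / 58.56s = 3275.70 tokens/s. With this limit, however, many outputs on my task are truncated before they finish.
3. QWEN3-30B-A3B (without even using MTP): [generation speed] 44581 tokens / 11.08s = 4024.44 tokens/s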
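For reference, the tokens/s figures in the list above can be recomputed with a few lines of Python. This is only a sketch of the arithmetic: the token counts and times are copied from the list, while the run labels and the `throughput` helper are names made up for illustration. The recomputed values may differ from the reported ones in the last digits, presumably because the logged times are rounded to two decimals.

```python
# Token counts and wall-clock times copied from the comparison list above.
# The dictionary keys are illustrative labels, not official model names.
runs = {
    "GLM4.7-30B-FLASH (max 15k+ tokens)": (264233, 157.79),
    "GLM4.7-30B-FLASH (max 2500 tokens)": (191839, 58.56),
    "QWEN3-30B-A3B (no MTP)": (44581, 11.08),
}


def throughput(tokens: int, seconds: float) -> float:
    """Generated tokens per second of wall-clock time."""
    return tokens / seconds


rates = {name: throughput(t, s) for name, (t, s) in runs.items()}
baseline_name = "QWEN3-30B-A3B (no MTP)"
baseline_tokens = runs[baseline_name][0]

for name, rate in rates.items():
    tokens = runs[name][0]
    print(
        f"{name}: {rate:.2f} tokens/s, "
        f"{rate / rates[baseline_name]:.2f}x the QWEN3 rate, "
        f"{tokens / baseline_tokens:.2f}x the QWEN3 token count"
    )
```

If the three runs really used the same prompts, the last column suggests that most of the extra wall-clock time comes from the much larger number of generated tokens, not only from the lower tokens/s rate.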



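GLM4.7 seems to tend toward very long responses, and it is also somewhat slower, so in practice it feels even slower than a dense model. What is the cause: the model itself, missing support in the vLLM implementation, or a flash-attention issue?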


```python
from vllm import LLM

llm = LLM(
    model=model_path,
    gpu_memory_utilization=0.75,
    trust_remote_code=True,
    tensor_parallel_size=4,  # per the officially recommended configuration
    dtype="auto",  # automatically select the best precision
    max_model_len=16384,  # per the official recommendation; supports max new tokens 16384
    # MTP speculative decoding config (per the official recommendation:
    # --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4)
    speculative_config={
        "method": "mtp",
        "num_speculative_tokens": 1,
        "num_speculative_steps": 3,
        "eagle_topk": 1,
        "num_draft_tokens": 4,
    },
)
```

After changing max_tokens to 2500, the speed is still about the same.
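To separate "the model writes long answers" from "decoding is slow", it may help to look at the per-request output lengths and at how many requests stop because they hit the token limit. The sketch below is a hypothetical helper, not part of vLLM: `summarize` and the sample data are made up for illustration, and it assumes each request has been reduced to a `(num_output_tokens, finish_reason)` pair where a finish reason of `"length"` means the output was cut off at `max_tokens`.

```python
from statistics import mean, median


def summarize(lengths_and_reasons):
    """Summarize output lengths and the share of truncated requests.

    lengths_and_reasons: list of (num_output_tokens, finish_reason) pairs.
    """
    lengths = [n for n, _ in lengths_and_reasons]
    truncated = sum(1 for _, reason in lengths_and_reasons if reason == "length")
    return {
        "requests": len(lengths),
        "mean_tokens": mean(lengths),
        "median_tokens": median(lengths),
        "longest_output": max(lengths),
        "truncated_fraction": truncated / len(lengths),
    }


# Made-up sample values for illustration only, not measurements from this issue.
sample = [(2500, "length"), (1800, "stop"), (2500, "length"), (950, "stop")]
stats = summarize(sample)
print(stats)
```

With vLLM, the pairs could presumably be collected from the results of `llm.generate(...)` as `(len(o.outputs[0].token_ids), o.outputs[0].finish_reason)` for each returned request; that mapping is an assumption about how the script stores its results and has not been checked against this setup.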

### Expected behavior

Speed at least on par with QWEN3.
