New Features
- Added deployment support for GLM 4.5 text models #3928
- Added deployment support for GPT-OSS-BF16 text models #4240
- Added deployment support for the ERNIE-4.5-VL-28B-A3B-Thinking multimodal reasoning model; see the documentation for details
- Added deployment support for the PaddleOCR-VL multimodal model #4936
- Added constrained-decoding structured output support for multimodal and reasoning models #2749
- Added Prefix Caching and Encoder Caching support for multimodal models #4134
- Added Wfp8Afp8 online quantization inference support #4051 #4238
- Added static Cfp8 quantized inference support #4568
- LogProb support
- HuggingFace Safetensors model loading promoted to a default capability
- Improved CUDA Graphs support on NVIDIA GPUs
- Added a terminal command-line CLI toolset
- Added anonymous port support for `engine-worker-queue-port` and `cache-queue-port` #4597
- Added `LogitsProcessors` post-processing parameter support #4515
- Added a ReasoningParser and ToolParser for the ERNIE-4.5-VL-Thinking model #4571
- The `usage` field now reports statistics for multimodal input/output tokens and reasoning tokens #4648 #4520
- Added the `n` parameter to return multiple generations for a single request #4273
- Added a `tools` parameter to the offline inference chat interface for tool calling #4415
- Added download retries for URL data in multimodal preprocessing #3838
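Several of the request-level additions above (the `n` parameter, tool calling, and the extended `usage` statistics) surface through the OpenAI-compatible chat API. Below is a minimal sketch of a request body exercising them; the model name and tool definition are illustrative assumptions, not values from this release:

```python
import json

# Hypothetical body for an OpenAI-compatible /v1/chat/completions request.
# Field names follow the standard OpenAI chat schema; the model name and
# the "add" tool are made up for illustration.
payload = {
    "model": "ERNIE-4.5-VL-28B-A3B-Thinking",
    "messages": [{"role": "user", "content": "Add 2 and 3 using the tool."}],
    "n": 2,  # return two candidate generations for this single request (#4273)
    "tools": [  # tool calling through the chat interface (#4415)
        {
            "type": "function",
            "function": {
                "name": "add",
                "description": "Add two integers",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "a": {"type": "integer"},
                        "b": {"type": "integer"},
                    },
                    "required": ["a", "b"],
                },
            },
        }
    ],
}

body = json.dumps(payload)  # this JSON string would be POSTed to the server
```

On the response side, the extended `usage` block (#4648 #4520) is where the per-request multimodal and reasoning token counts are reported alongside the standard `prompt_tokens` and `completion_tokens` fields.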
Performance Optimizations
- Optimized the per_token_quant_fp8 operator, improving its performance by 50% #4238
- MTP supports Chunked Prefill and V1 KVCache scheduling #3659 #4366
- V1 KVCache scheduling supports context caching and is now the default configuration #3807 #3814
- Optimized MLA kernel performance, supporting a high-performance MLA kernel under auto chunk + CUDA Graphs #3886
- Reduced CPU synchronization overhead in the ViT module of Qwen-VL #4442
- Machete GEMM supports WINT4/WINT8 and group scale, and is now the default dense GEMM backend, improving model performance and accuracy #4451 #4295 #4121 #3999 #3905
- Optimized the performance of append attention preprocessing operators #4443 #4369 #4367
- Reimplemented reasoning-length truncation as a custom operator, making it more robust and standardized #4279 #4736
- Optimized sampling in multi-card scenarios on Intel HPU #4445
- Added the MergedReplicatedLinear method, supporting qkv_a_proj fusion for DeepSeek #3673
- Reduced DeepEP buffer memory usage; support creating/deleting the DeepEP buffer in EP scenarios #4039
- Mitigated the slowdown caused by DeepEP buffer clearing in centralized EP scenarios #4039
- Adapted speculative decoding to QK norm #3637
- Fixed undersized KV Cache capacity allocation #4355
- Engine-Worker cross-process communication supports zero-copy transfer of multimodal tensor data #4531
- APIServer supports gunicorn + uvicorn, reducing preprocessing overhead #4496 #4364
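For context on the per-token FP8 item above: per-token dynamic quantization derives one scale per token (row) from that row's absolute maximum, so an outlier in one token does not degrade every other token's precision. The following is a pure-Python sketch of the idea under stated simplifications; the real operator is a fused GPU kernel that emits true float8 values, and rounding to the non-uniform fp8 grid is omitted here for clarity:

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3

def per_token_quant_fp8(rows):
    """Per-token dynamic quantization sketch: one scale per row.

    Returns (quantized_rows, scales); values are kept as Python floats
    clipped to the fp8 dynamic range.
    """
    quantized, scales = [], []
    for row in rows:
        amax = max(abs(v) for v in row)
        scale = (amax / FP8_E4M3_MAX) or 1e-12  # guard all-zero rows
        quantized.append(
            [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in row]
        )
        scales.append(scale)
    return quantized, scales

x = [[0.1, -2.0, 3.5], [100.0, -50.0, 0.5]]
q, s = per_token_quant_fp8(x)
# Dequantization recovers the input (exactly here, since grid rounding
# is omitted); with real fp8 storage there would be small rounding error.
dequant = [[v * sc for v in row] for row, sc in zip(q, s)]
```

Each row's absolute maximum maps onto `FP8_E4M3_MAX`, which is what makes the scheme robust to cross-token magnitude differences.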
Multi-Hardware
- Kunlunxin P800
- MetaX C550
- Iluvatar CoreX
Documentation
- Added usage instructions for the terminal CLI commands #4569
- Added a graceful shutdown guide #3785
- Updated the supported models documentation #4754
- Added documentation and best practices for 2-bit quantization #3819 #3968
- Added DP parallel deployment documentation #3883
- Added deployment documentation for the ERNIE-4.5-VL model on Kunlunxin #4586
- Added deployment documentation for the PaddleOCR-VL model on XPU #4792
- Updated the model best practices documentation #3969
- Added ERNIE-4.5-21B-A3B-Thinking best practices documentation #3994
- Updated the metrics documentation #4061
- Updated the API parameter documentation, adding descriptions of `completion_tokens`, `prompt_tokens`, and `tool_calls` #4421
Bug Fixes
- Fixed Prefix Caching failing to deploy under DP parallelism #4359 #4370
- Fixed a KVCache scheduling hang on long inputs in centralized EP parallel deployments #4275
- Fixed a CUDA error 700 in the noaux_tc operator when CUDA Graphs is enabled #4174
- Fixed an incorrect TritonMoEBlockWiseFP8 weight shape under the V1 Loader #4384
- Fixed MoE preprocessing in EP scenarios and added validation of num_experts_per_rank #4102
- Fixed unstable CustomAllReduce output #4437
- Fixed the reasoning-length limit on Kunlunxin producing reasoning content with no response #4539 #4760
- Fixed residual KVCache management processes after abnormal inference exits #4410
- Fixed errors in some scenarios where ChunkedPrefill is enabled by default #3759
- Fixed a CudaError in DeepSeek models caused by the scheduling method #4757
- Fixed a bug with context caching enabled by default for multimodal models on XPU #4694
- Fixed model loading with MTP and C8 enabled #4077
- Fixed a bug where MLA enables TensorCore by default #4354
- Fixed repeated connection initialization in APIServer #3901
- Fixed mixed-up log paths in MultiAPIServer #3967
- Fixed multi-node tensor parallel deployment failures #4377
- Fixed Qwen-VL series models being unable to disable thinking #3808 #4762
- Fixed an incorrect `finish_reason` in non-streaming responses from the APIServer chat interface #4582
- Fixed a wrong reasoning end token in the ERNIE-4.5-VL model's ReasoningParser #4686
- Fixed the offline interface unexpectedly forcing `enable_thinking` to False #4248
- Fixed ERNIE-4.5-VL's handling of PNG images with transparent backgrounds #4847
- Fixed an error when rope3d is used with FA3 enabled #3791
- Fixed operator import errors on some hardware platforms #4559
- Fixed multiple issues when launching the inference service in PD-disaggregated EP parallel scenarios #4311 #4420 #4542 #4693 #4781
- Fixed inaccurate `num_requests_running`, `num_requests_waiting`, and `available_gpu_block_num` metrics #4404
- Fixed excessive trace spans in trace logs for streaming output #4375
- Fixed a dynamic C8 computation error #4119
- Fixed a bug in AppendAttention's custom-operator registration that caused dynamic/static graph inconsistency #4340
- Fixed placeholder handling errors for video and image data in Qwen-VL series preprocessing #4065
- Fixed wasted GPU memory in the model network construction #3854
- Fixed a bug in the reasoning-length limit under concurrent requests #4296
- Fixed an IPC signal read error in PD-disaggregated deployments #4309
- Fixed a shared-directory naming conflict for metrics #4007
- Fixed random precision issues in the Kunlunxin barrier #4181
- Fixed an exception when the reasoning-length limit exceeds its upper bound #4086
Others
- Fixed unit test failures on MetaX hardware #4027
- Fixed flaky failures of the test_get_save_output_v1 unit test #4732
- Added W4A8 unit test cases for Kunlunxin #4501
- Optimized Config code, removing redundant fields #4147 #4362 #4400
- Third-party libraries are now managed as git submodules #4033
- Added end-to-end monitoring for DeepSeek-V3-0324 #4360
- Renamed the ERNIE-4.5-VL continuation field `generated_token_ids` to `completion_token_ids` #4086
- The APIServer process now exits automatically and prints a notice to the terminal when a backend process exits abnormally #3271
- Added several observability metrics #3868
- Added performance unit tests for the Attention layer #4494
- Support hot reloading of model weights in DP+EP parallel scenarios #3765 #3803 #3898
- Support forcibly stopping inference requests in training scenarios #3601 #4402
- Fixed abnormal name mapping for Qwen3 models in training scenarios #4338 #4322
- Fixed the `max_streaming_response_token` parameter having no effect on streaming requests #3789
- Added a ZMQ-based channel for returning worker inference results to the Engine #3521
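To illustrate the semantics behind the `max_streaming_response_token` fix above: rather than emitting one streamed chunk per generated token, the server can buffer up to N tokens per chunk. A minimal sketch of that chunking behavior (the function and names are illustrative, not FastDeploy internals):

```python
def stream_in_chunks(token_iter, max_streaming_response_tokens=1):
    """Yield joined chunks of up to N tokens from a token stream.

    Illustrative sketch of the max_streaming_response_token semantics
    (#3789): with N=1 every token is flushed immediately; with larger N
    tokens are batched, trading latency for fewer streamed events.
    """
    buf = []
    for tok in token_iter:
        buf.append(tok)
        if len(buf) >= max_streaming_response_tokens:
            yield "".join(buf)
            buf = []
    if buf:  # flush any trailing partial chunk at end of generation
        yield "".join(buf)

chunks = list(stream_in_chunks(iter("abcdefg"), max_streaming_response_tokens=3))
# chunks == ["abc", "def", "g"]
```

The trailing flush is what the parameter bug class typically involves: the last, shorter-than-N chunk must still be delivered when the stream ends.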
What's Changed
- Add more runtime information to resource manager by @ming1753 in #3706
- Add CI cases by @ZhangYulongg in #3714
- Add loader test for mtp by @YuanRisheng in #3724
- fix typos by @co63oc in #3684
- add ci images build job by @XieYunshen in #3749
- [DOC] fix Document by @lizexu123 in #3782
- Update test_ernie_21b_mtp.py by @ZhangYulongg in #3783
- fix test_load_mtp by @co63oc in #3780
- [BugFix] Fix chunked prefill by @kevincheng2 in #3759
- [BugFix] fix max streaming tokens invalid by @ltd0924 in #3789
- [Feature] Setting number of apiserver workers automatically by @Jiang-Jia-Jun in #3790
- [Feature] mm and thinking model support structred output by @kevincheng2 in #2749
- [Feature] support model weight update in ep by @ltd0924 in #3765
- [BugFix] fix error of import paddle.base.core.Config by @yuanlehome in #3761
- [Executor] Fix bug of import paddle with RLHF by @gongshaotian in #3781
- rename speculate_stop_generation_multi_stop_seqs by @co63oc in #3743
- Modify mask_offset's format by @carryyu in #3525
- rename speculate_token_penalty_multi_scores.cu by @co63oc in #3735
- fix ce compile job by @XieYunshen in #3768
- [v1loader]Reduce EB300B model loading time by @bukejiyu in #3700
- 【Fix bug】Fix w4afp8's nblock to 256, and add a mask parameter to FA3's append attn by @yangjianfengo1 in #3771
- 【Hackathon 9th No.64】add test_draft_model_set_value_by_flags by @Echo-Nie in #3741
- [Feat] Support streaming transfer data using ZMQ by @Wanglongzhi2001 in #3521
- [BugFix] fix scheduler invalid by @ltd0924 in #3803
- rename fused_get_rope.cu by @co63oc in #3752
- 【Hackathon 9th No.84】Supplementary Unit Test for fastdeploy/reasoning by @Echo-Nie in #3570
- fix w8a8.py by @co63oc in #3733
- fix dcu_worker.py by @co63oc in #3734
- 【Hackathon 9th No.73】add unit tests for graph_opt_backend by @ooooo-create in #3609
- [XPU] FIX XPU CI BUG by @plusNew001 in #3829
- [Doc] update wint2 doc by @chang-wenbin in #3819
- fix test_append_attention_with_output.py by @carryyu in #3831
- [XPU] Update XPU CI case by @plusNew001 in #3837
- qk norm for speculate decode C16 by @rsmallblue in #3637
- [V1 Loader]V1 loader support EP by @YuanRisheng in #3801
- [Code Simplification] delete cum_offsets_out by @lizexu123 in #3815
- [Feature] `ernie4_5_vl_moe` support huggingface safetensor loading by @aquagull in #3750
- add reasoning parser plugin by @luukunn in #3811
- reopen ut by @XieYunshen in #3795
- Automatically configure workers based on max-num-seqs by @yyssys in #3846
- 【Hackathon 9th No.43、45】add speculate_get_padding_offset by @co63oc in #3730
- 【Hackathon 9th No.42】add test_speculate_get_output_padding_offset by @co63oc in #3740
- [XPU] Update XPU stable xvllm and xtdk version for 2.2 by @plusNew001 in #3853
- 【BUG FIX】Fixed moba single test port conflict by @yangjianfengo1 in #3863
- fix typo EngineSevice EngineService by @co63oc in #3841
- 【Hackathon 9th No.27】add test_get_padding_offset by @co63oc in #3708
- 【Hackathon 9th No.54、57】 add unit tests for per_token_quant and per_token_quant_padding by @ooooo-create in #3746
- [BugFix]add rollout config dp by @gzy19990617 in #3822
- Support extend block tables by @RichardWooSJTU in #3824
- 【Hackathon 9th No.34】add test_get_position_ids_and_mask_encoder_batch by @Echo-Nie in #3739
- 【Hackathon 9th No.63】add test_draft_model_postprocess.py by @co63oc in #3757
- [Feature] Set v1 scheduler as default in develop by @rainyfly in #3807
- fix response processors by @RichardWooSJTU in #3826
- support mtp rope_3d by @xiaoxiaohehe001 in #3791
- [Feature][MTP]support mtp in v1_scheduler mode by @freeliuzc in #3695
- Graceful shut down by @xiaolei373 in #3785
- Support for async processor added. by @sunlei1024 in #3869
- [CI] update paddleformers==0.2 in develop by @EmmonsCurse in #3878
- Update test_ernie_21b_mtp.py by @ZhangYulongg in #3885
- [BugFix] fix qwen vl processor by @ltd0924 in #3808
- [Docs] add data parallel by @ltd0924 in #3883
- 【Hackathon 9th No.35】add test_moe_redundant_topk_select by @Echo-Nie in #3867
- 【BugFix】fix gpu mem oom by @gzy19990617 in #3854
- 【Hackathon 9th No.32】add unit tests for group_swiglu_with_masked by @ooooo-create in #3748
- 【Inference Optimize】Update MergedReplicatedLinear for DSK qkv_a_proj_with_mqa. by @chang-wenbin in #3673
- [fix]load hadamard_block_size from config by @rsmallblue in #3797
- [Feature] support controller port in multi api server by @ltd0924 in #3898
- Compatible with EB 0.3B torch model arch by @ckl117 in #3913
- [Attention]clean_code by @zhoutianzi666 in #3917
- [Fix] mv connection_manager init by @ltd0924 in #3901
- add cache queue port by @ZhangYulongg in #3904
- rename eagle_get_base_model_hidden_states.cu by @co63oc in #3753
- [feature]Support model loading from cache by @bukejiyu in #3857
- ignore ci by @bukejiyu in #3950
- [Feature] add HTTP GET retry by @ApplEOFDiscord in #3838
- [XPU]Fixed the issue of performance degradation caused by enabling ENABLE_V1_KVCACHE_SCHEDULER by @iosmers in #3897
- [Bug fix] Fix prompt token ids dtype in v1 by @rainyfly in #3860
- supports dynamic Cfp8 by @carryyu in #3767
- Update sparse attn documentation by @yangjianfengo1 in #3954
- [Executor] Experiment Feature-Support Prefill in cudagraph by @littledgg in #3459
- [metrics] Add serveral observability metrics by @qwes5s5 in #3868
- [Docs] Update env docs for Machete by @Sunny-bot1 in #3959
- rename ep_moe_prefill_func ep_moe_expert_dispatch by @co63oc in #3938
- fix typos by @co63oc in #3951
- [Optimize]Error messages about Model api. by @AuferGachet in #3839
- 【Doc】Update WINT2 Doc Pic by @chang-wenbin in #3968
- Modify markdown by @xiaolei373 in #3896
- [docs] update docs by @yangjianfengo1 in #3975
- 【Hackathon 9th No.22】add unit tests for share_external_data by @ooooo-create in #3744
- 【Hackathon 9th No.68】supplementary unit test for ngram_match by @Echo-Nie in #3732
- 【Hackathon 9th No.44】add test_speculate_get_token_penalty_multi_scores.py by @co63oc in #3742
- 【Hackathon 9th No.69】add test_draft_model_preprocess by @co63oc in #3832
- 【Hackathon 9th No.60、62】add eagle_get_hidden_states by @co63oc in #3876
- 【Hackathon 9th No.66】add test_speculate_set_stop_value_multi_seqs by @co63oc in #3941
- 【Hackathon 9th No.36】add test_extract_text_token_output by @Echo-Nie in #3862
- [docs] update best practice docs by @zoooo0820 in #3969
- [XPU]Release2.2 update release note by @iosmers in #3986
- 【Doc】update dsk doc by @chang-wenbin in #3989
- update doc by @bukejiyu in #3990
- del batch id per token by @carryyu in #3963
- [Docs] update VL best_practices for release/2.2 by @ming1753 in #3965
- [CI] update ci by @ZhangYulongg in #3962
- [docs] add a3b-thinking doc by @zoooo0820 in #3994
- 【docs】update index.html and dockfile by @yangjianfengo1 in #3998
- 【FIX】Change the name of sparse attn from moba to plas by @yangjianfengo1 in #3845
- 【Fix】Change the name of sparse attn from moba to plas by @yangjianfengo1 in #3993
- 【docs】 update readme by @yangjianfengo1 in #4000
- Revert "【Fix】Change the name of sparse attn from moba to plas" by @Jiang-Jia-Jun in #4002
- Revert "【FIX】Change the name of sparse attn from moba to plas" by @Jiang-Jia-Jun in #4001
- get org_vocab_size from args by @zeroRains in #3983
- [V1 Loader]Ernie kv cache quant support v1 loader by @YuanRisheng in #3899
- [V1 Loader] Support V1 Loader for Machete by @Sunny-bot1 in #3999
- metrics shared folder naming by @zhuangzhuang12 in #4007
- [MoE] clean code by @zhoutianzi666 in #4020
- [BugFix] Fix the abnormal memory usage caused by shape errors in the triton moe backend by @yuanlehome in #4026
- [xpu] add ep custom ops by @zhupengyang in #3911
- [Feat] `ernie4_5_vl_moe` support CudaGraph by @aquagull in #3226
- [Executor] Adjust signal sending order in RL training by @gongshaotian in #3773
- 【Hackathon 9th No.28】add test_cutlass_fp8_fp8_fp8_dual_gemm_fused by @WanRui37 in #3935
- [Fix] fix multi api server log dir by @ltd0924 in #3967
- [MTP]support rope_3d in spec mode by @freeliuzc in #4034
- [Feature] Support zai-org/GLM-4.5-Air BF16 model by @ckl117 in #3928
- 【Inference Optimize】DeepSeek-V3-model MLA Optimize by @chang-wenbin in #3886
- 【Hackathon 9th No.55】add test_update_inputs_v1.py by @co63oc in #3992
- [docs] Update environment variables documentation by @bukejiyu in #3957
- [BugFix] qwen2.5vl enable_thinking=true and image_patch_id bug fix by @CSWYF3634076 in #3921
- fix import tests.utils error in tests/model_loader/test_load_mtp.py by @handsomecoderyang in #4027
- [setup optimize]Support git submodule by @YuanRisheng in #4033
- [CI] skip test_structured_outputs* temporarily by @EmmonsCurse in #4055
- update ci by @ZhangYulongg in #4064
- [BugFix] mm_post_fix by @xiaoxiaohehe001 in #4005
- [Echo] Support more types of prompt echo by @AuferGachet in #4022
- [Feature] add cli command chat,complete by @memoryCoderC in #4037
- [bug fix] Fix the placeholder in qwen prompt and add some unittests by @lddfym in #4065
- [Feature] GLM-45-AIR Support Mix Quantization(Dense wfp8afp8 and wint8 triton_moe_backend) by @ckl117 in #4051
- [Optimize] optimize prefix cache in develop by @rainyfly in #3890
- Add token processor plugin support by @RichardWooSJTU in #4059
- fix typos by @co63oc in #3840
- [metrics] update metrics markdown file by @qwes5s5 in #4061
- [CI] add multi api server test by @ltd0924 in #4049
- [Feature] refactor metax_gpu attention and moe and remove some useles… by @handsomecoderyang in #3688
- 【Hackathon 9th No.25】add test_fused_get_rotary_embedding by @Echo-Nie in #3892
- 【Hackathon 9th No.78】add test_chat.py by @co63oc in #3958
- [BugFix]Fix load kv cache quant scale by @YuanRisheng in #4077
- [format] Valid para format error info by @xiaolei373 in #4035
- [BugFix] Fix `image_feature` 0-Size causing insert failed by @aquagull in #4042
- fix(CE): update concurrency to stop CE tasks from canceling each other by @XieYunshen in #4083
- Support offline inference with streaming output by @xyxinyang in #4071
- 【FastDeploy CLI】collect-env subcommand by @qwes5s5 in #4044
- [Bug Fix]fix the bug for cache_messager signal loss by @zeroRains in #3879
- 【Hackathon 9th No.61、65、41】add test_draft_model_update by @co63oc in #3940
- 【Hackathon 9th No.49】add test_pre_cache_len_concat by @Echo-Nie in #3847
- [Optimize] Support WINT8 and group scale for Machete by @Sunny-bot1 in #3905
- [v1 loader]qwen Offline fp8 by @bukejiyu in #4036
- [xpu] support ep by @zhupengyang in #4067
- [CUDAGraph] Support multi output buffers and merge some fixes from feature/exp_0908 by @yuanlehome in #4062
- [MTP]update hybrid-mtp-with-ngram by @freeliuzc in #4047
- [MTP]Develop mtp reshard by @freeliuzc in #4099
- [BugFix]Fix Ernie bf16 model loading bug and add comments by @bukejiyu in #4106
- fix typos by @co63oc in #4093
- [BugFix]Fix key mismatch when load mtp by @YuanRisheng in #4105
- [submodule] add ignore=all for deepgemm by @yuanlehome in #4118
- [BugFix] Fix EP MoE expert dispatch function by @Sunny-bot1 in #4102
- 【Hackathon 9th No.37】add test_top_k_renorm_probs by @Echo-Nie in #3755
- 【Hackathon 9th No.52】add test_dynamic_per_token_scaled_fp8_quant by @co63oc in #4015
- [CE]add plas attention config by @tianlef in #4128
- Addcase by @DDDivano in #4112
- [benchmark]add lite-vl and x1 yaml by @xiegegege in #4130
- [Doc][CE]x1_a3b server config by @tianlef in #4135
- ci: Increase compilation task time limit by @XieYunshen in #4098
- fix dynamic Cfp8 computing error by @rsmallblue in #4119
- [Feature] Set prefix caching as default by @rainyfly in #3814
- Update test_w4a8_model.py by @ZhangYulongg in #4125
- mv test to tests by @XieYunshen in #4129
- [FDConfig]Remove max_num_batched_tokens/max_num_seqs in parallel config by @YuanRisheng in #4116
- [BugFix] Forbid `FD_DISABLED_RECOVER` while `ENABLE_V1_KVCACHE_SCHEDULER` by @Jiang-Jia-Jun in #4142
- Reconstruct streaming data transfer with zmq by @RichardWooSJTU in #3836
- Print KV Cache available memory and block memory usage in GB format by @qw86972190 in #4148
- [BugFix]Fix test_prefix_cache by @YuanRisheng in #4155
- [NewFeture]add ep rollout model init and update/clear ep buffer by @gzy19990617 in #4039
- [CI] enhance clean port strategy by @EmmonsCurse in #4152
- [Feature] Support mixed deployment with yiyan adapter in develop by @rainyfly in #3976
- Add param valid log by @xiaolei373 in #4113
- [FastDeploy CLI]collect-env unitest bug fix by @qwes5s5 in #4159
- [Optimize] Machete uses group scale by default by @Sunny-bot1 in #4121
- Bugfix test exception by @xiaolei373 in #4171
- Each module should have its own plugins_loaded by @yuanlehome in #4164
- [Logprob] EP support logprob by @ckl117 in #4151
- [fix]Modify follow-up push parameters and Modify the verification method for thinking length by @luukunn in #4086
- [FDConfig]Remove splitwise_role and engine_worker_queue_port in FDConfig by @YuanRisheng in #4147
- 【Hackathon 9th No.46】add test_fused_rotary_position_encoding by @Echo-Nie in #3848
- [Bug fix] fix request assign by @RichardWooSJTU in #4163
- [TEST] init first commit by @cqulilujia in #4192
- fix nul by @co63oc in #4191
- [BugFix]fix glm all_reduce tp group by @ckl117 in #4187
- [Feature] support pool by @lizexu123 in #3827
- fix typos by @co63oc in #4176
- 【Hackathon 9th No.30】add test_tritonmoe_preprocess by @Echo-Nie in #3891
- [Feature] Support pd ep deployment with yiyan adapter by @rainyfly in #4029
- 【Hackathon 9th No.40】add test_top_p_candidates by @co63oc in #4046
- 【Hackathon 9th No.26】add test_set_value_by_flags_and_idx.py by @Echo-Nie in #4186
- [FD CLI] Add bench cli by @ZhangYulongg in #4160
- [Iluvatar GPU] Optimize attention performance and fix moe load ckpt e… by @wuyujiji in #3651
- Remove useless code by @Jiang-Jia-Jun in #4195
- [Feature] support clear data by @ltd0924 in #3601
- [XPU]change xpu ci model by @plusNew001 in #4117
- 【FIX】Change the name of sparse attn from moba to plas (#4006) by @yangjianfengo1 in #4076
- [XPU] update XPU CI by @plusNew001 in #4209
- [XPU] Update run_ci_xpu.sh to lock xvllm version by @plusNew001 in #4210
- [xpu] use cpu barrier by @zhupengyang in #4181
- [Feature] support qwen3-embedding model load by @lizexu123 in #4202
- Fix noaux_tc cuda Error 700 in CUDAGraph by @ckl117 in #4174
- [v1 loader]code style by @bukejiyu in #4204
- [Test]add glm45_air logprob test and rollout model by @ckl117 in #4175
- [XPU] Enable XPU V1 mode based on environment variable by @yyssys in #4213
- register_model_class compatible with plugins by @yuanlehome in #4236
- 【Hackathon 9th No.24】add rebuild_padding by @co63oc in #4107
- [Intel HPU] Support intel hpu platform by @fmiao2372 in #4161
- [CUDAGraph] [FIX] Fix CUDA error(700): 'cudaErrorIllegalAddress' in CascadeAppendW… by @YuhanXu in #4218
- [BugFix]fix v1 loader moe bf16, and supoort dynamic_load_weight create quant param by @ckl117 in #4229
- [BugFix] fix qwen3-embedding model tp>1 by @lizexu123 in #4223
- [Bug Fix] disable prefix caching in mm model by @ApplEOFDiscord in #4167
- [Feature] add cli command serve by @memoryCoderC in #4226
- [OPs] MoE support wfp8afp8(channelwise) and improve per_token_quant_fp8 by @ckl117 in #4238
- [fix]update apply_chat_template by @luukunn in #4137
- [Model] Qwen2.5VL support --use-cudagraph and unit testing by @CSWYF3634076 in #4087
- [CUDAGraph]CUDA Graph support unique memory pool by @gongshaotian in #4230
- [BugFix]fix the bug for prefilled_step_idx signal of cache_messager in cudagraph and PD by @zeroRains in #4235
- 【Hackathon 9th No.21、23】add unit tests for fused_hadamard_quant_fp8, moe_fused_hadamard_quant_fp8 by @ooooo-create in #4094
- [XPU] support XPU VL model inference by @cqulilujia in #4030
- delete moe_phase in parallel_config(Moved to model_config) by @yuanlehome in #4264
- Support limit thinking lengths. by @K11OntheBoat in #4069
- [Docs]Add ENABLE_V1_KVCACHE_SCHEDULER=0 to docs by @yyssys in #4268
- 【fix】Remove the logic that assigns the default value of 80% to reasoning_max_tokens in the offline component of FastDeploy. by @kxz2002 in #4248
- [Feature] add config api by @memoryCoderC in #4254
- [CI] fix base_test error temporarily by @EmmonsCurse in #4283
- [Supplements and upgrades]Improvement of X1 parsers by @AuferGachet in #4172
- fix ernie vl distributed attr. by @ZHUI in #4215
- [Doc]add glm benchmark yaml by @tianlef in #4289
- Add cli run batch by @xiaolei373 in #4237
- Add speculative decoding approval check by @Deleter-D in #4284
- Set approve checking for config.py, worker, model and cudagraph by @zeroRains in #4276
- [Docs]The XPU model loader uses the default version by @yyssys in #4292
- increase ccache size by @XieYunshen in #4255
- [Feature] deepgemm pre-compile tool support mixed parallel by @Deleter-D in #4282
- fix typos by @ccsuzzh in #4274
- [fix]remove reasoning_max_tokens=max_toksns*0.8 in sampling_params by @luukunn in #4277
- [Bug fix] Fix bug for running ep by @rainyfly in #4245
- Fix wrong batch size of thinking_mask by @K11OntheBoat in #4296
- [BugFix] Increase the conditions for the use of a Machete: not pre-quant by @Sunny-bot1 in #4295
- [XPU] fix VL thinking mode by @cqulilujia in #4266
- [feat] support prefix cache clearing when `/clear_load_weight` is called by @liyonghua0910 in #4008
- add_cli_tokenizer by @xiaolei373 in #4278
- [fix] fix gpu_cache_kvs key by @liyonghua0910 in #4311
- 【Feature】ResourceManagerV1 support need block num notifying by @RichardWooSJTU in #4220
- Fix bugs of splitwise_complete_prefilled_step IPCsignal clear by @K11OntheBoat in #4309
- [Metax] support cutlass moe & optimize flash attention & fix triton moe by @xiaozude in #4208
- [NewFeature]custom_allreduce support cudagraph recapture by @ckl117 in #4305
- [FIx] CI Approve fix by @zeroRains in #4316
- [BugFix]remove redundant includes by @fangfangssj in #4312
- [Bug fix]revert worker process ipc signal suffix to fix ep by @RichardWooSJTU in #4323
- 【Inference Optimize】Support MLA_CACHE & Fix V1_Schedule Bug by @chang-wenbin in #4318
- 【Fix】update docs by @yangjianfengo1 in #4339
- [Doc] Update xpu fastdeploy version to 2.2.1 by @yyssys in #4338
- 【Bug-Fix】schedule_bugfix by @chang-wenbin in #4336
- [Executor]CUDAGraph support Speculate Decode by @gongshaotian in #3769
- Remove redundant inplace outputs for `append_attention` by @SigureMo in #4340
- 【Inference Optimize】MLA Tensor-Core is enabled by default by @chang-wenbin in #4335
- 【Hackathon 9th No.86】autogen `MultiQueryAppendC8Attention` template_instantiation - part by @ccsuzzh in #4330
- [XPU] Support W4A8C8-TP4-300B Model by @iosmers in #4068
- supports spec dynamic cfp8 by @carryyu in #4290
- [FastDeploy Cli] Bench Command eval and throughput by @qwes5s5 in #4239
- add release images build job by @XieYunshen in #4265
- 【Bug Fix】mla enables tensorcore by default by @chang-wenbin in #4354
- 【Inference Optimize】Calculate paddle_peak_increase using paddle_allocated_mem_after_run by @chang-wenbin in #4355
- 【Add CI】Add DeepSeek model end-to-end CI by @chang-wenbin in #4360
- [Feature] support prefix cache in DP by @ltd0924 in #4359
- 【BugFix】fix qwen3moe name_mapping config by @gzy19990617 in #4348
- [MTP]support more branchs in topp kernel by @freeliuzc in #4352
- [FDConfig]Remove max_model_len in FDConfig by @YuanRisheng in #4350
- [XPU] fix XPU CI bug by @plusNew001 in #4358
- [Doc] fix document navigation link paths by @yyssys in #4368
- 【Hackathon 9th No.20】add unit tests for masked_per_token_quant by @ooooo-create in #4111
- [Doc] fix the port conflict issue in the usage example by @EmmonsCurse in #4379
- [Optimization] Fuse get_max_len and get_kv_max_len by @Sunny-bot1 in #4369
- [CI] fix diff_error temporarily by @EmmonsCurse in #4390
- 【Hackathon 9th No.67】add speculate_verify by @co63oc in #4326
- [Doc]add x1 a3b quantization yaml by @tianlef in #4397
- [Doc] fix offline inference doc by @ApplEOFDiscord in #4412
- [Docx] add PaddlePaddle nightly build address for GPU by @yangjianfengo1 in #4414
- [CI] Fix partial instability issues by @EmmonsCurse in #4418
- [benchmark] Update benchmark tools by @ZhangYulongg in #4416
- [Optimization] Optimize split_q_block kernel by @Sunny-bot1 in #4367
- [XPU] fix ep by @zhupengyang in #4393
- [BugFix] fix multinode bugs by @ltd0924 in #4377
- [fix] Fixed the issue of excessive/redundant spans being returned for streaming requests. by @qwes5s5 in #4375
- [fix] fix requests & block metrics by @liyonghua0910 in #4404
- [MTP]support mtp chunk_prefill_v1 by @freeliuzc in #4366
- 【BugFix】fix block_wise_fp8_v1_loader_moe_shape by @ckl117 in #4384
- 【Fix CI Bug】Fix ci bug by @chang-wenbin in #4413
- Disable gcu ci by @tianshuo78520a in #4427
- V1 loader default by @bukejiyu in #4251
- [BugFix] fix workers=1 by @ltd0924 in #4364
- [XPU] fix VL multi-batch accuracy issue by @cqulilujia in #4394
- [BUGFIX] clear request #4286 by @ltd0924 in #4402
- fix param by @freeliuzc in #4419
- Feature:Add support for Pooling Model Embedding and provide an OpenAI-compatible API. by @sunlei1024 in #4344
- [benchmark] Update benchmark_serving.py by @ZhangYulongg in #4438
- [BugFix] fix config bugs by @ltd0924 in #4370
- [XPU] support prefix cache by @ddchenhao66 in #4423
- [Bug fix] Fix pd for x1 thinking by @rainyfly in #4433
- [Bugfix]fix ep clear buffer perf by @gzy19990617 in #4389
- [XPU] moe support VL 0-size input by @cqulilujia in #4408
- [benchmark] Fix benchmark duration calculation logic by @ZhangYulongg in #4446
- [benchmark] Add filtering for failed requests in benchmark outputs by @ZhangYulongg in #4448
- perf: optimize ZMQ communication with async queue and single-threaded… by @sunlei1024 in #4444
- [BUG] fix ep bug by @kevincheng2 in #4275
- 【Hackathon 9th No.86】autogen `MultiQueryDecoderAttention` template_instantiation - part by @ccsuzzh in #4383
- [benchmark] Update benchmark tools by @ZhangYulongg in #4454
- 【Fix】 remove text_after_process & raw_prediction by @LiqinruiG in #4421
- [Intel HPU] Enable dist sampler on intel hpu platform by @JianyuLi01 in #4445
- [xpu] refine fused_moe by @zhupengyang in #4219
- [SOT][CUDAGraph] Add support for custom all-reduce operators under SOT mode by @DrRyanHuang in #4386
- [FDConfig]Remove total_block_num/dtype/block_size/enc_dec_block_num in ParallelConfig by @YuanRisheng in #4400
- [bugfix] kill cache_transfer_manager process by @xiaolei373 in #4401
- [BugFix]Fix wfp8afp8 triton moe group_topk renormalized=True by @ckl117 in #4449
- [Comments] Modify comments to English by @xiaolei373 in #4460
- [FDConfig]Remove reasoning_parser/guided_decoding_backend/disable_any_whitespace/device_ids in FDConfig by @YuanRisheng in #4362
- [SOT][Cudagraph] Remove BreakGraph of #3302 && update CustomOp by @DrRyanHuang in #3694
- [Optimization] Put get_block_shape_and_split_kv_block in cuda graph for append attention backend by @Sunny-bot1 in #4443
- [Others] add PR Template by @zeroRains in #4452
- [Optimize] Set preempted schedule log as info level by @rainyfly in #4453
- [SOT] Change warnings to errors and remove fallback operations by @DrRyanHuang in #4378
- [BugFix]Dev fix custom ar unstable result by @ckl117 in #4437
- [CINN] Remove the restriction of automatically falling back to SOT after enabling CINN by @DrRyanHuang in #4411
- [Feature] support pooling model dummy_run by @lizexu123 in #4345
- [XPU] abstract a hardware-agnostic operator wrapper for prefix cache and specify xpu device id definition by @ddchenhao66 in #4455
- [CI] Fix partial instability issues by @EmmonsCurse in #4461
- [Docs]add environment variables to environment_variables.md by @xiaolei373 in #4466
- [Perf][Qwen25_VL] Avoid triggering CPU synchronization in ViT's attn forward by @aquagull in https://github.com/PaddlePaddle/FastDeploy/pull/4442
- [Docx] add language (en/cn) switch links by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/4470
- [Iluvatar GPU] Adapt VL model by @wuyujiji in https://github.com/PaddlePaddle/FastDeploy/pull/4313
- [Loader]check paddle version for v1 loader by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4473
- [XPU]Fix w4a8 precision bug && rollback moe algo by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/4463
- [Docx] fix the broken link by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/4479
- LLM.chat add "tools" param by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4415
- 【feature】support n parameter by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4273
- [ATTN]delete code and add ffn and moe layer level test by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4440
- [CI] Handle unit test issues by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4483
- [Graph Optimization][Speculative Decoding] Fix the bug of CUDAGraph + MTP + EP by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4456
- Optimization of 'tools' in request fields by @AuferGachet in https://github.com/PaddlePaddle/FastDeploy/pull/4380
- [Metax] adjust mctlass moe api by @handsomecoderyang in https://github.com/PaddlePaddle/FastDeploy/pull/4474
- Support GPT-OSS-BF16 by @Limerances in https://github.com/PaddlePaddle/FastDeploy/pull/4240
- [Feature] support mtp logprob by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/4464
- [Loader]Qwen2.5-Math-PRM-7B and Ernie-VL-RM by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/4319
- [fix] remove cache tensor creation for cache_transfer_manager by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/4420
- [Benchmark] update benchmark scripts by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/4497
- [XPU]Fix w4a8 garbled code issue by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4493
- [XPU] Fix vl multi-card allreduce bug by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/4485
- [BugFix][CI] Clean up SOT code cache using `tearDown` in CINN unitest by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/4491
- Optimizing the performance of think length limit using custom operators by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/4279
- [APIServer] support define gunicorn timeout by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/4496
- [CI] update ernie-4_5-vl baseline by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4495
- 【BugFix】fix ep buffer clear by @gzy19990617 in https://github.com/PaddlePaddle/FastDeploy/pull/4450
- [XPU] bind block_attn kernel with pybind by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/4499
- [Executor] Default use CUDAGraph by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/3594
- [Speculative Decoding] Add draft_logprobs Support for Speculative Decode MTP by @sunlei1024 in https://github.com/PaddlePaddle/FastDeploy/pull/4467
- 【CI】Add test cases for n parameter and streaming validation by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/4503
- [Feature] Support mm model close prefix cache by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/4459
- [Doc]add deepseek wint4 ce by @tianlef in https://github.com/PaddlePaddle/FastDeploy/pull/4517
- [FDConfig]Turn on the CUDAGraph + Speculative Decoding switch by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4511
- Add comprehensive unit tests for limit_thinking_content_length operators by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/4510
- [XPU]add xpu ci ep case by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4432
- feat: add post-processing step for pool_output by @sunlei1024 in https://github.com/PaddlePaddle/FastDeploy/pull/4462
- [FDConfig]Turn on the CUDAGraph + MultiModel switch by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4512
- [Iluvatar GPU] fix ci error caused by rebuild_padding param and cuda graph by @wuyujiji in https://github.com/PaddlePaddle/FastDeploy/pull/4504
- enhance set_stop_value_multi_ends and standardize the registration of some operators by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/4525
- [CI] Remove redundant .coveragerc file and fix by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4521
- [XPU]Modify the xpu memory display unit of log by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4534
- [Feature] support fd return decode response by @zhuangzhuang12 in https://github.com/PaddlePaddle/FastDeploy/pull/4407
- [Feature] Support AsyncLLM by @xyxinyang in https://github.com/PaddlePaddle/FastDeploy/pull/4458
- [CI] stable test_rollout_model.py by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/4536
- small change in test_fusedmoe.py by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4538
- c++ code format by @zhupengyang in https://github.com/PaddlePaddle/FastDeploy/pull/4527
- [XPU] Change XPU stable third-party version and add time-out by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4524
- [FDConfig]Turn on the CUDAGraph + RL switch by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4508
- [Others]Delete useless code by @YuanRisheng in https://github.com/PaddlePaddle/FastDeploy/pull/4544
- [BugFix]Fix finish reason by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/4543
- [Doc]fix deepseek ce by @tianlef in https://github.com/PaddlePaddle/FastDeploy/pull/4560
- [XPU] xpu support think length limit by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4539
- [CI] Optimize coverage upload reporting by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4547
- WINT4/WINT8 dense gemm default use Machete by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/4451
- [XPU] merge apply_tp, ops support token_num = 0 by @zhupengyang in https://github.com/PaddlePaddle/FastDeploy/pull/4507
- [BugFix] Fix decode_type which has been deleted in req and optimize token client retry scheme by @RichardWooSJTU in https://github.com/PaddlePaddle/FastDeploy/pull/4564
- [Graph Optimization] Support CUDAGraph Padding + MTP by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4545
- [FDConfig]Turn on the CUDAGraph + PD Disaggregation switch by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4530
- [KVCache] Support Static C8 by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/4568
- [Metax] adapt DeepSeek by @xiaozude in https://github.com/PaddlePaddle/FastDeploy/pull/4498
- [BugFix] fix create_cache_tensor for ep by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/4542
- [EP] fix adapter bugs by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/4572
- [XPU]fix v1 hang bug by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/4573
- [BugFix] fix import image_ops error on some platforms by @zoooo0820 in https://github.com/PaddlePaddle/FastDeploy/pull/4559
- [CLI]Update parameters in bench latency cli tool and fix collect-env cli tool by @qwes5s5 in https://github.com/PaddlePaddle/FastDeploy/pull/4558
- [Graph Optimization] Add dy_runnable and introduce cudagraph_switch_threshold for cudagraph mode switching by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/4578
- [XPU]Moe uses a new operator by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4585
- [Feature] Support Paddle-OCR by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4396
- [DataProcessor] add reasoning_tokens into usage info by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4520
- perf: Optimize task queue communication from engine to worker by @sunlei1024 in https://github.com/PaddlePaddle/FastDeploy/pull/4531
- [CI] Clean up ports after processing results by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/4587
- [CI] Add /re-run command in PR comments to restart failed CI workflows by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4593
- [Others] api server exits when worker process is dead by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/3271
- [XPU] bind some OPs for VL model with pybind by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/4522
- [V1 loader] Qwen25 VL support v1 loader and torch style safetensors load by @CSWYF3634076 in https://github.com/PaddlePaddle/FastDeploy/pull/4388
- [Feature] Support logprobs_mode by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4567
- [CI] Fix path error of /re-run by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4606
- [Feature] mm support prefix cache by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/4134
- [Graph Optimization]1.fix the bug of draft model with ep 2.fix sampler bug by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4589
- [XPU] update kunlun doc about supported models by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4586
- Adapt the benchmark tool to the SGLang framework by @ophilia-lee in https://github.com/PaddlePaddle/FastDeploy/pull/4607
- [Unitest]Add unitest of Attention Layer by @K11OntheBoat in https://github.com/PaddlePaddle/FastDeploy/pull/4494
- remove dev sync in prefill by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4598
- [BugFix] fix offline stream output when set enable_thinking param by @xyxinyang in https://github.com/PaddlePaddle/FastDeploy/pull/4603
- [BugFix] PaddleOCR-VL fix FD_DEBUG type and support v1 loader by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4605
- [Feature] EngineWorkerQueue anonymous port by @ST-XX in https://github.com/PaddlePaddle/FastDeploy/pull/4597
- [Docs] Add cli usage to docs by @xiaolei373 in https://github.com/PaddlePaddle/FastDeploy/pull/4569
- [CI] fix run-batch port from env in unittest by @xiaolei373 in https://github.com/PaddlePaddle/FastDeploy/pull/4613
- [CI] Relocate server test cases from ci_use directory to e2e by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4608
- [Graph Optimization][Speculative Decoding] Update yaml and fix typo by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4612
- Extend sleep time to 10 seconds in switch_service by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/4618
- [Speculative Decoding][MTP] Support MTP in EP+DP+TP mode by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/4614
- [BugFix] fix TPDP mix parallel infer by @lizhenyun01 in https://github.com/PaddlePaddle/FastDeploy/pull/4583
- [Graph Optimization] Fix IR graph dependency error exposed after enabling SOT by updating the return value of TextImageGatherScatter by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/4610
- [XPU]add xpu ci w4a8 case by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4501
- [CI][BugFix] fix port conflicts in concurrent ci test and add more unit test on async_llm by @xyxinyang in https://github.com/PaddlePaddle/FastDeploy/pull/4616
- [CI] Revert directory change of test_rollout_model due to intermittent failures by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4626
- feat: add support for API usage with multimodal models by @sunlei1024 in https://github.com/PaddlePaddle/FastDeploy/pull/4548
- [XPU] Support PaddleOCR-VL model for XPU by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/4529
- [Graph Optimization] Refactor default capture list by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4617
- [BugFix] fix paddleocr prefix cache bug by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4625
- [BugFix] fix import jit.marker.unified by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/4622
- add einops dependency by @zhang-prog in https://github.com/PaddlePaddle/FastDeploy/pull/4633
- [Feature] support logits processors by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/4515
- [Feature] support reward api by @xiaolei373 in https://github.com/PaddlePaddle/FastDeploy/pull/4518
- [BugFix] fix total_block_num init error in worker_process by @RichardWooSJTU in https://github.com/PaddlePaddle/FastDeploy/pull/4553
- [BugFix] Fix graph opt test case by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4634
- [Feature] add mm token usage by @ApplEOFDiscord in https://github.com/PaddlePaddle/FastDeploy/pull/4570
- [XPU] Update the return value of TextImageGatherScatter by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4636
- [Docs] Add PaddleOCR-VL-0.9B best practices by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4658
- [XPU] fix pos_emb_type bug by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/4638
- [Docs] add Qwen25vl yaml by @xjkmfa in https://github.com/PaddlePaddle/FastDeploy/pull/4662
- [Feature] add a new reasoning parser by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4571
- [XPU] [CI] Increase pytest timeout for XPU ep test by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4665
- add noaux_tc to unitest fused_moe by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4656
- [EP] fix several bugs in data parallel by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/4657
- [OP] Add InferShape&InferDtype for `per_token_quant_padding` by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/4667
- 【Hackathon 9th No.86】autogen `MoeFastHardamardImplWrapper` template_instantiation by @ccsuzzh in https://github.com/PaddlePaddle/FastDeploy/pull/4592
- [UT] Add ut for speculative sampler by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/4650
- [Doc] update docs by @ApplEOFDiscord in https://github.com/PaddlePaddle/FastDeploy/pull/4675
- [Graph Optimization] Add the CUDAGraph usage switch for Draft Model by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4601
- [CI] Add test for paddleocr_vl by @Limerances in https://github.com/PaddlePaddle/FastDeploy/pull/4627
- [unitest]add real gate_correction_bias weight to mock real data dispatch by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4676
- [noauxtc_kernel] remove useless code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4643
- [BugFix] fix offline llm chat "enable_thinking" is always "False" by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4686
- [BugFix] fix total_block_num init error in worker_process and test_async_llm not throw error by @xyxinyang in https://github.com/PaddlePaddle/FastDeploy/pull/4687
- [BugFix] fix --logprobs-mode raw_logits by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4681
- [XPU] xpu currently disable prefix cache for VL model by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4695
- [XPU] [CI] Add Vl case by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4649
- [BugFix] Fix finish reason in _create_chat_completion_choice by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4582
- [Feature] Unify the registration name recognition for tool_parser and reasoning_parser to “-” by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4668
- [BugFix] fix unittest of get_save_output_v1 by @Wanglongzhi2001 in https://github.com/PaddlePaddle/FastDeploy/pull/4701
- [XPU] [CI] Lock xvllm version by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4715
- [Graph Optimization] SOT+CUDAGraph support ERNIE4.5T VL 28B / 424B by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/4645
- [Feature] support mtp distribution equivalence verification by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/4699
- [KVCache] Support kv cache scale load by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/4624
- add flops and bandwidth to test_ffn.py by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4704
- Benchmark tool supports specifying response_format for constrained decoding scenarios by @ophilia-lee in https://github.com/PaddlePaddle/FastDeploy/pull/4718
- [CI] add missing unit tests for tokenizer_cli by @xiaolei373 in https://github.com/PaddlePaddle/FastDeploy/pull/4620
- [Scheduler] update v1 prefill batch by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/4611
- [BugFix] Fix profile run in pd-disaggregated deployment by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/4584
- [BugFix] fix mm prefix_cache cuda error bug by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/4679
- [Feature] Check bos url by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/4711
- [BugFix] fix wint2 config by @chang-wenbin in https://github.com/PaddlePaddle/FastDeploy/pull/4721
- [FDConfig] [PD Disaggregation] [Graph Optimization] Close Cudagraph for P node when PD Disaggregation by @littledgg in https://github.com/PaddlePaddle/FastDeploy/pull/4632
- [XPU] xpu support neox style ROPE by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4719
- [BugFix] Skip building native architecture when specifying arch list by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4727
- fix noaux by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4731
- [BugFix] fix thinking bug by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/4710
- [CI] Fix rollout_model test logic by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4730
- [Feature] support pooling model runner by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/4590
- format code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4720
- [CI] fix some ci yaml by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4747
- [Docs]Update XPU document version to 2.3.0 by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4741
- [Speculative Decoding][MTP] Support MTP in splitwise and scheduler_v1 mode by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/4743
- [Speculative Decoding][MTP]Support attn mask offset by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/4641
- [Docs]Add parameter to the start service command by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4753
- [Docs]Add parameter by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4755
- [Docs] fix PaddleOCR-VL docs bug by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4702
- [Feature] Support eplb for fd by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/4599
- [XPU] add v1 support for bf16 by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/4744
- [DataProcessor] add options thinking_mode by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/4735
- [Optimize] Support and robust for tpN for PD by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/4595
- [Docs] fix error by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4768
- [CI]test common model by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/4697
- [Metax] adapt cutlass moe for ernie-vl by @neilzhuu in https://github.com/PaddlePaddle/FastDeploy/pull/4685
- fix dynamic Cfp8 for RL load by @rsmallblue in https://github.com/PaddlePaddle/FastDeploy/pull/4144
- [Docs] PaddleOCR-VL add RTX3060 server param by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4765
- [BugFix] fix deepseek cuda error by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/4739
- [XPU][CI] fix ci base value bug by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4783
- [OP]Fix attn_params by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/4787
- [CI]delete test_common_model by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/4794
- [XPU] fix thinking bug where output only contains reasoning_content by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4761
- [XPU] add deployment doc for PaddleOCR-VL in XPU by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/4784
- [BugFix] Fix ernie4_5_vl_processor.py and qwen_vl_processor.py can not disable thinking by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4762
- supports internode_ll_two_stage by @carryyu in https://github.com/PaddlePaddle/FastDeploy/pull/4162
- supports pd partn by @carryyu in https://github.com/PaddlePaddle/FastDeploy/pull/4615
- [Docs] Add new support models by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4801
- [CI] Refactor CE wheel upload for multiple target paths by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4790
- [Docs] update mkdocs.yml by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/4804
- [BugFix] Fix step_shm_value in PD disaggregated deployment by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/4780
- Update Unit Test for PaddleOCR-VL by @Limerances in https://github.com/PaddlePaddle/FastDeploy/pull/4802
- [Metax] adapt cutlass moe and fix mla attention for DeepSeek by @xiaozude in https://github.com/PaddlePaddle/FastDeploy/pull/4602
- [Feature][Executor] GPU Model Runner Supports prompt_logprobs and max_logprobs by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4769
- [get_padding_offset.] clean get_padding_offset.cu by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4777
- support ep+tp at op layer by @zhupengyang in https://github.com/PaddlePaddle/FastDeploy/pull/4688
- [BugFix] fix reasoning parser register name by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4795
- remove input_ids from ForwardMeta by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4793
- [Feature] Add timestamp for profiler by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/4726
- [XPU]Support V1 loader in weight_only Model by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/4808
- [Bug Fix] process transparent image by @ApplEOFDiscord in https://github.com/PaddlePaddle/FastDeploy/pull/4807
- add paddleocr_vl benchmark by @zhang-prog in https://github.com/PaddlePaddle/FastDeploy/pull/4833
- [Doc] Update docs for v2.3.0rc0 by @Jiang-Jia-Jun in https://github.com/PaddlePaddle/FastDeploy/pull/4828
- [BugFix] fix messages being inplace modified in offline chat api by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/4831
- 【New Feature】W4afp8 supports per group quantization by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/4272
- [CI] fix docker_build error and add tag-base by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4810
- [PD Disaggregation] Support Qwen3-MoE use PD + EP inference. by @K11OntheBoat in https://github.com/PaddlePaddle/FastDeploy/pull/4691
- remove seq_lens_this_time by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4821
- [BugFix] Fix ernie_vl_reasoning_parsers.py 'end_token' to 'think_end_token' by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4805
- Fix: ci port conflict by @sunlei1024 in https://github.com/PaddlePaddle/FastDeploy/pull/4840
- [CI] Add unittest for activation, native_paddle_backend, w4a8, w4afp8, platforms/utils by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/4812
- [XPU][CI]Change ci vl model to 28 b by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4764
- [Fix] fix `ernie4_5_vl` model torch format loading by @aquagull in https://github.com/PaddlePaddle/FastDeploy/pull/4447
- [Feature] [PD] add simple router and refine splitwise deployment by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/4709
- [Docs] fix: correct typo in nvidia_gpu.md by @playaswd in https://github.com/PaddlePaddle/FastDeploy/pull/4848
- [BugFix] Fix list to List by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/4818
- [BugFix] Del get_act_fn, _load_st_projector by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/4824
- [Benchmark] Enhance benchmark output logging by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/4682
- [XPU] ep+tp all2all by @zhupengyang in https://github.com/PaddlePaddle/FastDeploy/pull/4836
- [CI] Add Check PR Template by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4481
- Revert "【New Feature】W4afp8 supports per group quantization" by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4854
- [CI] Update deploy.py by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/4850
- [CI] Optimize port cleanup logic by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4860
- [Bug Fix] fix ernie4_5_vl_moe by @LokeZhou in https://github.com/PaddlePaddle/FastDeploy/pull/4843
- Revert "[Bug Fix] fix ernie4_5_vl_moe" by @Jiang-Jia-Jun in https://github.com/PaddlePaddle/FastDeploy/pull/4863
- [Feature] support mm disable_chunked by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/4803
- [CI] Update ERNIE-4.5-VL baseline to adapt to MoE changes by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4867
- [CI] Refactor check-bypass logic in run_tests_with_coverage by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4655
- [Others] Delete PaddleOCR Useless Function by @Limerances in https://github.com/PaddlePaddle/FastDeploy/pull/4815
- [Feature] Optim PaddleOCR-VL by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4873
- [XPU] fix ep_tp all2all ci by @zhupengyang in https://github.com/PaddlePaddle/FastDeploy/pull/4876
- [XPU] modify 424B model deployment parameter by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4888
- [XPU][CI] Ci bug fix by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4889
- [BugFix] fix token_processor zmq by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4827
- [CI] fix docker_build error of ciuse by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4886
- [Metax] support ERNIE-4.5-VL-28B by @neilzhuu in https://github.com/PaddlePaddle/FastDeploy/pull/4820
- [BugFix] max_logprobs=-1 maps to ori_vocab_size by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4884
- [Feature] Enable FastDeploy to support adding the “--api-key” authentication parameter. by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4806
- [Docs]Supplement the English and Chinese user documentation for Tool calling by @AuferGachet in https://github.com/PaddlePaddle/FastDeploy/pull/4895
- [XPU][CI]Update test assertion and base response value by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4907
- [BugFix] When the value of "temperature" is 0, adjust it to 1e-06 by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/4900
- [Docs] add api-key usage instructions by @LiqinruiG in https://github.com/PaddlePaddle/FastDeploy/pull/4902
- [CI] Add four unittest by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/4906
- [Bug Fix] fix bug for PD EP by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/4823
- [DeepEP] support async prefill by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4899
- [XPU]Update documentation by @qw86972190 in https://github.com/PaddlePaddle/FastDeploy/pull/4917
- [Docs] Improve reasoning_out docs by @LiqinruiG in https://github.com/PaddlePaddle/FastDeploy/pull/4901
- [BugFix] Fix inference_start_time by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4922
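Several entries above extend the OpenAI-compatible serving API with new sampling and observability options (logprobs_mode in #4567, prompt_logprobs/max_logprobs in #4769, logits processors in #4515). As a minimal sketch of how a client might exercise the logprobs options through a standard chat-completions payload — the model name is a placeholder, and any field beyond the common OpenAI chat schema should be checked against the FastDeploy parameter docs rather than taken from this example:

```python
import json

def build_chat_request(prompt: str, *, logprobs: bool = True, top_logprobs: int = 5) -> str:
    """Build an OpenAI-style chat-completions payload requesting per-token logprobs."""
    payload = {
        "model": "ERNIE-4.5-21B-A3B",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        "logprobs": logprobs,          # ask the server to return token logprobs
        "top_logprobs": top_logprobs,  # number of alternatives per position
    }
    return json.dumps(payload)

# The serialized payload would be POSTed to the server's /v1/chat/completions endpoint.
req = json.loads(build_chat_request("Hello"))
print(req["top_logprobs"])
```

The response's `usage` field can then be inspected for the new multimodal and reasoning token counters described in the release highlights.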
New Contributors
- @ooooo-create made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/3609
- @zhupengyang made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/3911
- @WanRui37 made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/3935
- @handsomecoderyang made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4027
- @fmiao2372 made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4161
- @YuhanXu made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4218
- @ccsuzzh made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4274
- @xiaozude made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4208
- @fangfangssj made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4312
- @tianshuo78520a made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4427
- @JianyuLi01 made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4445
- @Limerances made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4240
- @ST-XX made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4597
- @zhang-prog made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4633
- @neilzhuu made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4685
- @juncaipeng made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4709
- @playaswd made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4848
Full Changelog: v2.2.1...v2.3.0