New Features
- Added deployment support for GLM 4.5 text models #3928
- Added deployment support for GPT-OSS-BF16 text models #4240
- Added deployment support for the ERNIE-4.5-VL-28B-A3B-Thinking multimodal reasoning model; see the documentation for details
- Added deployment support for the PaddleOCR-VL multimodal model #4936
- Added constrained-decoding structured output support for multimodal and reasoning models #2749
- Added Prefix Caching and Encoder Caching support for multimodal models #4134
- Added Wfp8Afp8 online quantization inference support #4051 #4238
- Added static Cfp8 quantized inference support #4568
- LogProb support
- HuggingFace Safetensors model loading promoted to a default capability
- Improved CUDA Graphs support on NVIDIA GPUs
- Added a terminal command-line CLI toolset
- Added anonymous port support for `engine-worker-queue-port` and `cache-queue-port` #4597
- Added `LogitsProcessors` post-processing parameter support #4515
- Added a ReasoningParser and ToolParser for the ERNIE-4.5-VL-Thinking model #4571
- The `usage` field now reports statistics for multimodal input/output tokens and reasoning tokens #4648 #4520
- Added the `n` parameter to return multiple generations for a single request #4273
- Added a `tools` parameter to the offline inference chat interface for tool calling #4415
- Added download retries for URL data in multimodal preprocessing #3838
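Several of the request-level additions above (the `n` parameter, tool calling, and the extended `usage` statistics) surface through the OpenAI-compatible chat API. Below is a minimal sketch of a request body exercising them; the model name and tool definition are illustrative assumptions, not values from this release:

```python
import json

# Hypothetical body for an OpenAI-compatible /v1/chat/completions request.
# Field names follow the standard OpenAI chat schema; the model name and
# the "add" tool are made up for illustration.
payload = {
    "model": "ERNIE-4.5-VL-28B-A3B-Thinking",
    "messages": [{"role": "user", "content": "Add 2 and 3 using the tool."}],
    "n": 2,  # return two candidate generations for this single request (#4273)
    "tools": [  # tool calling through the chat interface (#4415)
        {
            "type": "function",
            "function": {
                "name": "add",
                "description": "Add two integers",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "a": {"type": "integer"},
                        "b": {"type": "integer"},
                    },
                    "required": ["a", "b"],
                },
            },
        }
    ],
}

body = json.dumps(payload)  # this JSON string would be POSTed to the server
```

On the response side, the extended `usage` block (#4648 #4520) is where the per-request multimodal and reasoning token counts are reported alongside the standard `prompt_tokens` and `completion_tokens` fields.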
Performance Optimizations
- Optimized the per_token_quant_fp8 operator, improving its performance by 50% #4238
- MTP supports Chunked Prefill and V1 KVCache scheduling #3659 #4366
- V1 KVCache scheduling supports context caching and is now the default configuration #3807 #3814
- Optimized MLA kernel performance, supporting a high-performance MLA kernel under auto chunk + CUDA Graphs #3886
- Reduced CPU synchronization overhead in the ViT module of Qwen-VL #4442
- Machete GEMM supports WINT4/WINT8 and group scale, and is now the default dense GEMM backend, improving model performance and accuracy #4451 #4295 #4121 #3999 #3905
- Optimized the performance of append attention preprocessing operators #4443 #4369 #4367
- Reimplemented reasoning-length truncation as a custom operator, making it more robust and standardized #4279 #4736
- Optimized sampling in multi-card scenarios on Intel HPU #4445
- Added the MergedReplicatedLinear method, supporting qkv_a_proj fusion for DeepSeek #3673
- Reduced DeepEP buffer memory usage; support creating/deleting the DeepEP buffer in EP scenarios #4039
- Mitigated the slowdown caused by DeepEP buffer clearing in centralized EP scenarios #4039
- Adapted speculative decoding to QK norm #3637
- Fixed undersized KV Cache capacity allocation #4355
- Engine-Worker cross-process communication supports zero-copy transfer of multimodal tensor data #4531
- APIServer supports gunicorn + uvicorn, reducing preprocessing overhead #4496 #4364
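For context on the per-token FP8 item above: per-token dynamic quantization derives one scale per token (row) from that row's absolute maximum, so an outlier in one token does not degrade every other token's precision. The following is a pure-Python sketch of the idea under stated simplifications; the real operator is a fused GPU kernel that emits true float8 values, and rounding to the non-uniform fp8 grid is omitted here for clarity:

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3

def per_token_quant_fp8(rows):
    """Per-token dynamic quantization sketch: one scale per row.

    Returns (quantized_rows, scales); values are kept as Python floats
    clipped to the fp8 dynamic range.
    """
    quantized, scales = [], []
    for row in rows:
        amax = max(abs(v) for v in row)
        scale = (amax / FP8_E4M3_MAX) or 1e-12  # guard all-zero rows
        quantized.append(
            [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in row]
        )
        scales.append(scale)
    return quantized, scales

x = [[0.1, -2.0, 3.5], [100.0, -50.0, 0.5]]
q, s = per_token_quant_fp8(x)
# Dequantization recovers the input (exactly here, since grid rounding
# is omitted); with real fp8 storage there would be small rounding error.
dequant = [[v * sc for v in row] for row, sc in zip(q, s)]
```

Each row's absolute maximum maps onto `FP8_E4M3_MAX`, which is what makes the scheme robust to cross-token magnitude differences.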
Multi-Hardware
- Kunlunxin P800
- MetaX C550
- Iluvatar CoreX
Documentation
- Added usage instructions for the terminal CLI commands #4569
- Added a graceful shutdown guide #3785
- Updated the supported models documentation #4754
- Added documentation and best practices for 2-bit quantization #3819 #3968
- Added DP parallel deployment documentation #3883
- Added deployment documentation for the ERNIE-4.5-VL model on Kunlunxin #4586
- Added deployment documentation for the PaddleOCR-VL model on XPU #4792
- Updated the model best practices documentation #3969
- Added ERNIE-4.5-21B-A3B-Thinking best practices documentation #3994
- Updated the metrics documentation #4061
- Updated the API parameter documentation, adding descriptions of `completion_tokens`, `prompt_tokens`, and `tool_calls` #4421
Bug Fixes
- Fixed Prefix Caching failing to deploy under DP parallelism #4359 #4370
- Fixed a KVCache scheduling hang on long inputs in centralized EP parallel deployments #4275
- Fixed a CUDA error 700 in the noaux_tc operator when CUDA Graphs is enabled #4174
- Fixed an incorrect TritonMoEBlockWiseFP8 weight shape under the V1 Loader #4384
- Fixed MoE preprocessing in EP scenarios and added validation of num_experts_per_rank #4102
- Fixed unstable CustomAllReduce output #4437
- Fixed the reasoning-length limit on Kunlunxin producing reasoning content with no response #4539 #4760
- Fixed residual KVCache management processes after abnormal inference exits #4410
- Fixed errors in some scenarios where ChunkedPrefill is enabled by default #3759
- Fixed a CudaError in DeepSeek models caused by the scheduling method #4757
- Fixed a bug with context caching enabled by default for multimodal models on XPU #4694
- Fixed model loading with MTP and C8 enabled #4077
- Fixed a bug where MLA enables TensorCore by default #4354
- Fixed repeated connection initialization in APIServer #3901
- Fixed mixed-up log paths in MultiAPIServer #3967
- Fixed multi-node tensor parallel deployment failures #4377
- Fixed Qwen-VL series models being unable to disable thinking #3808 #4762
- Fixed an incorrect `finish_reason` in non-streaming responses from the APIServer chat interface #4582
- Fixed a wrong reasoning end token in the ERNIE-4.5-VL model's ReasoningParser #4686
- Fixed the offline interface unexpectedly forcing `enable_thinking` to False #4248
- Fixed ERNIE-4.5-VL's handling of PNG images with transparent backgrounds #4847
- Fixed an error when rope3d is used with FA3 enabled #3791
- Fixed operator import errors on some hardware platforms #4559
- Fixed multiple issues when launching the inference service in PD-disaggregated EP parallel scenarios #4311 #4420 #4542 #4693 #4781
- Fixed inaccurate `num_requests_running`, `num_requests_waiting`, and `available_gpu_block_num` metrics #4404
- Fixed excessive trace spans in trace logs for streaming output #4375
- Fixed a dynamic C8 computation error #4119
- Fixed a bug in AppendAttention's custom-operator registration that caused dynamic/static graph inconsistency #4340
- Fixed placeholder handling errors for video and image data in Qwen-VL series preprocessing #4065
- Fixed wasted GPU memory in the model network construction #3854
- Fixed a bug in the reasoning-length limit under concurrent requests #4296
- Fixed an IPC signal read error in PD-disaggregated deployments #4309
- Fixed a shared-directory naming conflict for metrics #4007
- Fixed random precision issues in the Kunlunxin barrier #4181
- Fixed an exception when the reasoning-length limit exceeds its upper bound #4086
Others
- Fixed unit test failures on MetaX hardware #4027
- Fixed flaky failures of the test_get_save_output_v1 unit test #4732
- Added W4A8 unit test cases for Kunlunxin #4501
- Optimized Config code, removing redundant fields #4147 #4362 #4400
- Third-party libraries are now managed as git submodules #4033
- Added end-to-end monitoring for DeepSeek-V3-0324 #4360
- Renamed the ERNIE-4.5-VL continuation field `generated_token_ids` to `completion_token_ids` #4086
- The APIServer process now exits automatically and prints a notice to the terminal when a backend process exits abnormally #3271
- Added several observability metrics #3868
- Added performance unit tests for the Attention layer #4494
- Support hot reloading of model weights in DP+EP parallel scenarios #3765 #3803 #3898
- Support forcibly stopping inference requests in training scenarios #3601 #4402
- Fixed abnormal name mapping for Qwen3 models in training scenarios #4338 #4322
- Fixed the `max_streaming_response_token` parameter having no effect on streaming requests #3789
- Added a ZMQ-based channel for returning worker inference results to the Engine #3521
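To illustrate the semantics behind the `max_streaming_response_token` fix above: rather than emitting one streamed chunk per generated token, the server can buffer up to N tokens per chunk. A minimal sketch of that chunking behavior (the function and names are illustrative, not FastDeploy internals):

```python
def stream_in_chunks(token_iter, max_streaming_response_tokens=1):
    """Yield joined chunks of up to N tokens from a token stream.

    Illustrative sketch of the max_streaming_response_token semantics
    (#3789): with N=1 every token is flushed immediately; with larger N
    tokens are batched, trading latency for fewer streamed events.
    """
    buf = []
    for tok in token_iter:
        buf.append(tok)
        if len(buf) >= max_streaming_response_tokens:
            yield "".join(buf)
            buf = []
    if buf:  # flush any trailing partial chunk at end of generation
        yield "".join(buf)

chunks = list(stream_in_chunks(iter("abcdefg"), max_streaming_response_tokens=3))
# chunks == ["abc", "def", "g"]
```

The trailing flush is what the parameter bug class typically involves: the last, shorter-than-N chunk must still be delivered when the stream ends.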
What's Changed
- Add more runtime information to resource manager by @ming1753 in #3706
- Add CI cases by @ZhangYulongg in #3714
- Add loader test for mtp by @YuanRisheng in #3724
- fix typos by @co63oc in #3684
- add ci images build job by @XieYunshen in #3749
- [DOC] fix Document by @lizexu123 in #3782
- Update test_ernie_21b_mtp.py by @ZhangYulongg in #3783
- fix test_load_mtp by @co63oc in #3780
- [BugFix] Fix chunked prefill by @kevincheng2 in #3759
- [BugFix] fix max streaming tokens invalid by @ltd0924 in #3789
- [Feature] Setting number of apiserver workers automatically by @Jiang-Jia-Jun in #3790
- [Feature] mm and thinking model support structred output by @kevincheng2 in #2749
- [Feature] support model weight update in ep by @ltd0924 in #3765
- [BugFix] fix error of import paddle.base.core.Config by @yuanlehome in #3761
- [Executor] Fix bug of import paddle with RLHF by @gongshaotian in #3781
- rename speculate_stop_generation_multi_stop_seqs by @co63oc in #3743
- Modify mask_offset's format by @carryyu in #3525
- rename speculate_token_penalty_multi_scores.cu by @co63oc in #3735
- fix ce compile job by @XieYunshen in #3768
- [v1loader]Reduce EB300B model loading time by @bukejiyu in #3700
- 【Fix bug】Fix w4afp8's nblock to 256, and add a mask parameter to FA3's append attn by @yangjianfengo1 in #3771
- 【Hackathon 9th No.64】add test_draft_model_set_value_by_flags by @Echo-Nie in #3741
- [Feat] Support streaming transfer data using ZMQ by @Wanglongzhi2001 in #3521
- [BugFix] fix scheduler invalid by @ltd0924 in #3803
- rename fused_get_rope.cu by @co63oc in #3752
- 【Hackathon 9th No.84】Supplementary Unit Test for fastdeploy/reasoning by @Echo-Nie in #3570
- fix w8a8.py by @co63oc in #3733
- fix dcu_worker.py by @co63oc in #3734
- 【Hackathon 9th No.73】add unit tests for graph_opt_backend by @ooooo-create in #3609
- [XPU] FIX XPU CI BUG by @plusNew001 in #3829
- [Doc] update wint2 doc by @chang-wenbin in #3819
- fix test_append_attention_with_output.py by @carryyu in #3831
- [XPU] Update XPU CI case by @plusNew001 in #3837
- qk norm for speculate decode C16 by @rsmallblue in #3637
- [V1 Loader]V1 loader support EP by @YuanRisheng in #3801
- [Code Simplification] delete cum_offsets_out by @lizexu123 in #3815
- [Feature] `ernie4_5_vl_moe` support huggingface safetensor loading by @aquagull in #3750
- add reasoning parser plugin by @luukunn in #3811
- reopen ut by @XieYunshen in #3795
- Automatically configure workers based on max-num-seqs by @yyssys in #3846
- 【Hackathon 9th No.43、45】add speculate_get_padding_offset by @co63oc in #3730
- 【Hackathon 9th No.42】add test_speculate_get_output_padding_offset by @co63oc in #3740
- [XPU] Update XPU stable xvllm and xtdk version for 2.2 by @plusNew001 in #3853
- 【BUG FIX】Fixed moba single test port conflict by @yangjianfengo1 in #3863
- fix typo EngineSevice EngineService by @co63oc in #3841
- 【Hackathon 9th No.27】add test_get_padding_offset by @co63oc in #3708
- 【Hackathon 9th No.54、57】 add unit tests for per_token_quant and per_token_quant_padding by @ooooo-create in #3746
- [BugFix]add rollout config dp by @gzy19990617 in #3822
- Support extend block tables by @RichardWooSJTU in #3824
- 【Hackathon 9th No.34】add test_get_position_ids_and_mask_encoder_batch by @Echo-Nie in #3739
- 【Hackathon 9th No.63】add test_draft_model_postprocess.py by @co63oc in #3757
- [Feature] Set v1 scheduler as default in develop by @rainyfly in #3807
- fix response processors by @RichardWooSJTU in #3826
- support mtp rope_3d by @xiaoxiaohehe001 in #3791
- [Feature][MTP]support mtp in v1_scheduler mode by @freeliuzc in #3695
- Graceful shut down by @xiaolei373 in #3785
- Support for async processor added. by @sunlei1024 in #3869
- [CI] update paddleformers==0.2 in develop by @EmmonsCurse in #3878
- Update test_ernie_21b_mtp.py by @ZhangYulongg in #3885
- [BugFix] fix qwen vl processor by @ltd0924 in #3808
- [Docs] add data parallel by @ltd0924 in #3883
- 【Hackathon 9th No.35】add test_moe_redundant_topk_select by @Echo-Nie in #3867
- 【BugFix】fix gpu mem oom by @gzy19990617 in #3854
- 【Hackathon 9th No.32】add unit tests for group_swiglu_with_masked by @ooooo-create in #3748
- 【Inference Optimize】Update MergedReplicatedLinear for DSK qkv_a_proj_with_mqa. by @chang-wenbin in #3673
- [fix]load hadamard_block_size from config by @rsmallblue in #3797
- [Feature] support controller port in multi api server by @ltd0924 in #3898
- Compatible with EB 0.3B torch model arch by @ckl117 in #3913
- [Attention]clean_code by @zhoutianzi666 in #3917
- [Fix] mv connection_manager init by @ltd0924 in #3901
- add cache queue port by @ZhangYulongg in #3904
- rename eagle_get_base_model_hidden_states.cu by @co63oc in #3753
- [feature]Support model loading from cache by @bukejiyu in #3857
- ignore ci by @bukejiyu in #3950
- [Feature] add HTTP GET retry by @ApplEOFDiscord in #3838
- [XPU]Fixed the issue of performance degradation caused by enabling ENABLE_V1_KVCACHE_SCHEDULER by @iosmers in #3897
- [Bug fix] Fix prompt token ids dtype in v1 by @rainyfly in #3860
- supports dynamic Cfp8 by @carryyu in #3767
- Update sparse attn documentation by @yangjianfengo1 in #3954
- [Executor] Experiment Feature-Support Prefill in cudagraph by @littledgg in #3459
- [metrics] Add serveral observability metrics by @qwes5s5 in #3868
- [Docs] Update env docs for Machete by @Sunny-bot1 in #3959
- rename ep_moe_prefill_func ep_moe_expert_dispatch by @co63oc in #3938
- fix typos by @co63oc in #3951
- [Optimize]Error messages about Model api. by @AuferGachet in #3839
- 【Doc】Update WINT2 Doc Pic by @chang-wenbin in #3968
- Modify markdown by @xiaolei373 in #3896
- [docs] update docs by @yangjianfengo1 in #3975
- 【Hackathon 9th No.22】add unit tests for share_external_data by @ooooo-create in #3744
- 【Hackathon 9th No.68】supplementary unit test for ngram_match by @Echo-Nie in #3732
- 【Hackathon 9th No.44】add test_speculate_get_token_penalty_multi_scores.py by @co63oc in #3742
- 【Hackathon 9th No.69】add test_draft_model_preprocess by @co63oc in #3832
- 【Hackathon 9th No.60、62】add eagle_get_hidden_states by @co63oc in #3876
- 【Hackathon 9th No.66】add test_speculate_set_stop_value_multi_seqs by @co63oc in #3941
- 【Hackathon 9th No.36】add test_extract_text_token_output by @Echo-Nie in #3862
- [docs] update best practice docs by @zoooo0820 in #3969
- [XPU]Release2.2 update release note by @iosmers in #3986
- 【Doc】update dsk doc by @chang-wenbin in #3989
- update doc by @bukejiyu in #3990
- del batch id per token by @carryyu in #3963
- [Docs] update VL best_practices for release/2.2 by @ming1753 in #3965
- [CI] update ci by @ZhangYulongg in #3962
- [docs] add a3b-thinking doc by @zoooo0820 in #3994
- 【docs】update index.html and dockfile by @yangjianfengo1 in #3998
- 【FIX】Change the name of sparse attn from moba to plas by @yangjianfengo1 in #3845
- 【Fix】Change the name of sparse attn from moba to plas by @yangjianfengo1 in #3993
- 【docs】 update readme by @yangjianfengo1 in #4000
- Revert "【Fix】Change the name of sparse attn from moba to plas" by @Jiang-Jia-Jun in #4002
- Revert "【FIX】Change the name of sparse attn from moba to plas" by @Jiang-Jia-Jun in #4001
- get org_vocab_size from args by @zeroRains in #3983
- [V1 Loader]Ernie kv cache quant support v1 loader by @YuanRisheng in #3899
- [V1 Loader] Support V1 Loader for Machete by @Sunny-bot1 in #3999
- metrics shared folder naming by @zhuangzhuang12 in #4007
- [MoE] clean code by @zhoutianzi666 in #4020
- [BugFix] Fix the abnormal memory usage caused by shape errors in the triton moe backend by @yuanlehome in #4026
- [xpu] add ep custom ops by @zhupengyang in #3911
- [Feat] `ernie4_5_vl_moe` support CudaGraph by @aquagull in #3226
- [Executor] Adjust signal sending order in RL training by @gongshaotian in #3773
- 【Hackathon 9th No.28】add test_cutlass_fp8_fp8_fp8_dual_gemm_fused by @WanRui37 in #3935
- [Fix] fix multi api server log dir by @ltd0924 in #3967
- [MTP]support rope_3d in spec mode by @freeliuzc in #4034
- [Feature] Support zai-org/GLM-4.5-Air BF16 model by @ckl117 in #3928
- 【Inference Optimize】DeepSeek-V3-model MLA Optimize by @chang-wenbin in #3886
- 【Hackathon 9th No.55】add test_update_inputs_v1.py by @co63oc in #3992
- [docs] Update environment variables documentation by @bukejiyu in #3957
- [BugFix] qwen2.5vl enable_thinking=true and image_patch_id bug fix by @CSWYF3634076 in #3921
- fix import tests.utils error in tests/model_loader/test_load_mtp.py by @handsomecoderyang in #4027
- [setup optimize]Support git submodule by @YuanRisheng in #4033
- [CI] skip test_structured_outputs* temporarily by @EmmonsCurse in #4055
- update ci by @ZhangYulongg in #4064
- [BugFix] mm_post_fix by @xiaoxiaohehe001 in #4005
- [Echo] Support more types of prompt echo by @AuferGachet in #4022
- [Feature] add cli command chat,complete by @memoryCoderC in #4037
- [bug fix] Fix the placeholder in qwen prompt and add some unittests by @lddfym in #4065
- [Feature] GLM-45-AIR Support Mix Quantization(Dense wfp8afp8 and wint8 triton_moe_backend) by @ckl117 in #4051
- [Optimize] optimize prefix cache in develop by @rainyfly in #3890
- Add token processor plugin support by @RichardWooSJTU in #4059
- fix typos by @co63oc in #3840
- [metrics] update metrics markdown file by @qwes5s5 in #4061
- [CI] add multi api server test by @ltd0924 in #4049
- [Feature] refactor metax_gpu attention and moe and remove some useles… by @handsomecoderyang in #3688
- 【Hackathon 9th No.25】add test_fused_get_rotary_embedding by @Echo-Nie in #3892
- 【Hackathon 9th No.78】add test_chat.py by @co63oc in #3958
- [BugFix]Fix load kv cache quant scale by @YuanRisheng in #4077
- [format] Valid para format error info by @xiaolei373 in #4035
- [BugFix] Fix `image_feature` 0-Size causing insert failed by @aquagull in #4042
- fix(CE): update concurrency to stop CE tasks from canceling each other by @XieYunshen in #4083
- Support offline inference with streaming output by @xyxinyang in #4071
- 【FastDeploy CLI】collect-env subcommand by @qwes5s5 in #4044
- [Bug Fix]fix the bug for cache_messager signal loss by @zeroRains in #3879
- 【Hackathon 9th No.61、65、41】add test_draft_model_update by @co63oc in #3940
- 【Hackathon 9th No.49】add test_pre_cache_len_concat by @Echo-Nie in #3847
- [Optimize] Support WINT8 and group scale for Machete by @Sunny-bot1 in #3905
- [v1 loader]qwen Offline fp8 by @bukejiyu in #4036
- [xpu] support ep by @zhupengyang in #4067
- [CUDAGraph] Support multi output buffers and merge some fixes from feature/exp_0908 by @yuanlehome in #4062
- [MTP]update hybrid-mtp-with-ngram by @freeliuzc in #4047
- [MTP]Develop mtp reshard by @freeliuzc in #4099
- [BugFix]Fix Ernie bf16 model loading bug and add comments by @bukejiyu in #4106
- fix typos by @co63oc in #4093
- [BugFix]Fix key mismatch when load mtp by @YuanRisheng in #4105
- [submodule] add ignore=all for deepgemm by @yuanlehome in #4118
- [BugFix] Fix EP MoE expert dispatch function by @Sunny-bot1 in #4102
- 【Hackathon 9th No.37】add test_top_k_renorm_probs by @Echo-Nie in #3755
- 【Hackathon 9th No.52】add test_dynamic_per_token_scaled_fp8_quant by @co63oc in #4015
- [CE]add plas attention config by @tianlef in #4128
- Addcase by @DDDivano in #4112
- [benchmark]add lite-vl and x1 yaml by @xiegegege in #4130
- [Doc][CE]x1_a3b server config by @tianlef in #4135
- ci: Increase compilation task time limit by @XieYunshen in #4098
- fix dynamic Cfp8 computing error by @rsmallblue in #4119
- [Feature] Set prefix caching as default by @rainyfly in #3814
- Update test_w4a8_model.py by @ZhangYulongg in #4125
- mv test to tests by @XieYunshen in #4129
- [FDConfig]Remove max_num_batched_tokens/max_num_seqs in parallel config by @YuanRisheng in #4116
- [BugFix] Forbid `FD_DISABLED_RECOVER` while `ENABLE_V1_KVCACHE_SCHEDULER` by @Jiang-Jia-Jun in #4142
- Reconstruct streaming data transfer with zmq by @RichardWooSJTU in #3836
- Print KV Cache available memory and block memory usage in GB format by @qw86972190 in #4148
- [BugFix]Fix test_prefix_cache by @YuanRisheng in #4155
- [NewFeture]add ep rollout model init and update/clear ep buffer by @gzy19990617 in #4039
- [CI] enhance clean port strategy by @EmmonsCurse in #4152
- [Feature] Support mixed deployment with yiyan adapter in develop by @rainyfly in #3976
- Add param valid log by @xiaolei373 in #4113
- [FastDeploy CLI]collect-env unitest bug fix by @qwes5s5 in #4159
- [Optimize] Machete uses group scale by default by @Sunny-bot1 in #4121
- Bugfix test exception by @xiaolei373 in #4171
- Each module should have its own plugins_loaded by @yuanlehome in #4164
- [Logprob] EP support logprob by @ckl117 in #4151
- [fix]Modify follow-up push parameters and Modify the verification method for thinking length by @luukunn in #4086
- [FDConfig]Remove splitwise_role and engine_worker_queue_port in FDConfig by @YuanRisheng in #4147
- 【Hackathon 9th No.46】add test_fused_rotary_position_encoding by @Echo-Nie in #3848
- [Bug fix] fix request assign by @RichardWooSJTU in #4163
- [TEST] init first commit by @cqulilujia in #4192
- fix nul by @co63oc in #4191
- [BugFix]fix glm all_reduce tp group by @ckl117 in #4187
- [Feature] support pool by @lizexu123 in #3827
- fix typos by @co63oc in #4176
- 【Hackathon 9th No.30】add test_tritonmoe_preprocess by @Echo-Nie in #3891
- [Feature] Support pd ep deployment with yiyan adapter by @rainyfly in #4029
- 【Hackathon 9th No.40】add test_top_p_candidates by @co63oc in #4046
- 【Hackathon 9th No.26】add test_set_value_by_flags_and_idx.py by @Echo-Nie in #4186
- [FD CLI] Add bench cli by @ZhangYulongg in #4160
- [Iluvatar GPU] Optimize attention performance and fix moe load ckpt e… by @wuyujiji in #3651
- Remove useless code by @Jiang-Jia-Jun in #4195
- [Feature] support clear data by @ltd0924 in #3601
- [XPU]change xpu ci model by @plusNew001 in #4117
- 【FIX】Change the name of sparse attn from moba to plas (#4006) by @yangjianfengo1 in #4076
- [XPU] update XPU CI by @plusNew001 in #4209
- [XPU] Update run_ci_xpu.sh to lock xvllm version by @plusNew001 in #4210
- [xpu] use cpu barrier by @zhupengyang in #4181
- [Feature] support qwen3-embedding model load by @lizexu123 in #4202
- Fix noaux_tc cuda Error 700 in CUDAGraph by @ckl117 in #4174
- [v1 loader]code style by @bukejiyu in #4204
- [Test]add glm45_air logprob test and rollout model by @ckl117 in #4175
- [XPU] Enable XPU V1 mode based on environment variable by @yyssys in #4213
- register_model_class compatible with plugins by @yuanlehome in #4236
- 【Hackathon 9th No.24】add rebuild_padding by @co63oc in #4107
- [Intel HPU] Support intel hpu platform by @fmiao2372 in #4161
- [CUDAGraph] [FIX] Fix CUDA error(700): 'cudaErrorIllegalAddress' in CascadeAppendW… by @YuhanXu in #4218
- [BugFix]fix v1 loader moe bf16, and supoort dynamic_load_weight create quant param by @ckl117 in #4229
- [BugFix] fix qwen3-embedding model tp>1 by @lizexu123 in #4223
- [Bug Fix] disable prefix caching in mm model by @ApplEOFDiscord in #4167
- [Feature] add cli command serve by @memoryCoderC in #4226
- [OPs] MoE support wfp8afp8(channelwise) and improve per_token_quant_fp8 by @ckl117 in #4238
- [fix]update apply_chat_template by @luukunn in #4137
- [Model] Qwen2.5VL support --use-cudagraph and unit testing by @CSWYF3634076 in #4087
- [CUDAGraph]CUDA Graph support unique memory pool by @gongshaotian in #4230
- [BugFix]fix the bug for prefilled_step_idx signal of cache_messager in cudagraph and PD by @zeroRains in #4235
- 【Hackathon 9th No.21、23】add unit tests for fused_hadamard_quant_fp8, moe_fused_hadamard_quant_fp8 by @ooooo-create in #4094
- [XPU] support XPU VL model inference by @cqulilujia in #4030
- delete moe_phase in parallel_config(Moved to model_config) by @yuanlehome in #4264
- Support limit thinking lengths. by @K11OntheBoat in #4069
- [Docs]Add ENABLE_V1_KVCACHE_SCHEDULER=0 to docs by @yyssys in #4268
- 【fix】Remove the logic that assigns the default value of 80% to reasoning_max_tokens in the offline component of FastDeploy. by @kxz2002 in #4248
- [Feature] add config api by @memoryCoderC in #4254
- [CI] fix base_test error temporarily by @EmmonsCurse in #4283
- [Supplements and upgrades]Improvement of X1 parsers by @AuferGachet in #4172
- fix ernie vl distributed attr. by @ZHUI in #4215
- [Doc]add glm benchmark yaml by @tianlef in #4289
- Add cli run batch by @xiaolei373 in #4237
- Add speculative decoding approval check by @Deleter-D in #4284
- Set approve checking for config.py, worker, model and cudagraph by @zeroRains in #4276
- [Docs]The XPU model loader uses the default version by @yyssys in #4292
- increase ccache size by @XieYunshen in #4255
- [Feature] deepgemm pre-compile tool support mixed parallel by @Deleter-D in #4282
- fix typos by @ccsuzzh in #4274
- [fix]remove reasoning_max_tokens=max_toksns*0.8 in sampling_params by @luukunn in #4277
- [Bug fix] Fix bug for running ep by @rainyfly in #4245
- Fix wrong batch size of thinking_mask by @K11OntheBoat in #4296
- [BugFix] Increase the conditions for the use of a Machete: not pre-quant by @Sunny-bot1 in #4295
- [XPU] fix VL thinking mode by @cqulilujia in #4266
- [feat] support prefix cache clearing when `/clear_load_weight` is called by @liyonghua0910 in #4008
- add_cli_tokenizer by @xiaolei373 in #4278
- [fix] fix gpu_cache_kvs key by @liyonghua0910 in #4311
- 【Feature】ResourceManagerV1 support need block num notifying by @RichardWooSJTU in #4220
- Fix bugs of splitwise_complete_prefilled_step IPCsignal clear by @K11OntheBoat in #4309
- [Metax] support cutlass moe & optimize flash attention & fix triton moe by @xiaozude in #4208
- [NewFeature]custom_allreduce support cudagraph recapture by @ckl117 in #4305
- [FIx] CI Approve fix by @zeroRains in #4316
- [BugFix]remove redundant includes by @fangfangssj in #4312
- [Bug fix]revert worker process ipc signal suffix to fix ep by @RichardWooSJTU in #4323
- 【Inference Optimize】Support MLA_CACHE & Fix V1_Schedule Bug by @chang-wenbin in #4318
- 【Fix】update docs by @yangjianfengo1 in #4339
- [Doc] Update xpu fastdeploy version to 2.2.1 by @yyssys in #4338
- 【Bug-Fix】schedule_bugfix by @chang-wenbin in #4336
- [Executor]CUDAGraph support Speculate Decode by @gongshaotian in #3769
- Remove redundant inplace outputs for `append_attention` by @SigureMo in #4340
- 【Inference Optimize】MLA Tensor-Core is enabled by default by @chang-wenbin in #4335
- 【Hackathon 9th No.86】autogen `MultiQueryAppendC8Attention` template_instantiation - part by @ccsuzzh in #4330
- [XPU] Support W4A8C8-TP4-300B Model by @iosmers in #4068
- supports spec dynamic cfp8 by @carryyu in #4290
- [FastDeploy Cli] Bench Command eval and throughput by @qwes5s5 in #4239
- add release images build job by @XieYunshen in #4265
- 【Bug Fix】mla enables tensorcore by default by @chang-wenbin in #4354
- 【Inference Optimize】Calculate paddle_peak_increase using paddle_allocated_mem_after_run by @chang-wenbin in #4355
- 【Add CI】Add DeepSeek model end-to-end CI by @chang-wenbin in #4360
- [Feature] support prefix cache in DP by @ltd0924 in #4359
- 【BugFix】fix qwen3moe name_mapping config by @gzy19990617 in #4348
- [MTP]support more branchs in topp kernel by @freeliuzc in #4352
- [FDConfig]Remove max_model_len in FDConfig by @YuanRisheng in #4350
- [XPU] fix XPU CI bug by @plusNew001 in #4358
- [Doc] fix document navigation link paths by @yyssys in #4368
- 【Hackathon 9th No.20】add unit tests for masked_per_token_quant by @ooooo-create in #4111
- [Doc] fix the port conflict issue in the usage example by @EmmonsCurse in #4379
- [Optimization] Fuse get_max_len and get_kv_max_len by @Sunny-bot1 in #4369
- [CI] fix diff_error temporarily by @EmmonsCurse in #4390
- 【Hackathon 9th No.67】add speculate_verify by @co63oc in #4326
- [Doc]add x1 a3b quantization yaml by @tianlef in #4397
- [Doc] fix offline inference doc by @ApplEOFDiscord in #4412
- [Docx] add PaddlePaddle nightly build address for GPU by @yangjianfengo1 in #4414
- [CI] Fix partial instability issues by @EmmonsCurse in #4418
- [benchmark] Update benchmark tools by @ZhangYulongg in #4416
- [Optimization] Optimize split_q_block kernel by @Sunny-bot1 in #4367
- [XPU] fix ep by @zhupengyang in #4393
- [BugFix] fix multinode bugs by @ltd0924 in #4377
- [fix] Fixed the issue of excessive/redundant spans being returned for streaming requests. by @qwes5s5 in #4375
- [fix] fix requests & block metrics by @liyonghua0910 in #4404
- [MTP]support mtp chunk_prefill_v1 by @freeliuzc in #4366
- 【BugFix】fix block_wise_fp8_v1_loader_moe_shape by @ckl117 in #4384
- 【Fix CI Bug】Fix ci bug by @chang-wenbin in #4413
- Disable gcu ci by @tianshuo78520a in #4427
- V1 loader default by @bukejiyu in #4251
- [BugFix] fix workers=1 by @ltd0924 in #4364
- [XPU] fix VL multi-batch accuracy issue by @cqulilujia in #4394
- [BUGFIX] clear request #4286 by @ltd0924 in #4402
- fix param by @freeliuzc in #4419
- Feature:Add support for Pooling Model Embedding and provide an OpenAI-compatible API. by @sunlei1024 in #4344
- [benchmark] Update benchmark_serving.py by @ZhangYulongg in #4438
- [BugFix] fix config bugs by @ltd0924 in #4370
- [XPU] support prefix cache by @ddchenhao66 in #4423
- [Bug fix] Fix pd for x1 thinking by @rainyfly in #4433
- [Bugfix]fix ep clear buffer perf by @gzy19990617 in #4389
- [XPU] moe support VL 0-size input by @cqulilujia in #4408
- [benchmark] Fix benchmark duration calculation logic by @ZhangYulongg in #4446
- [benchmark] Add filtering for failed requests in benchmark outputs by @ZhangYulongg in #4448
- perf: optimize ZMQ communication with async queue and single-threaded… by @sunlei1024 in #4444
- [BUG] fix ep bug by @kevincheng2 in #4275
- 【Hackathon 9th No.86】autogen `MultiQueryDecoderAttention` template_instantiation - part by @ccsuzzh in #4383
- [benchmark] Update benchmark tools by @ZhangYulongg in #4454
- 【Fix】 remove text_after_process & raw_prediction by @LiqinruiG in #4421
- [Intel HPU] Enable dist sampler on intel hpu platform by @JianyuLi01 in #4445
- [xpu] refine fused_moe by @zhupengyang in #4219
- [SOT][CUDAGraph] Add support for custom all-reduce operators under SOT mode by @DrRyanHuang in #4386
- [FDConfig]Remove total_block_num/dtype/block_size/enc_dec_block_num in ParallelConfig by @YuanRisheng in #4400
- [bugfix] kill cache_transfer_manager process by @xiaolei373 in #4401
- [BugFix]Fix wfp8afp8 triton moe group_topk renormalized=True by @ckl117 in #4449
- [Comments] Modify comments to English by @xiaolei373 in #4460
- [FDConfig]Remove reasoning_parser/guided_decoding_backend/disable_any_whitespace/device_ids in FDConfig by @YuanRisheng in #4362
- [SOT][Cudagraph] Remove BreakGraph of #3302 && update CustomOp by @DrRyanHuang in #3694
- [Optimization] Put get_block_shape_and_split_kv_block in cuda graph for append attention backend by @Sunny-bot1 in #4443
- [Others] add PR Template by @zeroRains in #4452
- [Optimize] Set preempted schedule log as info level by @rainyfly in #4453
- [SOT] Change warnings to errors and remove fallback operations by @DrRyanHuang in #4378
- [BugFix]Dev fix custom ar unstable result by @ckl117 in #4437
- [CINN] Remove the restriction of automatically falling back to SOT after enabling CINN by @DrRyanHuang in #4411
- [Feature] support pooling model dummy_run by @lizexu123 in #4345
- [XPU] abstract a hardware-agnostic operator wrapper for prefix cache and specify xpu device id definition by @ddchenhao66 in #4455
- [CI] Fix partial instability issues by @EmmonsCurse in #4461
- [Docs]add environment variables to environment_variables.md by @xiaolei373 in #4466
- [Perf][Qwen25_VL] Avoid triggering CPU synchronization in ViT's attn forward by @aquagull in https://github.com/PaddlePaddle/FastDeploy/pull/4442
- [Docx] add language (en/cn) switch links by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/4470
- [Iluvatar GPU] Adapt VL model by @wuyujiji in https://github.com/PaddlePaddle/FastDeploy/pull/4313
- [Loader]check paddle version for v1 loader by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4473
- [XPU]Fix w4a8 precision bug && rollback moe algo by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/4463
- [Docx] fix the broken link by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/4479
- LLM.chat add "tools" param by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4415
- 【feature】support n parameter by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4273
- [ATTN]delete code and add ffn and moe layer level test by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4440
- [CI] Handle unit test issues by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4483
- [Graph Optimization][Speculative Decoding] Fix the bug of CUDAGraph + MTP + EP by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4456
- Optimization of 'tools' in request fields by @AuferGachet in https://github.com/PaddlePaddle/FastDeploy/pull/4380
- [Metax] adjust mctlass moe api by @handsomecoderyang in https://github.com/PaddlePaddle/FastDeploy/pull/4474
- Support GPT-OSS-BF16 by @Limerances in https://github.com/PaddlePaddle/FastDeploy/pull/4240
- [Feature] support mtp logprob by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/4464
- [Loader]Qwen2.5-Math-PRM-7B and Ernie-VL-RM by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/4319
- [fix] remove cache tensor creation for cache_transfer_manager by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/4420
- [Benchmark] update benchmark scripts by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/4497
- [XPU]Fix w4a8 garbled code issue by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4493
- [XPU] Fix vl multi-card allreduce bug by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/4485
- [BugFix][CI] Clean up SOT code cache using `tearDown` in CINN unitest by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/4491
- Optimizing the performance of think length limit using custom operators by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/4279
- [APIServer] support define gunicorn timeout by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/4496
- [CI] update ernie-4_5-vl baseline by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4495
- 【BugFix】fix ep buffer clear by @gzy19990617 in https://github.com/PaddlePaddle/FastDeploy/pull/4450
- [XPU] bind block_attn kernel with pybind by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/4499
- [Executor] Default use CUDAGraph by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/3594
- [Speculative Decoding] Add draft_logprobs Support for Speculative Decode MTP by @sunlei1024 in https://github.com/PaddlePaddle/FastDeploy/pull/4467
- 【CI】Add test cases for n parameter and streaming validation by @DDDivano in https://github.com/PaddlePaddle/FastDeploy/pull/4503
- [Feature] Support mm model close prefix cache by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/4459
- [Doc]add deepseek wint4 ce by @tianlef in https://github.com/PaddlePaddle/FastDeploy/pull/4517
- [FDConfig]Turn on the CUDAGraph + Speculative Decoding switch by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4511
- Add comprehensive unit tests for limit_thinking_content_length operators by @Copilot in https://github.com/PaddlePaddle/FastDeploy/pull/4510
- [XPU]add xpu ci ep case by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4432
- feat: add post-processing step for pool_output by @sunlei1024 in https://github.com/PaddlePaddle/FastDeploy/pull/4462
- [FDConfig]Turn on the CUDAGraph + MultiModel switch by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4512
- [Iluvatar GPU] fix ci error caused by rebuild_padding param and cuda graph by @wuyujiji in https://github.com/PaddlePaddle/FastDeploy/pull/4504
- enhance set_stop_value_multi_ends and standardize the registration of some operators by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/4525
- [CI] Remove redundant .coveragerc file and fix by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4521
- [XPU]Modify the xpu memory display unit of log by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4534
- [Feature] support fd return decode response by @zhuangzhuang12 in https://github.com/PaddlePaddle/FastDeploy/pull/4407
- [Feature] Support AsyncLLM by @xyxinyang in https://github.com/PaddlePaddle/FastDeploy/pull/4458
- [CI] stable test_rollout_model.py by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/4536
- small change in test_fusedmoe.py by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4538
- c++ code format by @zhupengyang in https://github.com/PaddlePaddle/FastDeploy/pull/4527
- [XPU] Change XPU stable third-party version and add time-out by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4524
- [FDConfig]Turn on the CUDAGraph + RL switch by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4508
- [Others]Delete useless code by @YuanRisheng in https://github.com/PaddlePaddle/FastDeploy/pull/4544
- [BugFix]Fix finish reason by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/4543
- [Doc]fix deepseek ce by @tianlef in https://github.com/PaddlePaddle/FastDeploy/pull/4560
- [XPU] xpu support think length limit by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4539
- [CI] Optimize coverage upload reporting by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4547
- WINT4/WINT8 dense gemm default use Machete by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/4451
- [XPU] merge apply_tp, ops support token_num = 0 by @zhupengyang in https://github.com/PaddlePaddle/FastDeploy/pull/4507
- [BugFix] Fix decode_type which has been deleted in req and optimize token client retry scheme by @RichardWooSJTU in https://github.com/PaddlePaddle/FastDeploy/pull/4564
- [Graph Optimization] Support CUDAGraph Padding + MTP by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4545
- [FDConfig]Turn on the CUDAGraph + PD Disaggregation switch by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4530
- [KVCache] Support Static C8 by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/4568
- [Metax] adapt DeepSeek by @xiaozude in https://github.com/PaddlePaddle/FastDeploy/pull/4498
- [BugFix] fix create_cache_tensor for ep by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/4542
- [EP] fix adapter bugs by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/4572
- [XPU]fix v1 hang bug by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/4573
- [BugFix] fix import image_ops error on some platforms by @zoooo0820 in https://github.com/PaddlePaddle/FastDeploy/pull/4559
- [CLI]Update parameters in bench latency cli tool and fix collect-env cli tool by @qwes5s5 in https://github.com/PaddlePaddle/FastDeploy/pull/4558
- [Graph Optimization] Add dy_runnable and introduce cudagraph_switch_threshold for cudagraph mode switching by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/4578
- [XPU]Moe uses a new operator by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4585
- [Feature] Support Paddle-OCR by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4396
- [DataProcessor] add reasoning_tokens into usage info by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4520
- perf: Optimize task queue communication from engine to worker by @sunlei1024 in https://github.com/PaddlePaddle/FastDeploy/pull/4531
- [CI] Clean up ports after processing results by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/4587
- [CI] Add /re-run command in PR comments to restart failed CI workflows by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4593
- [Others] api server exits when worker process is dead by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/3271
- [XPU] bind some OPs for VL model with pybind by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/4522
- [V1 loader] Qwen25 VL support v1 loader and torch style safetensors load by @CSWYF3634076 in https://github.com/PaddlePaddle/FastDeploy/pull/4388
- [Feature] Support logprobs_mode by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4567
- [CI] Fix path error of /re-run by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4606
- [Feature] mm support prefix cache by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/4134
- [Graph Optimization]1.fix the bug of draft model with ep 2.fix sampler bug by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4589
- [XPU] update kunlun doc about supported models by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4586
- Adapt the benchmark tool to the SGLang framework by @ophilia-lee in https://github.com/PaddlePaddle/FastDeploy/pull/4607
- [Unitest]Add unitest of Attention Layer by @K11OntheBoat in https://github.com/PaddlePaddle/FastDeploy/pull/4494
- remove dev sync in prefill by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4598
- [BugFix] fix offline stream output when set enable_thinking param by @xyxinyang in https://github.com/PaddlePaddle/FastDeploy/pull/4603
- [BugFix] PaddleOCR-VL fix FD_DEBUG type and support v1 loader by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4605
- [Feature] EngineWorkerQueue anonymous port by @ST-XX in https://github.com/PaddlePaddle/FastDeploy/pull/4597
- [Docs] Add cli usage to docs by @xiaolei373 in https://github.com/PaddlePaddle/FastDeploy/pull/4569
- [CI] fix run-batch port from env in unittest by @xiaolei373 in https://github.com/PaddlePaddle/FastDeploy/pull/4613
- [CI] Relocate server test cases from ci_use directory to e2e by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4608
- [Graph Optimization][Speculative Decoding] Update yaml and fix typo by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4612
- Extend sleep time to 10 seconds in switch_service by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/4618
- [Speculative Decoding][MTP] Support MTP in EP+DP+TP mode by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/4614
- [BugFix] fix TPDP mix parallel infer by @lizhenyun01 in https://github.com/PaddlePaddle/FastDeploy/pull/4583
- [Graph Optimization] Fix IR graph dependency error exposed after enabling SOT by updating the return value of TextImageGatherScatter by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/4610
- [XPU]add xpu ci w4a8 case by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4501
- [CI][BugFix] fix port conflicts in concurrent ci test and add more unit test on async_llm by @xyxinyang in https://github.com/PaddlePaddle/FastDeploy/pull/4616
- [CI] Revert directory change of test_rollout_model due to intermittent failures by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4626
- feat: add support for API usage with multimodal models by @sunlei1024 in https://github.com/PaddlePaddle/FastDeploy/pull/4548
- [XPU] Support PaddleOCR-VL model for XPU by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/4529
- [Graph Optimization] Refactor default capture list by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4617
- [BugFix] fix paddleocr prefix cache bug by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4625
- [BugFix] fix import jit.marker.unified by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/4622
- add einops dependency by @zhang-prog in https://github.com/PaddlePaddle/FastDeploy/pull/4633
- [Feature] support logits processors by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/4515
- [Feature] support reward api by @xiaolei373 in https://github.com/PaddlePaddle/FastDeploy/pull/4518
- [BugFix] fix total_block_num init error in worker_process by @RichardWooSJTU in https://github.com/PaddlePaddle/FastDeploy/pull/4553
- [BugFix] Fix graph opt test case by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4634
- [Feature] add mm token usage by @ApplEOFDiscord in https://github.com/PaddlePaddle/FastDeploy/pull/4570
- [XPU] Update the return value of TextImageGatherScatter by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4636
- [Docs] Add PaddleOCR-VL-0.9B best practices by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4658
- [XPU] fix pos_emb_type bug by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/4638
- [Docs] add Qwen25vl yaml by @xjkmfa in https://github.com/PaddlePaddle/FastDeploy/pull/4662
- [Feature] add a new reasoning parser by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4571
- [XPU] [CI] Increase pytest timeout for XPU ep test by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4665
- add noaux_tc to unitest fused_moe by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4656
- [EP] fix several bugs in data parallel by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/4657
- [OP] Add InferShape&InferDtype for `per_token_quant_padding` by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/4667
- 【Hackathon 9th No.86】autogen `MoeFastHardamardImplWrapper` template_instantiation by @ccsuzzh in https://github.com/PaddlePaddle/FastDeploy/pull/4592
- [UT] Add ut for speculative sampler by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/4650
- [Doc] update docs by @ApplEOFDiscord in https://github.com/PaddlePaddle/FastDeploy/pull/4675
- [Graph Optimization] Add the CUDAGraph usage switch for Draft Model by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4601
- [CI] Add test for paddleocr_vl by @Limerances in https://github.com/PaddlePaddle/FastDeploy/pull/4627
- [unitest]add real gate_correction_bias weight to mock real data dispatch by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4676
- [noauxtc_kernel] remove useless code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4643
- [BugFix] fix offline llm chat "enable_thinking" is always "False" by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4686
- [BugFix] fix total_block_num init error in worker_process and test_async_llm not throw error by @xyxinyang in https://github.com/PaddlePaddle/FastDeploy/pull/4687
- [BugFix] fix --logprobs-mode raw_logits by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4681
- [XPU] xpu currently disable prefix cache for VL model by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4695
- [XPU] [CI] Add Vl case by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4649
- [BugFix] Fix finish reason in _create_chat_completion_choice by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4582
- [Feature] Unify the registration name recognition for tool_parser and reasoning_parser to “-” by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4668
- [BugFix] fix unittest of get_save_output_v1 by @Wanglongzhi2001 in https://github.com/PaddlePaddle/FastDeploy/pull/4701
- [XPU] [CI] Lock xvllm version by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4715
- [Graph Optimization] SOT+CUDAGraph support ERNIE4.5T VL 28B / 424B by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/4645
- [Feature] support mtp distribution equivalence verification by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/4699
- [KVCache] Support kv cache scale load by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/4624
- add flops and bandwidth to test_ffn.py by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4704
- Benchmark tool supports specifying response_format for constrained decoding scenarios by @ophilia-lee in https://github.com/PaddlePaddle/FastDeploy/pull/4718
- [CI] add missing unit tests for tokenizer_cli by @xiaolei373 in https://github.com/PaddlePaddle/FastDeploy/pull/4620
- [Scheduler] update v1 prefill batch by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/4611
- [BugFix] Fix profile run in pd-disaggregated deployment by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/4584
- [BugFix] fix mm prefix_cache cuda error bug by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/4679
- [Feature] Check bos url by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/4711
- [BugFix] fix wint2 config by @chang-wenbin in https://github.com/PaddlePaddle/FastDeploy/pull/4721
- [FDConfig] [PD Disaggregation] [Graph Optimization] Close Cudagraph for P node when PD Disaggregation by @littledgg in https://github.com/PaddlePaddle/FastDeploy/pull/4632
- [XPU] xpu support neox style ROPE by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4719
- [BugFix] Skip building native architecture when specifying arch list by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4727
- fix noaux by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4731
- [BugFix] fix thinking bug by @yuanlehome in https://github.com/PaddlePaddle/FastDeploy/pull/4710
- [CI] Fix rollout_model test logic by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4730
- [Feature] support pooling model runner by @lizexu123 in https://github.com/PaddlePaddle/FastDeploy/pull/4590
- format code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4720
- [CI] fix some ci yaml by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4747
- [Docs]Update XPU document version to 2.3.0 by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4741
- [Speculative Decoding][MTP] Support MTP in splitwise and scheduler_v1 mode by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/4743
- [Speculative Decoding][MTP]Support attn mask offset by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/4641
- [Docs]Add parameter to the start service command by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4753
- [Docs]Add parameter by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4755
- [Docs] fix PaddleOCR-VL docs bug by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4702
- [Feature] Support eplb for fd by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/4599
- [XPU] add v1 support for bf16 by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/4744
- [DataProcessor] add options thinking_mode by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/4735
- [Optimize] Support and robust for tpN for PD by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/4595
- [Docs] fix error by @yyssys in https://github.com/PaddlePaddle/FastDeploy/pull/4768
- [CI]test common model by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/4697
- [Metax] adapt cutlass moe for ernie-vl by @neilzhuu in https://github.com/PaddlePaddle/FastDeploy/pull/4685
- fix dynamic Cfp8 for RL load by @rsmallblue in https://github.com/PaddlePaddle/FastDeploy/pull/4144
- [Docs] PaddleOCR-VL add RTX3060 server param by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4765
- [BugFix] fix deepseek cuda error by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/4739
- [XPU][CI] fix ci base value bug by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4783
- [OP]Fix attn_params by @freeliuzc in https://github.com/PaddlePaddle/FastDeploy/pull/4787
- [CI]delete test_common_model by @bukejiyu in https://github.com/PaddlePaddle/FastDeploy/pull/4794
- [XPU] fix thinking bug where output only contains reasoning_content by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4761
- [XPU] add deployment doc for PaddleOCR-VL in XPU by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/4784
- [BugFix] Fix ernie4_5_vl_processor.py and qwen_vl_processor.py can not disable thinking by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4762
- supports internode_ll_two_stage by @carryyu in https://github.com/PaddlePaddle/FastDeploy/pull/4162
- supports pd partn by @carryyu in https://github.com/PaddlePaddle/FastDeploy/pull/4615
- [Docs] Add new support models by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4801
- [CI] Refactor CE wheel upload for multiple target paths by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4790
- [Docs] update mkdocs.yml by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/4804
- [BugFix] Fix step_shm_value in PD disaggregated deployment by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/4780
- Update Unit Test for PaddleOCR-VL by @Limerances in https://github.com/PaddlePaddle/FastDeploy/pull/4802
- [Metax] adapt cutlass moe and fix mla attention for DeepSeek by @xiaozude in https://github.com/PaddlePaddle/FastDeploy/pull/4602
- [Feature][Executor] GPU Model Runner Supports prompt_logprobs and max_logprobs by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4769
- [get_padding_offset.] clean get_padding_offset.cu by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4777
- support ep+tp at op layer by @zhupengyang in https://github.com/PaddlePaddle/FastDeploy/pull/4688
- [BugFix] fix reasoning parser register name by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4795
- remove input_ids from ForwardMeta by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4793
- [Feature] Add timestamp for profiler by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/4726
- [XPU]Support V1 loader in weight_only Model by @iosmers in https://github.com/PaddlePaddle/FastDeploy/pull/4808
- [Bug Fix] process transparent image by @ApplEOFDiscord in https://github.com/PaddlePaddle/FastDeploy/pull/4807
- add paddleocr_vl benchmark by @zhang-prog in https://github.com/PaddlePaddle/FastDeploy/pull/4833
- [Doc] Update docs for v2.3.0rc0 by @Jiang-Jia-Jun in https://github.com/PaddlePaddle/FastDeploy/pull/4828
- [BugFix] fix messages being inplace modified in offline chat api by @liyonghua0910 in https://github.com/PaddlePaddle/FastDeploy/pull/4831
- 【New Feature】W4afp8 supports per group quantization by @yangjianfengo1 in https://github.com/PaddlePaddle/FastDeploy/pull/4272
- [CI] fix docker_build error and add tag-base by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4810
- [PD Disaggregation] Support Qwen3-MoE use PD + EP inference. by @K11OntheBoat in https://github.com/PaddlePaddle/FastDeploy/pull/4691
- remove seq_lens_this_time by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4821
- [BugFix] Fix ernie_vl_reasoning_parsers.py 'end_token' to 'think_end_token' by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4805
- Fix: ci port conflict by @sunlei1024 in https://github.com/PaddlePaddle/FastDeploy/pull/4840
- [CI] Add unittest for activation, native_paddle_backend, w4a8, w4afp8, platforms/utils by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/4812
- [XPU][CI]Change ci vl model to 28 b by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4764
- [Fix] fix `ernie4_5_vl` model torch format loading by @aquagull in https://github.com/PaddlePaddle/FastDeploy/pull/4447
- [Feature] [PD] add simple router and refine splitwise deployment by @juncaipeng in https://github.com/PaddlePaddle/FastDeploy/pull/4709
- [Docs] fix: correct typo in nvidia_gpu.md by @playaswd in https://github.com/PaddlePaddle/FastDeploy/pull/4848
- [BugFix] Fix list to List by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/4818
- [BugFix] Del get_act_fn, _load_st_projector by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/4824
- [Benchmark] Enhance benchmark output logging by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/4682
- [XPU] ep+tp all2all by @zhupengyang in https://github.com/PaddlePaddle/FastDeploy/pull/4836
- [CI] Add Check PR Template by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4481
- Revert "【New Feature】W4afp8 supports per group quantization" by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4854
- [CI] Update deploy.py by @ZhangYulongg in https://github.com/PaddlePaddle/FastDeploy/pull/4850
- [CI] Optimize port cleanup logic by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4860
- [Bug Fix] fix ernie4_5_vl_moe by @LokeZhou in https://github.com/PaddlePaddle/FastDeploy/pull/4843
- Revert "[Bug Fix] fix ernie4_5_vl_moe" by @Jiang-Jia-Jun in https://github.com/PaddlePaddle/FastDeploy/pull/4863
- [Feature] support mm disable_chunked by @kevincheng2 in https://github.com/PaddlePaddle/FastDeploy/pull/4803
- [CI] Update ERNIE-4.5-VL baseline to adapt to MoE changes by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4867
- [CI] Refactor check-bypass logic in run_tests_with_coverage by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4655
- [Others] Delete PaddleOCR Useless Function by @Limerances in https://github.com/PaddlePaddle/FastDeploy/pull/4815
- [Feature] Optim PaddleOCR-VL by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4873
- [XPU] fix ep_tp all2all ci by @zhupengyang in https://github.com/PaddlePaddle/FastDeploy/pull/4876
- [XPU] modify 424B model deployment parameter by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4888
- [XPU][CI] Ci bug fix by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4889
- [BugFix] fix token_processor zmq by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4827
- [CI] fix docker_build error of ciuse by @EmmonsCurse in https://github.com/PaddlePaddle/FastDeploy/pull/4886
- [Metax] support ERNIE-4.5-VL-28B by @neilzhuu in https://github.com/PaddlePaddle/FastDeploy/pull/4820
- [BugFix] max_logprobs=-1 maps to ori_vocab_size by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4884
- [Feature] Enable FastDeploy to support adding the “--api-key” authentication parameter. by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4806
- [Docs]Supplement the English and Chinese user documentation for Tool calling by @AuferGachet in https://github.com/PaddlePaddle/FastDeploy/pull/4895
- [XPU][CI]Update test assertion and base response value by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4907
- [BugFix] When the value of "temperature" is 0, adjust it to 1e-06 by @luukunn in https://github.com/PaddlePaddle/FastDeploy/pull/4900
- [Docs] add api-key usage instructions by @LiqinruiG in https://github.com/PaddlePaddle/FastDeploy/pull/4902
- [CI] Add four unittest by @Echo-Nie in https://github.com/PaddlePaddle/FastDeploy/pull/4906
- [Bug Fix] fix bug for PD EP by @rainyfly in https://github.com/PaddlePaddle/FastDeploy/pull/4823
- [DeepEP] support async prefill by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4899
- [XPU]Update documentation by @qw86972190 in https://github.com/PaddlePaddle/FastDeploy/pull/4917
- [Docs] Improve reasoning_out docs by @LiqinruiG in https://github.com/PaddlePaddle/FastDeploy/pull/4901
- [BugFix] Fix inference_start_time by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4922
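Several entries above extend the OpenAI-compatible serving API with new sampling and observability options (logprobs_mode in #4567, prompt_logprobs/max_logprobs in #4769, logits processors in #4515). As a minimal sketch of how a client might exercise the logprobs options through a standard chat-completions payload — the model name is a placeholder, and any field beyond the common OpenAI chat schema should be checked against the FastDeploy parameter docs rather than taken from this example:

```python
import json

def build_chat_request(prompt: str, *, logprobs: bool = True, top_logprobs: int = 5) -> str:
    """Build an OpenAI-style chat-completions payload requesting per-token logprobs."""
    payload = {
        "model": "ERNIE-4.5-21B-A3B",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        "logprobs": logprobs,          # ask the server to return token logprobs
        "top_logprobs": top_logprobs,  # number of alternatives per position
    }
    return json.dumps(payload)

# The serialized payload would be POSTed to the server's /v1/chat/completions endpoint.
req = json.loads(build_chat_request("Hello"))
print(req["top_logprobs"])
```

The response's `usage` field can then be inspected for the new multimodal and reasoning token counters described in the release highlights.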
New Contributors
- @ooooo-create made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/3609
- @zhupengyang made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/3911
- @WanRui37 made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/3935
- @handsomecoderyang made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4027
- @fmiao2372 made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4161
- @YuhanXu made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4218
- @ccsuzzh made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4274
- @xiaozude made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4208
- @fangfangssj made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4312
- @tianshuo78520a made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4427
- @JianyuLi01 made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4445
- @Limerances made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4240
- @ST-XX made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4597
- @zhang-prog made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4633
- @neilzhuu made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4685
- @juncaipeng made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4709
- @playaswd made their first contribution in https://github.com/PaddlePaddle/FastDeploy/pull/4848
Full Changelog: v2.2.1...v2.3.0