Open
Description
Summary
FlashInfer v0.6.4 has been released with new features and performance improvements. The current pin is v0.5.3 in docker/Dockerfile.cuda line 64.
Key Changes
New Features
- MXFP8 GEMM Support - New mm_mxfp8 kernel for mixed-precision inference
- TRTLLM-Gen Skip-Softmax Kernels - For both prefill and decode with MLA support
- SM120 (Blackwell) Improvements:
  - FP4 GEMM tile configs with streamK scheduler
  - Fixed fp8_blockscale_gemm in AOT jit-cache
- Sampling CUDA Graph Fix - Resolves issues with CUDA graph mode
- Hopper CI Support - Added Hopper architecture to continuous integration
Performance Improvements
- Cache cudaGetDeviceProperties in gdn_prefill to avoid per-call overhead
- Parallel testing support in unit test script
- Improved GDN decode with cuteDSL kernel
- FP4 one-shot launch config stability fix in allreduce fusion
Bug Fixes
- Fix for W4A8 autotune crash in cutlass_fused_moe profiler
- Fix enum/int type mismatch in sampling
- InternVL and Qwen3N test case additions
- Video bucketing and warmup fixes
Files Affected
Per docs/upstream-versions.md:
- docker/Dockerfile.cuda line 64 (FLASHINFER tag)
Recommended Actions
- Review compatibility with vLLM commit d7de043d55d1dd629554467e23874097e1c48993
- Check for FlashInfer API usage in vLLM integration:
grep -r "flashinfer" docker/ --include="*.py" --include="Dockerfile*"
- Update Dockerfile:
# Line 64 in docker/Dockerfile.cuda
ARG FLASHINFER_VERSION=v0.6.4
- Test container build with new FlashInfer version
- Run E2E tests focusing on attention kernels and MoE operations
- Validate Blackwell/Hopper GPU support if using SM120/SM90 hardware
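The "Update Dockerfile" step above could be scripted so the bump is repeatable and fails loudly if the pin is not where the issue expects it. The following is a minimal sketch, not the project's tooling: the helper name `bump_flashinfer_pin` is hypothetical, and it assumes the pin is a single line of the form `ARG FLASHINFER_VERSION=<tag>`, as shown in the step above.

```python
import re

# Assumption: the pin is one line of the form `ARG FLASHINFER_VERSION=<tag>`.
# Adjust the ARG name if the Dockerfile uses a different one.
# [ \t]* (not \s*) keeps the match on a single line so no newlines are consumed.
PIN_RE = re.compile(
    r"^([ \t]*ARG[ \t]+FLASHINFER_VERSION=)(\S+)[ \t]*$",
    re.MULTILINE,
)


def bump_flashinfer_pin(dockerfile_text: str, new_tag: str) -> tuple[str, str]:
    """Return (updated_text, old_tag).

    Raises ValueError if the pin is missing or appears more than once,
    so a silent no-op or a partial update cannot slip through.
    """
    matches = PIN_RE.findall(dockerfile_text)
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one FLASHINFER_VERSION pin, found {len(matches)}"
        )
    old_tag = matches[0][1]
    # A callable replacement avoids backslash/group-reference surprises in new_tag.
    updated = PIN_RE.sub(lambda m: m.group(1) + new_tag, dockerfile_text)
    return updated, old_tag
```

To use it, read docker/Dockerfile.cuda, pass its text and "v0.6.4" to the helper, write the returned text back, and confirm the returned old tag is v0.5.3 before building the container.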
Impact
MEDIUM - Minor version update with new features but no breaking changes detected. New kernels and optimizations may improve performance.
Upstream References
- Release: https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.6.4
- Full Changelog: flashinfer-ai/flashinfer@v0.5.3...v0.6.4
> Generated by Upstream Dependency Monitor