Optimize FP8 Triton kernels for better performance by yurekami · Pull Request #1066 · deepseek-ai/DeepSeek-V3

yurekami · 2025-12-25T13:13:15Z

Summary

This PR addresses Issue #1052 by optimizing the Triton FP8 kernels for improved performance and correctness.

Changes

1. act_quant_kernel improvements:

Added boundary masking for partial blocks to prevent out-of-bounds memory access
Added n_elements parameter for proper boundary handling
Extracted FP8_E4M3_MAX constant (448.0) for code clarity

2. fp8_gemm_kernel optimizations:

Extended autotuning configurations with larger block sizes (up to 128x256)
Added dynamic num_warps calculation based on block dimensions for optimal GPU occupancy
Added M dimension to autotune key for better configuration selection
Introduced explicit stride parameters for flexible memory layouts
Improved code documentation

3. Autotuning configuration improvements:

Expanded block_m options: [16, 32, 64] → [16, 32, 64, 128]
Expanded block_n options: [32, 64, 128] → [32, 64, 128, 256]
Reduced num_stages options: [3, 4, 5, 6] → [3, 4, 5] for faster tuning
Added tile size limit (16384 elements) to avoid excessive register pressure

Expected Benefits

Better memory access patterns: Explicit strides allow the compiler to generate more efficient code
Improved GPU utilization: Dynamic num_warps adapts to block size for optimal occupancy
Faster autotuning: Reduced configuration space while covering important cases
Correctness: Boundary masking prevents out-of-bounds access in edge cases

Test plan

Python syntax validation passes (python3 -m py_compile inference/kernel.py)
Run inference with sample input to verify output correctness
Benchmark on representative workloads to measure performance improvement

Related Issues

Closes #1052

🤖 Generated with Claude Code

This PR addresses Issue deepseek-ai#1052 with the following improvements: 1. act_quant_kernel: - Added boundary masking for partial blocks to prevent out-of-bounds access - Added n_elements parameter for proper boundary handling - Extracted FP8_E4M3_MAX constant for clarity 2. fp8_gemm_kernel: - Extended autotuning configs with larger block sizes (128x256) - Added dynamic num_warps calculation based on block dimensions - Added M to autotune key for better config selection - Introduced explicit stride parameters for flexible memory layouts - Improved code documentation 3. Autotuning improvements: - Expanded block_m options: [16, 32, 64] -> [16, 32, 64, 128] - Expanded block_n options: [32, 64, 128] -> [32, 64, 128, 256] - Reduced num_stages options: [3, 4, 5, 6] -> [3, 4, 5] for faster tuning - Added tile size limit (16384) to avoid excessive register pressure 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

esball1 · 2026-01-16T15:10:30Z

yeah, just, you know... did you even optimize something?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize FP8 Triton kernels for better performance#1066

Optimize FP8 Triton kernels for better performance#1066
yurekami wants to merge 1 commit intodeepseek-ai:mainfrom
yurekami:optimize-fp8-triton-kernels

yurekami commented Dec 25, 2025

Uh oh!

esball1 commented Jan 16, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

yurekami commented Dec 25, 2025

Summary

Changes

Expected Benefits

Test plan

Related Issues

Uh oh!

esball1 commented Jan 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

esball1 commented Jan 16, 2026 •

edited

Loading