Tune attention perf to align with IPEX attention functions (vllm-project#162)
* tune perf for decoding kernel
Signed-off-by: baodii <di.bao@intel.com>
* add block_dispatch for 64 and 128
Signed-off-by: baodii <di.bao@intel.com>
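The block_dispatch change above routes to a kernel variant specialized for block size 64 or 128. A minimal sketch of that dispatch idea in Python (the kernel names and the lookup-table shape are illustrative assumptions, not the repo's actual identifiers):

```python
def dispatch_by_block(block_size: int) -> str:
    """Select an attention kernel specialized for the given block size.

    Mirrors the idea of block_dispatch for 64 and 128; the variant
    names here are placeholders, not the real kernel symbols.
    """
    kernels = {
        64: "attn_kernel_block64",
        128: "attn_kernel_block128",
    }
    if block_size not in kernels:
        raise ValueError(f"unsupported block size: {block_size}")
    return kernels[block_size]
```

Specializing per block size lets each variant pick tile shapes and register usage tuned for its workload instead of one generic kernel handling both.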
* prefetch table before gemm
Signed-off-by: Yizhou Wang <yizhou.wang@intel.com>
* temp opt: set num_splits_kv = 1 for llama3-8b
Signed-off-by: baodii <di.bao@intel.com>
* add strategy for num_splits_kv
Signed-off-by: baodii <di.bao@intel.com>
* Delete tests/flash_attn/test_flash_attn_varlen_func_perf.py
Signed-off-by: baodii <di.bao@intel.com>
* make format happy
Signed-off-by: baodii <di.bao@intel.com>
* fix chunked prefill acc issue when not paged
Signed-off-by: baodii <di.bao@intel.com>
* update UT
Signed-off-by: baodii <di.bao@intel.com>
* make format happy
Signed-off-by: baodii <di.bao@intel.com>
* restore and update UT
Signed-off-by: baodii <di.bao@intel.com>
* Resolve Copilot review comments from PR vllm-project#162
- Add bounds check for page_local_idx in chunk_prefill_mainloop.hpp
- Fix get_num_splits to use batch_size instead of num_tokens in flash_api.cpp
- Add docstring for num_splits_kv in flash_attn_interface.py
Signed-off-by: baodii <di.bao@intel.com>
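The first review fix adds a bounds check on `page_local_idx` in the paged KV-cache lookup. A hedged sketch of what such a check looks like (function and variable names are illustrative; the real logic lives in chunk_prefill_mainloop.hpp):

```python
def lookup_kv_block(block_table, batch_idx, page_local_idx):
    """Map a sequence-local page index to a physical KV-cache block.

    The bounds check mirrors the fix described above: a KV split may
    cover positions past the sequence's last valid page, so an
    out-of-range page index must be rejected rather than read.
    Names here are placeholders, not the kernel's identifiers.
    """
    pages = block_table[batch_idx]
    if page_local_idx >= len(pages):
        # Out-of-range page: caller skips it instead of reading garbage.
        return None
    return pages[page_local_idx]
```

Without the check, a split whose KV range extends past the sequence end would index past the block table and read unrelated cache blocks.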
* use sm_count to replace hardcoded 20
Signed-off-by: baodii <di.bao@intel.com>
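The num_splits_kv strategy and the sm_count change above fit together: split the KV sequence only when there aren't enough (batch, head) work items to fill the device's streaming multiprocessors, using the queried SM count instead of a hardcoded 20. A hedged Python sketch of such a heuristic (the exact policy and the `max_splits` cap are assumptions; `get_num_splits` in flash_api.cpp is the real implementation):

```python
def get_num_splits(batch_size: int, num_heads: int, sm_count: int,
                   max_splits: int = 128) -> int:
    """Choose how many chunks to split the KV sequence into.

    Heuristic sketch only: aim for roughly batch_size * num_heads *
    num_splits units of parallel work so the GPU's SMs stay occupied.
    Per the review fix above, the decision is driven by batch_size
    (not num_tokens), and sm_count replaces the old hardcoded 20.
    """
    work_items = batch_size * num_heads
    if work_items >= sm_count:
        # Enough parallelism already: splitting only adds reduction cost.
        return 1
    # Split the KV dimension until the SMs are roughly covered.
    splits = (sm_count + work_items - 1) // work_items  # ceil division
    return min(splits, max_splits)
```

For example, a large decode batch returns 1 split (matching the llama3-8b temporary optimization above), while a single short request on a 20-SM part spreads its KV sequence across many splits.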
---------
Signed-off-by: baodii <di.bao@intel.com>
Signed-off-by: Yizhou Wang <yizhou.wang@intel.com>
Co-authored-by: Yizhou Wang <yizhou.wang@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>