Fix sparse mask handling in softmax kernel#33814
maxnick merged 4 commits into openvinotoolkit:master from
Conversation
rkazants
left a comment
please implement tests
We ran performance and accuracy tests; the results can be found in this JIRA ticket: https://jira.devtools.intel.com/browse/CVS-179625
liubo-intel
left a comment
Hi @mangguo321: from my understanding, your changes fix a NaN data issue.
For Xattention cases, these changes make sense to skip the sparse block, and LGTM.
But I think, when we have time, it would be better to find out why these v_a/a[i] values contain NaN data. Since the v_a/a[i] values serve as input data for this kernel, they are expected to be finite under normal conditions, unless there was a computational error during the previous calculation or a mistake during data loading.
Hi @liubo-intel, the input to the softmax kernel is the output of the QK GEMM. In the sparse attention path, the sparse mask causes some blocks to be skipped, so those blocks are not written by the GEMM kernel; as a result, the corresponding regions in the output buffer remain uninitialized and their contents may decode to NaN/Inf values.
No appropriate PR description, no JIRA ticket, no tests.
Hi @rkazants, we updated the description with PR details. The JIRA ticket is already referenced in the description. We tested this change with qwen2-7b-instruct and llama-3.2-3b-instruct; the accuracy issue reported in the ticket is resolved, and no performance regression was observed. The test results are in the JIRA ticket. Please let me know if any additional information is needed.
Hi @maxnick, could you please review? Thanks!
zhangYiIntel
left a comment
LGTM. NaN + anything still equals NaN, so the setting approach is much better!
@mangguo321, could you please cover your changes with a single-layer test, either by extending the existing test configurations or by developing a new one?
A softmax kernel unit test was added to cover the code changes in this PR.
@rkazants, the dedicated unit tests were added.
No more concerns regarding PR description and tests.
65b105a
### Details:
- *Fix sparse mask handling in the softmax kernel. In the sparse attention path, the sparse mask caused some blocks to be skipped, so those blocks were not written by the GEMM kernel; as a result, the corresponding regions in the output buffer remained uninitialized and their contents could decode to NaN/Inf values.*
- *In this PR, we overwrite the skipped regions with -FLT_MAX to prevent NaN propagation and avoid incorrect computations in downstream kernels.*

### Tickets:
- *[CVS-179625](https://jira.devtools.intel.com/browse/CVS-179625)*