Description
Environment: CUDA 12.6, Hopper architecture.
Recent commits appear to have significantly degraded the numerical accuracy of the attention kernels. The logs below, taken at different commits, show outputs that differ considerably from the float reference implementation.
My main concern is that these diffs seem to be growing from commit to commit, which may indicate an underlying issue.
A separate issue is that the FA3 template produces NaNs in the output after prefill.
We kindly ask the developers to keep numerical accuracy in mind in future updates.
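For anyone who wants to reproduce the comparison outside the test suite, here is a minimal sketch of the kind of check involved: it compares a kernel output against a float reference, counts out-of-tolerance entries, and counts NaNs separately so that the NaN problem is not folded into the mismatch count. `CompareResult`, `CompareWithReference`, `rtol`, and `atol` are hypothetical names chosen for illustration. This is not the code in `test_flashinfer_prefill.cu`, and the tolerance the real test uses is not shown in the logs.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration; not the project's actual test code.
struct CompareResult {
  std::size_t num_mismatches = 0;  // finite outputs outside the tolerance
  std::size_t num_nans = 0;        // NaN entries in the kernel output
  float max_abs_diff = 0.0f;       // largest |out - ref| among non-NaN outputs
  std::size_t max_idx = 0;         // index of that largest difference
};

// `out` is the kernel output converted to float, `ref` is the float reference.
// An entry is a mismatch when |out - ref| > atol + rtol * |ref|.
// rtol and atol are assumptions; suitable values depend on the data type.
inline CompareResult CompareWithReference(const std::vector<float>& out,
                                          const std::vector<float>& ref,
                                          float rtol, float atol) {
  CompareResult r;
  const std::size_t n = out.size() < ref.size() ? out.size() : ref.size();
  for (std::size_t i = 0; i < n; ++i) {
    if (std::isnan(out[i])) {
      // Count NaNs on their own: any comparison against NaN is false,
      // so they would otherwise be silently skipped.
      ++r.num_nans;
      continue;
    }
    const float diff = std::fabs(out[i] - ref[i]);
    if (diff > atol + rtol * std::fabs(ref[i])) {
      ++r.num_mismatches;
    }
    if (diff > r.max_abs_diff) {
      r.max_abs_diff = diff;
      r.max_idx = i;
    }
  }
  return r;
}
```

Calling `CompareWithReference(o_host, o_ref, 1e-3f, 1e-3f)` on the host copies of the two buffers returns both counts in one pass. The `1e-3f` values are only placeholders.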
main commit: 054778
W20250211 04:30:59.612421 33250 test_flashinfer_prefill.cu:338] batch_size: 6, num_qo_heads: 8, num_kv_heads: 1, head_dim: 128, num_mismatches: 80
W20250211 04:30:59.612534 33250 test_flashinfer_prefill.cu:343] diff: 1.281738e-03, idx: 4052, o_host: -6.591797e-03, o_ref: -7.873535e-03
W20250211 04:30:59.612591 33250 test_flashinfer_prefill.cu:343] diff: 1.220703e-03, idx: 4064, o_host: -1.226807e-02, o_ref: -1.348877e-02
W20250211 04:30:59.612596 33250 test_flashinfer_prefill.cu:343] diff: 1.220703e-03, idx: 1636, o_host: 4.223633e-02, o_ref: 4.345703e-02
W20250211 04:30:59.612600 33250 test_flashinfer_prefill.cu:343] diff: 1.220703e-03, idx: 1606, o_host: 2.429199e-02, o_ref: 2.551270e-02
W20250211 04:30:59.612604 33250 test_flashinfer_prefill.cu:343] diff: 1.220703e-03, idx: 1553, o_host: 3.198242e-02, o_ref: 3.320312e-02
....../tests/kernel/test_flashinfer_prefill.cu:348: Failure
Expected equality of these values:
num_mismatches
Which is: 80
0
main commit: 956910
W20250211 04:20:27.498323 31762 test_flashinfer_prefill.cu:338] batch_size: 6, num_qo_heads: 8, num_kv_heads: 1, head_dim: 128, num_mismatches: 297
W20250211 04:20:27.498430 31762 test_flashinfer_prefill.cu:343] diff: 2.929688e-03, idx: 5820, o_host: -1.513672e-01, o_ref: -1.484375e-01
W20250211 04:20:27.498482 31762 test_flashinfer_prefill.cu:343] diff: 2.441406e-03, idx: 5840, o_host: 9.277344e-02, o_ref: 9.033203e-02
W20250211 04:20:27.498489 31762 test_flashinfer_prefill.cu:343] diff: 2.258301e-03, idx: 5766, o_host: 1.594543e-03, o_ref: -6.637573e-04
W20250211 04:20:27.498495 31762 test_flashinfer_prefill.cu:343] diff: 1.953125e-03, idx: 5847, o_host: 7.080078e-02, o_ref: 6.884766e-02
W20250211 04:20:27.498500 31762 test_flashinfer_prefill.cu:343] diff: 1.953125e-03, idx: 2983, o_host: -8.007812e-02, o_ref: -8.203125e-02
....../tests/kernel/test_flashinfer_prefill.cu:348: Failure
Expected equality of these values:
num_mismatches
Which is: 297
0
main commit: 9f5fbe
W20250211 05:20:10.827390 38669 test_flashinfer_prefill.cu:460] batch_size: 4, num_qo_heads: 28, num_kv_heads: 4, head_dim: 128, num_mismatches: 25
W20250211 05:20:10.827448 38669 test_flashinfer_prefill.cu:465] diff: 0.000793457, idx: 256058
W20250211 05:20:10.827464 38669 test_flashinfer_prefill.cu:465] diff: 0.000793457, idx: 253146
W20250211 05:20:10.827469 38669 test_flashinfer_prefill.cu:465] diff: 0.000793457, idx: 252594
W20250211 05:20:10.827476 38669 test_flashinfer_prefill.cu:465] diff: 0.000747681, idx: 219744
W20250211 05:20:10.827483 38669 test_flashinfer_prefill.cu:465] diff: 0.000717163, idx: 217179
....../tests/kernel/test_flashinfer_prefill.cu:468: Failure
Expected equality of these values:
num_mismatches
Which is: 25
0
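The logs report absolute diffs, which are hard to judge without the magnitude of the reference value. The sketch below computes the relative error for a few entries copied from the logs of commits 054778 and 956910; the 9f5fbe log does not print `o_host` or `o_ref`, so it cannot be included. `LoggedDiff`, `RelativeError`, and `PrintRelativeErrors` are hypothetical names used only for this illustration.

```cpp
#include <cmath>
#include <cstdio>

// One logged entry; the values are copied verbatim from the log lines above.
struct LoggedDiff {
  const char* commit;
  int idx;
  double o_host;
  double o_ref;
};

// Relative error with respect to the reference value.
// Assumes o_ref is non-zero, which holds for the entries used here.
inline double RelativeError(double o_host, double o_ref) {
  return std::fabs(o_host - o_ref) / std::fabs(o_ref);
}

inline void PrintRelativeErrors() {
  const LoggedDiff entries[] = {
      {"054778", 4052, -6.591797e-03, -7.873535e-03},
      {"956910", 5820, -1.513672e-01, -1.484375e-01},
      {"956910", 5766, 1.594543e-03, -6.637573e-04},
  };
  for (const LoggedDiff& e : entries) {
    std::printf("commit %s idx %d abs %.6e rel %.4f\n", e.commit, e.idx,
                std::fabs(e.o_host - e.o_ref),
                RelativeError(e.o_host, e.o_ref));
  }
}
```

Entries with similar absolute diffs can have very different relative errors, and at idx 5766 the output and the reference even have opposite signs, so reporting both measures may help when triaging.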