
Numerical stability issue in recent commits since 0.2.0 #805

Open
@rchardx

Description

Environment: CUDA 12.6, Hopper architecture.

Recent commits have noticeably degraded the numerical accuracy of the attention kernels. The logs below show that, depending on the commit, the prefill outputs differ considerably from the float reference implementation.
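For readers who want to reproduce the comparison outside the test binary, the check can be sketched roughly as follows. This is a minimal illustration, not the code in test_flashinfer_prefill.cu: the function name `count_mismatches` and the tolerances `rtol` and `atol` are assumptions, since the thresholds the test actually uses are not visible in the logs.

```python
import numpy as np

def count_mismatches(o_host, o_ref, rtol=1e-3, atol=1e-3):
    """Count elements where the kernel output deviates from the float reference.

    rtol and atol are illustrative values, not the thresholds used by the test.
    """
    o_host = np.asarray(o_host, dtype=np.float32)
    o_ref = np.asarray(o_ref, dtype=np.float32)
    diff = np.abs(o_host - o_ref)
    # A comparison involving NaN is False, so a NaN output counts as a mismatch.
    close = diff <= atol + rtol * np.abs(o_ref)
    return int((~close).sum()), diff

# o_host / o_ref values copied from the first two diff lines of the
# commit 054778 log in this report.
o_host = [-6.591797e-03, -1.226807e-02]
o_ref = [-7.873535e-03, -1.348877e-02]
n, diff = count_mismatches(o_host, o_ref)
```

With these assumed tolerances both elements are flagged, and `diff[0]` reproduces the logged value of about 1.28e-03.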

One concern is that these diffs appear to be growing from commit to commit, which may point to an underlying issue.
Another is that the FA3 template produces NaNs in the results after prefilling.
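A quick way to confirm the NaN problem on a copied-back output buffer is sketched below. The helper name `nan_report` is hypothetical and the input array is synthetic, not data from a real run.

```python
import numpy as np

def nan_report(output):
    """Return the number of NaN entries and their flat indices."""
    flat = np.asarray(output, dtype=np.float32).ravel()
    nan_idx = np.flatnonzero(np.isnan(flat))
    return int(nan_idx.size), nan_idx

# Synthetic example data, only to show the call.
out = np.array([[0.5, np.nan], [1.0, 2.0]], dtype=np.float32)
count, idx = nan_report(out)
```

The flat indices can then be mapped back to (token, head, dim) positions to see whether the NaNs cluster in particular heads or rows.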

We kindly ask the developers to keep numerical accuracy in mind in future updates.

main commit: 054778

W20250211 04:30:59.612421 33250 test_flashinfer_prefill.cu:338] batch_size: 6, num_qo_heads: 8, num_kv_heads: 1, head_dim: 128, num_mismatches: 80
W20250211 04:30:59.612534 33250 test_flashinfer_prefill.cu:343] diff: 1.281738e-03, idx: 4052, o_host: -6.591797e-03, o_ref: -7.873535e-03
W20250211 04:30:59.612591 33250 test_flashinfer_prefill.cu:343] diff: 1.220703e-03, idx: 4064, o_host: -1.226807e-02, o_ref: -1.348877e-02
W20250211 04:30:59.612596 33250 test_flashinfer_prefill.cu:343] diff: 1.220703e-03, idx: 1636, o_host: 4.223633e-02, o_ref: 4.345703e-02
W20250211 04:30:59.612600 33250 test_flashinfer_prefill.cu:343] diff: 1.220703e-03, idx: 1606, o_host: 2.429199e-02, o_ref: 2.551270e-02
W20250211 04:30:59.612604 33250 test_flashinfer_prefill.cu:343] diff: 1.220703e-03, idx: 1553, o_host: 3.198242e-02, o_ref: 3.320312e-02
....../tests/kernel/test_flashinfer_prefill.cu:348: Failure
Expected equality of these values:
  num_mismatches
    Which is: 80
  0

main commit: 956910

W20250211 04:20:27.498323 31762 test_flashinfer_prefill.cu:338] batch_size: 6, num_qo_heads: 8, num_kv_heads: 1, head_dim: 128, num_mismatches: 297
W20250211 04:20:27.498430 31762 test_flashinfer_prefill.cu:343] diff: 2.929688e-03, idx: 5820, o_host: -1.513672e-01, o_ref: -1.484375e-01
W20250211 04:20:27.498482 31762 test_flashinfer_prefill.cu:343] diff: 2.441406e-03, idx: 5840, o_host: 9.277344e-02, o_ref: 9.033203e-02
W20250211 04:20:27.498489 31762 test_flashinfer_prefill.cu:343] diff: 2.258301e-03, idx: 5766, o_host: 1.594543e-03, o_ref: -6.637573e-04
W20250211 04:20:27.498495 31762 test_flashinfer_prefill.cu:343] diff: 1.953125e-03, idx: 5847, o_host: 7.080078e-02, o_ref: 6.884766e-02
W20250211 04:20:27.498500 31762 test_flashinfer_prefill.cu:343] diff: 1.953125e-03, idx: 2983, o_host: -8.007812e-02, o_ref: -8.203125e-02
....../tests/kernel/test_flashinfer_prefill.cu:348: Failure
Expected equality of these values:
  num_mismatches
    Which is: 297
  0

main commit: 9f5fbe

W20250211 05:20:10.827390 38669 test_flashinfer_prefill.cu:460] batch_size: 4, num_qo_heads: 28, num_kv_heads: 4, head_dim: 128, num_mismatches: 25
W20250211 05:20:10.827448 38669 test_flashinfer_prefill.cu:465] diff: 0.000793457, idx: 256058
W20250211 05:20:10.827464 38669 test_flashinfer_prefill.cu:465] diff: 0.000793457, idx: 253146
W20250211 05:20:10.827469 38669 test_flashinfer_prefill.cu:465] diff: 0.000793457, idx: 252594
W20250211 05:20:10.827476 38669 test_flashinfer_prefill.cu:465] diff: 0.000747681, idx: 219744
W20250211 05:20:10.827483 38669 test_flashinfer_prefill.cu:465] diff: 0.000717163, idx: 217179
....../tests/kernel/test_flashinfer_prefill.cu:468: Failure
Expected equality of these values:
  num_mismatches
    Which is: 25
  0
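To track these numbers across commits, the logs can be reduced to a mismatch count and a maximum diff per run. The sketch below assumes the log lines keep the `num_mismatches: N` and `diff: X, idx: Y` layout shown above; the function name `summarize` and the regular expressions are my own, not part of the test suite.

```python
import re

# Patterns assume the log layout shown in this report.
DIFF_RE = re.compile(r"diff: ([0-9.eE+-]+), idx: (\d+)")
MISMATCH_RE = re.compile(r"num_mismatches: (\d+)")

def summarize(log_text):
    """Extract the mismatch count and the largest logged diff from one run."""
    diffs = [float(m.group(1)) for m in DIFF_RE.finditer(log_text)]
    m = MISMATCH_RE.search(log_text)
    return {
        "num_mismatches": int(m.group(1)) if m else None,
        "max_diff": max(diffs) if diffs else None,
    }

# First three lines of the commit 054778 log, copied verbatim.
sample = (
    "W20250211 04:30:59.612421 33250 test_flashinfer_prefill.cu:338] batch_size: 6, "
    "num_qo_heads: 8, num_kv_heads: 1, head_dim: 128, num_mismatches: 80\n"
    "W20250211 04:30:59.612534 33250 test_flashinfer_prefill.cu:343] diff: 1.281738e-03, "
    "idx: 4052, o_host: -6.591797e-03, o_ref: -7.873535e-03\n"
    "W20250211 04:30:59.612591 33250 test_flashinfer_prefill.cu:343] diff: 1.220703e-03, "
    "idx: 4064, o_host: -1.226807e-02, o_ref: -1.348877e-02\n"
)
result = summarize(sample)
```

Note that the 9f5fbe run reports a different configuration (batch_size 4, 28 query heads, 4 KV heads) than the other two, so its counts should be compared with care.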
