
Attention to better MLPerf and beyond #108


Description

@raikonenfnu
  1. General Attention Health
  • Modifying kWidth to maximize reads from shared memory
  • Modifying kWidth such that FP8 does not need a trip through shared memory.
  • Enable attention transposeV when possible (in progress)
  • Dot slicing for better instruction scheduling
  • Buffer loads for free masking and to move K/V directly from global to shared memory
  • Instruction scheduling / software pipelining to overlap MMA and softmax
  • Prefetch/MultiBuffering
  • Try dot3d / single-kernel split-K to get faster attention in the decode phase (see the sketch after this list)
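
To make the split-K decode item concrete, here is a minimal NumPy sketch of the underlying math: each split computes a partial softmax-weighted sum over its chunk of the KV cache, and the partials are merged with a log-sum-exp rescaling, which is the reduction a single-kernel split-K (flash-decoding style) attention would perform across workgroups instead of in a second kernel launch. The function name `attention_decode_splitk`, the chunking, and the shapes are assumptions for illustration only, not code from this project.

```python
# Hypothetical sketch (not code from this repo): split-K attention for the
# decode phase, where a single query attends to a long KV cache.
import numpy as np

def attention_decode_splitk(q, k, v, num_splits=4):
    """q: (d,), k/v: (seq, d). Returns softmax(q @ k^T / sqrt(d)) @ v."""
    d = q.shape[0]
    scale = 1.0 / np.sqrt(d)
    k_chunks = np.array_split(k, num_splits, axis=0)
    v_chunks = np.array_split(v, num_splits, axis=0)

    # Per-split partials: local max m_i, local sum of exponentials l_i, and
    # the unnormalized output acc_i over that split's KV chunk.
    partials = []
    for k_i, v_i in zip(k_chunks, v_chunks):
        s = (k_i @ q) * scale              # (chunk,) attention scores
        m_i = s.max()
        p = np.exp(s - m_i)                # locally stabilized softmax numerator
        l_i = p.sum()
        acc_i = p @ v_i                    # (d,) unnormalized partial output
        partials.append((m_i, l_i, acc_i))

    # Merge splits with log-sum-exp rescaling: this is the reduction that a
    # single-kernel split-K fuses rather than running as a separate kernel.
    m = max(p[0] for p in partials)
    l = 0.0
    acc = np.zeros_like(q, dtype=np.float64)
    for m_i, l_i, acc_i in partials:
        alpha = np.exp(m_i - m)
        l += alpha * l_i
        acc += alpha * acc_i
    return acc / l

# Reference check against plain softmax attention.
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=64), rng.normal(size=(1024, 64)), rng.normal(size=(1024, 64))
s = (k @ q) / np.sqrt(64)
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ v
assert np.allclose(attention_decode_splitk(q, k, v), ref)
```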
