feat: Add BF16 float-accumulator TensorOp epilogue specialization #3287

Open
lyuxy-infra wants to merge 1 commit into NVIDIA:main from lyuxy-infra:bf16-float-epilogue-specialization
@lyuxy-infra
Summary

This PR adds a DefaultIteratorsTensorOp<bfloat16_t, float, 8, ...> specialization for TensorOp epilogues.

CUTLASS already has a half_t, float, 8 specialization that uses TileIteratorTensorOpMixed and SharedLoadIteratorMixed to optimize mixed-precision epilogues with FP32 accumulators and 16-bit outputs. BF16 output with FP32 accumulators has the same 32-bit accumulator / 16-bit output / 8-elements-per-access structure, but currently falls back to the generic iterator path.

This patch mirrors the existing half_t, float, 8 specialization for bfloat16_t, float, 8.

Motivation

For mixed-precision TensorOp epilogues where accumulators are FP32 and outputs are 16-bit, the mixed iterator path uses a shared-memory layout designed to avoid bank conflicts. Because BF16 and FP16 are both 16-bit types, BF16 output should be able to use the same iterator structure as FP16 output.

Changes

  • Add DefaultIteratorsTensorOp<bfloat16_t, float, 8, ...>
  • Use TileIteratorTensorOpMixed<..., float, 32, 16, 8, 8>
  • Use SharedLoadIteratorMixed<..., float, 32, 16, 8, 8>
  • Set kFragmentsPerIteration = 2, matching the existing half_t, float, 8 specialization
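
The selection mechanism behind the changes listed above can be sketched with stand-in types. This is not CUTLASS code and not the PR's diff: the `sketch` namespace, the tag structs, the template parameter names, and the generic path's `kFragmentsPerIteration = 1` are all assumptions made for illustration, and the shape and thread-map parameters elided as `...` in the description are left out entirely. Only the integer arguments `32, 16, 8, 8` and `kFragmentsPerIteration = 2` come from the description.

```cpp
#include <cstdint>
#include <type_traits>

namespace sketch {

// Stand-in 16-bit element types (hypothetical, not cutlass::half_t / cutlass::bfloat16_t).
struct half_t     { std::uint16_t storage; };
struct bfloat16_t { std::uint16_t storage; };

// Stand-in tags for the generic iterator path (hypothetical names).
struct GenericWarpTileIterator {};
struct GenericSharedLoadIterator {};

// Stand-in tags for the mixed iterator path. The parameter names are guesses;
// the values used below (float, 32, 16, 8, 8) are the ones quoted in the PR.
template <typename Accumulator, int AccumulatorBits, int OutputBits,
          int OutputElementCount, int ContiguousLanes>
struct MixedWarpTileIterator {};

template <typename Accumulator, int AccumulatorBits, int OutputBits,
          int OutputElementCount, int ContiguousLanes>
struct MixedSharedLoadIterator {};

// Primary template: the generic fallback path.
// The value 1 here is an assumption of this sketch.
template <typename ElementOutput, typename ElementAccumulator, int ElementsPerAccess>
struct DefaultIteratorsTensorOp {
  using WarpTileIterator   = GenericWarpTileIterator;
  using SharedLoadIterator = GenericSharedLoadIterator;
  static constexpr int kFragmentsPerIteration = 1;
};

// Existing case described in the PR: half_t output, float accumulator, 8 per access.
template <>
struct DefaultIteratorsTensorOp<half_t, float, 8> {
  using WarpTileIterator   = MixedWarpTileIterator<float, 32, 16, 8, 8>;
  using SharedLoadIterator = MixedSharedLoadIterator<float, 32, 16, 8, 8>;
  static constexpr int kFragmentsPerIteration = 2;
};

// Case added by the PR: same body, keyed on bfloat16_t instead of half_t.
template <>
struct DefaultIteratorsTensorOp<bfloat16_t, float, 8> {
  using WarpTileIterator   = MixedWarpTileIterator<float, 32, 16, 8, 8>;
  using SharedLoadIterator = MixedSharedLoadIterator<float, 32, 16, 8, 8>;
  static constexpr int kFragmentsPerIteration = 2;
};

// Both output types are 16 bits wide, so 8 elements form one 128-bit access.
static_assert(sizeof(half_t) * 8 == 16 && sizeof(bfloat16_t) * 8 == 16,
              "both stand-in output types are 16-bit");
static_assert(8 * 16 == 128, "8 x 16-bit elements per access = 128 bits");

}  // namespace sketch
```

In this sketch, `sketch::DefaultIteratorsTensorOp<sketch::bfloat16_t, float, 8>` resolves to the same iterator tags as the `half_t` case, while any other access width (for example 4) still resolves to the primary template. Without the second specialization, the `bfloat16_t` case would also land on the primary template, which is the fallback behavior the PR description refers to.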

Notes

This does not change the output operator, numerical conversion, or GEMM mainloop. It only changes the epilogue shared-memory staging iterator selection for this BF16 mixed-precision case.

lyuxy-infra changed the title from "Add BF16 float-accumulator TensorOp epilogue specialization" to "feat: Add BF16 float-accumulator TensorOp epilogue specialization" on Jun 1, 2026