Add BFloat16 support analysis and plan for CPU and CUDA execution providers #28769

Draft
Copilot wants to merge 3 commits into main from copilot/analyze-bfloat16-support

Contributor

Copilot AI commented Jun 3, 2026

Description

Adds docs/BFloat16_Support.md, a documentation-only analysis of current bfloat16 (BF16) kernel coverage across the CPU and CUDA execution providers, with a phased plan to close the gaps. Coverage numbers are derived from the registered kernels in docs/OperatorKernels.md (the auto-generated kernel-registry snapshot) and cross-checked against the kernel sources.
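The counting step behind these coverage numbers could look roughly like the sketch below. It is only an illustration, not the script used for this PR: the four-column row layout is an assumption modeled on docs/OperatorKernels.md, the `SAMPLE` rows are made up for the example, and `bf16_coverage` is a hypothetical helper name.

```python
# Hypothetical sample rows; the column layout (name, signature, opset, types)
# is assumed to match docs/OperatorKernels.md.
SAMPLE = """\
|Op Name|Parameters|OpSet Version|Types Supported|
|---|---|---|---|
|Add|*in* A:**T**<br> *out* C:**T**|14+|**T** = tensor(double), tensor(float)|
|||13|**T** = tensor(double), tensor(float)|
|Cast|*in* input:**T1**<br> *out* output:**T2**|13+|**T1** = tensor(bfloat16), tensor(float)|
|Reshape|*in* data:**T**<br> *out* reshaped:**T**|14+|**T** = tensor(bfloat16), tensor(float)|
"""


def bf16_coverage(table_text):
    """Return (sorted ops with a tensor(bfloat16) registration, total op count)."""
    ops = {}          # op name -> True if any opset row lists tensor(bfloat16)
    current = None    # rows with an empty name cell continue the previous op
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = line[1:-1].split("|")
        name = cells[0].strip()
        if name == "Op Name" or name.startswith("-"):
            continue  # header or separator row
        if name:
            current = name
        if current is None:
            continue
        has_bf16 = "tensor(bfloat16)" in cells[-1]
        ops[current] = ops.get(current, False) or has_bf16
    covered = sorted(op for op, ok in ops.items() if ok)
    return covered, len(ops)


covered, total = bf16_coverage(SAMPLE)
print(f"{len(covered)}/{total} ops have a BF16 registration: {covered}")
```

An op counts as covered here if any of its opset versions lists tensor(bfloat16); a stricter rule (for example, latest opset only) would give lower numbers.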

  • Coverage inventory — per-EP counts of ops with a tensor(bfloat16) registration, split by ai.onnx vs com.microsoft:
    • CPU: ~23% of ai.onnx ops (45/197), but all of them are data-movement ops (Cast, Reshape, Gather, Concat, Slice, Transpose, control flow). There are no BF16 compute kernels: no MatMul, Add, Softmax, LayerNorm, or activations. Element-wise kernel sources even carry explicit "// Supposed to add BFloat16 but we are not supporting now" markers.
    • CUDA: ~56% of ai.onnx ops (83/149) plus 25 contrib fusions (Attention, MHA, GQA, SkipLayerNorm, MatMulNBits, MoE…); runs most BF16 transformer/CNN graphs end-to-end, with a long tail of gaps (BatchNorm, pooling, several activations/reductions, some fusions).
  • Infrastructure notes — existing BFloat16 type and Cast support, FP32-accumulation requirement, MLAS SBGemm being an ARM64 FP32 fast-math path rather than a native BF16 datatype kernel, and OpTester BF16 readiness.
  • Phased plan — Phase 0 tooling/coverage tracking → Phase 1 CPU transformer-core kernels (highest impact) → Phase 2 CPU vision/elementwise → Phase 3 CPU native SIMD (AVX512-BF16/AMX/NEON) → Phase 4 CUDA long-tail → Phase 5 validation/docs, with acceptance criteria.
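The FP32-accumulation requirement in the infrastructure notes can be illustrated with a small sketch. It assumes the usual BF16 layout (the upper 16 bits of an IEEE FP32 value) with round-to-nearest-even and no special NaN handling; the function names are illustrative and are not ONNX Runtime APIs, and Python's double-precision float stands in for the wider accumulator.

```python
import struct


def float_to_bf16_bits(x):
    """Round an FP32 value to BF16 (nearest-even); returns the 16-bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Adding 0x7FFF plus the lowest kept bit rounds ties to even.
    # NaN inputs are not handled specially in this sketch.
    rounded = bits + 0x7FFF + ((bits >> 16) & 1)
    return (rounded >> 16) & 0xFFFF


def bf16_bits_to_float(b):
    """Widen a BF16 bit pattern back to a Python float (exact)."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]


def round_bf16(x):
    return bf16_bits_to_float(float_to_bf16_bits(x))


def sum_bf16_accumulator(values):
    """Accumulate in BF16: every partial sum is rounded back to BF16."""
    acc = 0.0
    for v in values:
        acc = round_bf16(acc + round_bf16(v))
    return acc


def sum_wide_accumulator(values):
    """Accumulate in a wider type and round to BF16 once at the end."""
    acc = 0.0
    for v in values:
        acc += round_bf16(v)
    return round_bf16(acc)


ones = [1.0] * 1000
print(sum_bf16_accumulator(ones))  # stalls at 256.0
print(sum_wide_accumulator(ones))  # 1000.0
```

With only 8 significand bits, the BF16 accumulator stops growing once the running sum reaches 256, because 256 + 1 rounds back to 256. Reductions such as MatMul, Softmax, and LayerNorm hit the same effect, so a wider accumulator with a single final rounding is the safer default.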

Motivation and Context

Most modern models (LLMs, diffusion models, and other transformers) are published in BF16. Today, running them on the CPU EP forces an FP32 fallback through inserted Cast nodes, which roughly doubles the memory footprint of the affected tensors relative to BF16, and the CUDA EP still has scattered gaps. This document establishes a measurable, prioritized roadmap toward comprehensive BF16 support.

Copilot AI changed the title from "[WIP] Analyze bfloat16 support status in CPU and CUDA EP" to "Add BFloat16 support analysis and plan for CPU and CUDA execution providers" Jun 3, 2026
Copilot AI requested a review from justinchuby June 3, 2026 18:39

Development

Successfully merging this pull request may close these issues.

Analyze bfloat16 support status in cpu and cuda EP

2 participants