Add BFloat16 support analysis and plan for CPU and CUDA execution providers by Copilot · Pull Request #28769 · microsoft/onnxruntime

Copilot · 2026-06-03T18:32:16Z

Description

Adds docs/BFloat16_Support.md, a documentation-only analysis of current bfloat16 (BF16) kernel coverage across the CPU and CUDA execution providers, with a phased plan to close the gaps. Coverage numbers are derived from the registered kernels in docs/OperatorKernels.md (the auto-generated kernel-registry snapshot) and cross-checked against the kernel sources.
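The counting described above could be sketched roughly as follows. This is a minimal illustration, not the script used for the PR: the table excerpt, its column layout, and the `bf16_coverage` helper are all hypothetical, and the real docs/OperatorKernels.md may need extra handling (per-EP and per-domain sections, richer cell markup).

```python
# Hypothetical, simplified excerpt in the style of a kernel-registry markdown table.
# The real docs/OperatorKernels.md may use different columns and section markers.
SAMPLE = """\
|Op Name|Parameters|OpSet Version|Types Supported|
|---|---|---|---|
|Cast|in input:T1, out output:T2|13+|T1 = tensor(bfloat16), tensor(float)|
|Cast|in input:T1, out output:T2|[6, 12]|T1 = tensor(float)|
|MatMul|in A:T, in B:T, out Y:T|13+|T = tensor(double), tensor(float)|
|Reshape|in data:T, out reshaped:T|14+|T = tensor(bfloat16), tensor(float)|
|Softmax|in input:T, out output:T|13+|T = tensor(float)|
"""


def bf16_coverage(table_text):
    """Return (ops_with_bf16, total_ops, ratio) for one markdown kernel table.

    An op counts as covered if ANY of its rows (any opset range) lists
    tensor(bfloat16); ops are de-duplicated by name.
    """
    all_ops, bf16_ops = set(), set()
    for line in table_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip malformed rows, the header row, and the |---| separator row.
        if len(cells) < 4 or cells[0] == "Op Name" or set(cells[0]) <= {"-"}:
            continue
        op_name, types_supported = cells[0], cells[-1]
        all_ops.add(op_name)
        if "tensor(bfloat16)" in types_supported:
            bf16_ops.add(op_name)
    total = len(all_ops)
    ratio = len(bf16_ops) / total if total else 0.0
    return len(bf16_ops), total, ratio


covered, total, ratio = bf16_coverage(SAMPLE)
print(f"{covered}/{total} ops ({ratio:.0%})")  # prints "2/4 ops (50%)"
```

Counting unique op names rather than rows keeps an op with several opset-versioned registrations from being counted more than once.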

Coverage inventory: per-EP counts of ops with a tensor(bfloat16) registration, split by ai.onnx vs com.microsoft:
- CPU: ~23% of ai.onnx ops (45/197), but all of them are data-movement ops (Cast, Reshape, Gather, Concat, Slice, Transpose, control-flow ops). There are no BF16 compute kernels: no MatMul, Add, Softmax, LayerNorm, or activations. The element-wise sources even carry explicit "// Supposed to add BFloat16 but we are not supporting now" markers.
- CUDA: ~56% of ai.onnx ops (83/149) plus 25 contrib fusions (Attention, MHA, GQA, SkipLayerNorm, MatMulNBits, MoE…). Most BF16 transformer/CNN graphs run end-to-end, with a long tail of gaps (BatchNorm, pooling, several activations/reductions, some fusions).
Infrastructure notes: the existing BFloat16 type and Cast support, the FP32-accumulation requirement, MLAS SBGemm being an ARM64 FP32 fast-math path rather than a native BF16 datatype kernel, and OpTester BF16 readiness.
Phased plan, with acceptance criteria: Phase 0 (tooling/coverage tracking) → Phase 1 (CPU transformer-core kernels, highest impact) → Phase 2 (CPU vision/elementwise) → Phase 3 (CPU native SIMD: AVX512-BF16/AMX/NEON) → Phase 4 (CUDA long tail) → Phase 5 (validation/docs).
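To illustrate why the infrastructure notes call out FP32 accumulation, here is a small pure-Python sketch. It is not onnxruntime code: `to_bf16`, `to_f32`, and the two sum functions are hypothetical helpers that emulate bfloat16 rounding (round-to-nearest-even, finite inputs only) on top of Python floats.

```python
import struct


def to_f32(x):
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round a finite float to the nearest bfloat16 value (ties to even).

    bfloat16 is the upper 16 bits of a float32, so we round away the lower
    16 bits. NaN/Inf and overflow are deliberately not handled in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even on dropped bits
    bits &= 0xFFFF0000                   # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def sum_bf16_accumulator(values):
    """Accumulate with a bfloat16 accumulator (rounded after every add)."""
    acc = 0.0
    for v in values:
        acc = to_bf16(acc + to_bf16(v))
    return acc


def sum_fp32_accumulator(values):
    """Accumulate bfloat16 inputs in float32, round to bfloat16 once at the end."""
    acc = 0.0
    for v in values:
        acc = to_f32(acc + to_bf16(v))
    return to_bf16(acc)


ones = [1.0] * 1000
print(sum_bf16_accumulator(ones))  # prints 256.0: 256 + 1 rounds back to 256
print(sum_fp32_accumulator(ones))  # prints 1000.0
```

With only 8 significant bits, a bfloat16 accumulator stops growing once the running sum reaches 256, while the float32 accumulator keeps the exact sum and rounds only the final result.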

Motivation and Context

Many modern models (LLMs, diffusion, transformers) are published in BF16. Today, running them on the CPU EP forces an FP32 fallback with inserted Cast nodes, which doubles the memory footprint of the affected tensors, and the CUDA EP still has scattered gaps. This document establishes a measurable, prioritized roadmap toward comprehensive BF16 support.

Initial plan

849dba8

Copilot AI assigned Copilot and justinchuby Jun 3, 2026

Copilot started work on behalf of justinchuby June 3, 2026 18:32 View session

Copilot AI linked an issue Jun 3, 2026 that may be closed by this pull request

Analyze bfloat16 support status in cpu and cuda EP #28768

Open

Copilot AI added 2 commits June 3, 2026 18:37

Add BFloat16 support analysis and plan doc for CPU/CUDA EPs

1db6097

Clarify pooling ops in BF16 plan Phase 2

401bb9e

Copilot AI changed the title ~~[WIP] Analyze bfloat16 support status in CPU and CUDA EP~~ Add BFloat16 support analysis and plan for CPU and CUDA execution providers Jun 3, 2026

Copilot finished work on behalf of justinchuby June 3, 2026 18:39

Copilot AI requested a review from justinchuby June 3, 2026 18:39

Copilot wants to merge 3 commits into main from copilot/analyze-bfloat16-support
