
TorchWave fused PyTorch nativert executor (#16878)

Open
oerling wants to merge 1 commit into facebookincubator:main from oerling:export-D95696931

Conversation


@oerling oerling commented Mar 22, 2026

Summary:

TorchWave is a GPU kernel fusion and execution framework that compiles nativert FX graphs into fused CUDA kernels. It analyzes the dataflow graph, groups operations into composite kernels, generates CUDA code, and executes the resulting kernels with efficient GPU resource management.

Core files:

  • Registry.h/.cpp, Builtins.cpp — Operation registry mapping nativert op names to metadata (elementwise traits, cost, code generation functions). Builtins registers standard aten ops (add, sub, mul, div, etc.).
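
    The registry idea can be sketched as follows. This is a hypothetical illustration, not the actual Registry.h interface: `OpMetadata`, `OpRegistry`, and `makeBuiltinRegistry` are assumed names; the point is a name-to-metadata map where each entry carries fusion traits and a code generation callback.

    ```cpp
    #include <cstdint>
    #include <functional>
    #include <string>
    #include <unordered_map>
    #include <utility>

    // Hypothetical metadata for one op: fusion trait, relative cost,
    // and a callback that emits the CUDA expression for this op.
    struct OpMetadata {
      bool elementwise;  // can this op join a fused elementwise tree?
      int32_t cost;      // rough relative cost for fusion decisions
      std::function<std::string(const std::string&, const std::string&)> codegen;
    };

    class OpRegistry {
     public:
      void registerOp(const std::string& name, OpMetadata meta) {
        ops_[name] = std::move(meta);
      }
      // Returns nullptr for unknown ops, which the compiler can treat
      // as "fall back to the stock nativert kernel".
      const OpMetadata* lookup(const std::string& name) const {
        auto it = ops_.find(name);
        return it == ops_.end() ? nullptr : &it->second;
      }

     private:
      std::unordered_map<std::string, OpMetadata> ops_;
    };

    // Registering builtins in the spirit of Builtins.cpp:
    inline OpRegistry makeBuiltinRegistry() {
      OpRegistry r;
      r.registerOp("aten.add", {true, 1, [](const std::string& a, const std::string& b) {
                      return "(" + a + " + " + b + ")";
                    }});
      r.registerOp("aten.mul", {true, 1, [](const std::string& a, const std::string& b) {
                      return "(" + a + " * " + b + ")";
                    }});
      return r;
    }
    ```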

  • ParallelExpr.h/.cpp — Analyzes the nativert graph to identify independent subgraphs (ProjectNodes) that can execute in parallel, partitioning the graph into sequential stages with internal parallelism.
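
    The staging described above amounts to level-order partitioning of a DAG. A minimal sketch (not the actual ParallelExpr code) using plain adjacency lists, where each stage holds mutually independent nodes:

    ```cpp
    #include <cstddef>
    #include <utility>
    #include <vector>

    // Partition a DAG into sequential stages; nodes within a stage
    // have no dependencies on each other and may run in parallel.
    // adj[i] lists the consumers of node i.
    std::vector<std::vector<int>> partitionIntoStages(
        const std::vector<std::vector<int>>& adj) {
      std::size_t n = adj.size();
      std::vector<int> indegree(n, 0);
      for (const auto& outs : adj) {
        for (int v : outs) ++indegree[v];
      }
      std::vector<std::vector<int>> stages;
      std::vector<int> ready;
      for (std::size_t i = 0; i < n; ++i) {
        if (indegree[i] == 0) ready.push_back(static_cast<int>(i));
      }
      while (!ready.empty()) {
        stages.push_back(ready);  // one sequential stage, parallel inside
        std::vector<int> next;
        for (int u : ready) {
          for (int v : adj[u]) {
            if (--indegree[v] == 0) next.push_back(v);
          }
        }
        ready = std::move(next);
      }
      return stages;
    }
    ```

    For a diamond graph (node 0 feeds 1 and 2, which both feed 3), this yields three stages with 1 and 2 sharing the middle stage.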

  • Compile.h/.cpp — The compilation pipeline. Extracts subgraphs from ProjectNodes, matches isomorphic subgraphs to reuse compiled kernels, generates fused CUDA code for elementwise expression trees, and assembles CompositeKernels from multiple KernelOperations.
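
    One common way to match isomorphic subgraphs for kernel reuse is a canonical key that ignores value identity but keeps operator structure; two expression trees with the same key can share one compiled kernel. A hedged sketch under that assumption (the real matcher likely differs):

    ```cpp
    #include <memory>
    #include <string>

    // Toy expression tree: leaves stand for inputs, inner nodes for ops.
    struct Expr {
      std::string op;                  // e.g. "add", "mul", or "leaf"
      std::shared_ptr<Expr> lhs, rhs;  // null for leaves
    };

    // Canonical key: leaves collapse to "v" so trees that differ only
    // in which values they read map to the same cache entry.
    std::string canonicalKey(const Expr& e) {
      if (!e.lhs && !e.rhs) return "v";
      std::string key = "(" + e.op;
      if (e.lhs) key += " " + canonicalKey(*e.lhs);
      if (e.rhs) key += " " + canonicalKey(*e.rhs);
      return key + ")";
    }
    ```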

  • CompiledOp.h — Data structures for the compiled representation: KernelOperation (a single fused op with CUDA code), OpInvocation (runtime binding of a KernelOperation to actual values), CompositeKernel (groups KernelOperations into one compiled CUDA kernel), CompositeInvocation (runtime invocation with grid building and param filling), CompiledNode (parallel/sequential kernel groups), and WaveGraph (top-level compiled graph).

  • Executor.h/.cpp — The executor that ties compilation to execution. WaveGraphExecutor extends nativert's GraphExecutorBase. Contains process-wide GPU resource initialization (arenas, stream/event pools), ExecutionState management, grid construction (makeGrid), and the full execution path: output allocation, BlockInfo grid building, pinned buffer param filling, H2D transfer, and kernel launch.
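
    Grid construction in the spirit of makeGrid can be sketched on the host side: each fused op contributes ceil(numElements / elementsPerBlock) thread blocks, and every block records which op it serves and which element range it covers. Names here (`BlockSlot`, `makeGridSketch`) are illustrative, not the actual API.

    ```cpp
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // One entry per launched thread block.
    struct BlockSlot {
      int32_t opCode;      // which fused operation this block executes
      int64_t begin;       // first element index handled by this block
      int64_t numElements; // elements handled by this block
    };

    std::vector<BlockSlot> makeGridSketch(
        const std::vector<int64_t>& opSizes, int64_t elementsPerBlock) {
      std::vector<BlockSlot> grid;
      for (std::size_t op = 0; op < opSizes.size(); ++op) {
        for (int64_t begin = 0; begin < opSizes[op]; begin += elementsPerBlock) {
          int64_t count = std::min(elementsPerBlock, opSizes[op] - begin);
          grid.push_back({static_cast<int32_t>(op), begin, count});
        }
      }
      return grid;
    }
    ```

    The resulting array is what would be filled into a pinned buffer and copied to the device before launch, so each block can look up its own dispatch entry.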

  • KernelParams.h — Shared host/device structs: Tensor (storage pointer + dims/strides for up to 3D), BlockInfo (per-thread-block dispatch info with op code, element count, param pointer), TorchWaveParams (kernel entry point parameter).
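
    Host/device-shared headers like this normally hold plain-layout structs that are safe to memcpy to the GPU. A sketch of what the tensor descriptor could look like; field names and the 3D cap are taken from the description, the rest is assumption:

    ```cpp
    #include <cstdint>

    constexpr int kMaxRank = 3;  // "up to 3D" per the summary

    // POD tensor descriptor shared between host and CUDA code.
    struct TensorDesc {
      void* data;                // device storage pointer
      int32_t rank;              // number of used dims, <= kMaxRank
      int64_t dims[kMaxRank];    // sizes, innermost last
      int64_t strides[kMaxRank]; // element strides matching dims
    };

    // Fill row-major (contiguous) strides for the given dims.
    inline TensorDesc makeContiguous(void* data, int32_t rank, const int64_t* dims) {
      TensorDesc t{data, rank, {}, {}};
      int64_t stride = 1;
      for (int i = rank - 1; i >= 0; --i) {
        t.dims[i] = dims[i];
        t.strides[i] = stride;
        stride *= dims[i];
      }
      return t;
    }
    ```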

  • Core.cuh, Elementwise.cuh — CUDA device code. Core.cuh provides the kernel entry macro (ENTRY) and BlockInfo loading. Elementwise.cuh implements the fused elementwise kernel body with fast-path detection for contiguous tensors and broadcast support.
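
    The fast-path test such a kernel body can make is: if every operand is dense row-major, indexing collapses to one flat loop; otherwise fall back to stride-based addressing (which is also what broadcast inputs with zero strides need). A host-side sketch of that check, with size-1 dims exempt from the stride requirement:

    ```cpp
    #include <cstdint>

    // True if (dims, strides) describe a dense row-major layout.
    bool isContiguousSketch(const int64_t* dims, const int64_t* strides, int rank) {
      int64_t expected = 1;
      for (int i = rank - 1; i >= 0; --i) {
        // A size-1 dim contributes no addressing, so its stride is free.
        if (dims[i] != 1 && strides[i] != expected) return false;
        expected *= dims[i];
      }
      return true;
    }
    ```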

  • Utils.h/.cpp — Thread-safe Pool template for reusing GPU Streams and Events.
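
    A thread-safe pool for stream/event reuse could look like the sketch below; the actual Pool interface in Utils.h may differ. Objects are created on demand and parked for reuse instead of being destroyed, which matters because CUDA stream and event creation is comparatively expensive.

    ```cpp
    #include <cstddef>
    #include <memory>
    #include <mutex>
    #include <utility>
    #include <vector>

    template <typename T>
    class Pool {
     public:
      // Hand out a pooled object, or default-construct one if empty.
      std::unique_ptr<T> acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (free_.empty()) return std::make_unique<T>();
        auto obj = std::move(free_.back());
        free_.pop_back();
        return obj;
      }
      // Park an object for later reuse.
      void release(std::unique_ptr<T> obj) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_.push_back(std::move(obj));
      }
      std::size_t freeCount() {
        std::lock_guard<std::mutex> lock(mutex_);
        return free_.size();
      }

     private:
      std::mutex mutex_;
      std::vector<std::unique_ptr<T>> free_;
    };
    ```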

  • Pt2Load.h/.cpp — Loading .pt2 archives: deserializes the nativert graph, tensor metadata, and weight paths from PyTorchStreamReader.

  • GraphView.h/.cpp — Diagnostic printing of the nativert graph and compiled WaveGraph structure.

  • NativertSerialization.cpp — Deserialization of nativert graph IR from JSON format within .pt2 archives.

  • Execute.h/.cpp — Standalone entry point for loading and running a .pt2 model through the WaveGraph executor.

  • Project.h/.cpp — ProjectNode representation for the parallelism analysis stage.

  • tests/ExecutorTest.cpp — End-to-end test: loads a .pt2 model, runs it through both nativert SerialGraphExecutor and WaveGraphExecutor, verifies outputs match eager-mode expectations.

  • tests/GraphTool.cpp — CLI tool for inspecting .pt2 graph structure and compiled WaveGraph.

  • tests/element_test.py, element_test_run_pt2.py — Python test model (elementwise arithmetic on int64 tensors) and script to export it as a .pt2 archive for the C++ tests.

Differential Revision: D95696931

@netlify

netlify bot commented Mar 22, 2026

Deploy Preview for meta-velox canceled.

Latest commit: d20d472
Latest deploy log: https://app.netlify.com/projects/meta-velox/deploys/69c0941874f048000802fc50

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 22, 2026
@meta-codesync

meta-codesync bot commented Mar 22, 2026

@oerling has exported this pull request. If you are a Meta employee, you can view the originating Diff in D95696931.

@meta-codesync meta-codesync bot changed the title TorchWave fused PyTorch nativert executor TorchWave fused PyTorch nativert executor (#16878) Mar 23, 2026
@oerling oerling requested a review from majetideepak as a code owner March 23, 2026 01:15

Labels

CLA Signed (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed), fb-exported, meta-exported
