
TorchWave fused PyTorch nativert executor (#16878)

Open
oerling wants to merge 1 commit into facebookincubator:main from oerling:export-D95696931

Conversation


@oerling oerling commented Mar 22, 2026

Summary:

TorchWave is a GPU kernel fusion and execution framework that compiles nativert FX graphs into fused CUDA kernels. It analyzes the dataflow graph, groups operations into composite kernels, generates CUDA code, and executes the resulting kernels with efficient GPU resource management.

Core files:

  • Registry.h/.cpp, Builtins.cpp — Operation registry mapping nativert op names to metadata (elementwise traits, cost, code generation functions). Builtins registers standard aten ops (add, sub, mul, div, etc.).
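
    The registry idea can be sketched as follows. This is a hypothetical illustration, not the actual Registry.h interface: `OpMetadata`, `OpRegistry`, and `makeBuiltinRegistry` are assumed names; the point is a name-to-metadata map where each entry carries fusion traits and a code generation callback.

    ```cpp
    #include <cstdint>
    #include <functional>
    #include <string>
    #include <unordered_map>
    #include <utility>

    // Hypothetical metadata for one op: fusion trait, relative cost,
    // and a callback that emits the CUDA expression for this op.
    struct OpMetadata {
      bool elementwise;  // can this op join a fused elementwise tree?
      int32_t cost;      // rough relative cost for fusion decisions
      std::function<std::string(const std::string&, const std::string&)> codegen;
    };

    class OpRegistry {
     public:
      void registerOp(const std::string& name, OpMetadata meta) {
        ops_[name] = std::move(meta);
      }
      // Returns nullptr for unknown ops, which the compiler can treat
      // as "fall back to the stock nativert kernel".
      const OpMetadata* lookup(const std::string& name) const {
        auto it = ops_.find(name);
        return it == ops_.end() ? nullptr : &it->second;
      }

     private:
      std::unordered_map<std::string, OpMetadata> ops_;
    };

    // Registering builtins in the spirit of Builtins.cpp:
    inline OpRegistry makeBuiltinRegistry() {
      OpRegistry r;
      r.registerOp("aten.add", {true, 1, [](const std::string& a, const std::string& b) {
                      return "(" + a + " + " + b + ")";
                    }});
      r.registerOp("aten.mul", {true, 1, [](const std::string& a, const std::string& b) {
                      return "(" + a + " * " + b + ")";
                    }});
      return r;
    }
    ```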

  • ParallelExpr.h/.cpp — Analyzes the nativert graph to identify independent subgraphs (ProjectNodes) that can execute in parallel, partitioning the graph into sequential stages with internal parallelism.
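
    The staging described above amounts to level-order partitioning of a DAG. A minimal sketch (not the actual ParallelExpr code) using plain adjacency lists, where each stage holds mutually independent nodes:

    ```cpp
    #include <cstddef>
    #include <utility>
    #include <vector>

    // Partition a DAG into sequential stages; nodes within a stage
    // have no dependencies on each other and may run in parallel.
    // adj[i] lists the consumers of node i.
    std::vector<std::vector<int>> partitionIntoStages(
        const std::vector<std::vector<int>>& adj) {
      std::size_t n = adj.size();
      std::vector<int> indegree(n, 0);
      for (const auto& outs : adj) {
        for (int v : outs) ++indegree[v];
      }
      std::vector<std::vector<int>> stages;
      std::vector<int> ready;
      for (std::size_t i = 0; i < n; ++i) {
        if (indegree[i] == 0) ready.push_back(static_cast<int>(i));
      }
      while (!ready.empty()) {
        stages.push_back(ready);  // one sequential stage, parallel inside
        std::vector<int> next;
        for (int u : ready) {
          for (int v : adj[u]) {
            if (--indegree[v] == 0) next.push_back(v);
          }
        }
        ready = std::move(next);
      }
      return stages;
    }
    ```

    For a diamond graph (node 0 feeds 1 and 2, which both feed 3), this yields three stages with 1 and 2 sharing the middle stage.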

  • Compile.h/.cpp — The compilation pipeline. Extracts subgraphs from ProjectNodes, matches isomorphic subgraphs to reuse compiled kernels, generates fused CUDA code for elementwise expression trees, and assembles CompositeKernels from multiple KernelOperations.
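
    One common way to match isomorphic subgraphs for kernel reuse is a canonical key that ignores value identity but keeps operator structure; two expression trees with the same key can share one compiled kernel. A hedged sketch under that assumption (the real matcher likely differs):

    ```cpp
    #include <memory>
    #include <string>

    // Toy expression tree: leaves stand for inputs, inner nodes for ops.
    struct Expr {
      std::string op;                  // e.g. "add", "mul", or "leaf"
      std::shared_ptr<Expr> lhs, rhs;  // null for leaves
    };

    // Canonical key: leaves collapse to "v" so trees that differ only
    // in which values they read map to the same cache entry.
    std::string canonicalKey(const Expr& e) {
      if (!e.lhs && !e.rhs) return "v";
      std::string key = "(" + e.op;
      if (e.lhs) key += " " + canonicalKey(*e.lhs);
      if (e.rhs) key += " " + canonicalKey(*e.rhs);
      return key + ")";
    }
    ```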

  • CompiledOp.h — Data structures for the compiled representation: KernelOperation (a single fused op with CUDA code), OpInvocation (runtime binding of a KernelOperation to actual values), CompositeKernel (groups KernelOperations into one compiled CUDA kernel), CompositeInvocation (runtime invocation with grid building and param filling), CompiledNode (parallel/sequential kernel groups), and WaveGraph (top-level compiled graph).

  • Executor.h/.cpp — The executor that ties compilation to execution. WaveGraphExecutor extends nativert's GraphExecutorBase. Contains process-wide GPU resource initialization (arenas, stream/event pools), ExecutionState management, grid construction (makeGrid), and the full execution path: output allocation, BlockInfo grid building, pinned buffer param filling, H2D transfer, and kernel launch.
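
    Grid construction in the spirit of makeGrid can be sketched on the host side: each fused op contributes ceil(numElements / elementsPerBlock) thread blocks, and every block records which op it serves and which element range it covers. Names here (`BlockSlot`, `makeGridSketch`) are illustrative, not the actual API.

    ```cpp
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // One entry per launched thread block.
    struct BlockSlot {
      int32_t opCode;      // which fused operation this block executes
      int64_t begin;       // first element index handled by this block
      int64_t numElements; // elements handled by this block
    };

    std::vector<BlockSlot> makeGridSketch(
        const std::vector<int64_t>& opSizes, int64_t elementsPerBlock) {
      std::vector<BlockSlot> grid;
      for (std::size_t op = 0; op < opSizes.size(); ++op) {
        for (int64_t begin = 0; begin < opSizes[op]; begin += elementsPerBlock) {
          int64_t count = std::min(elementsPerBlock, opSizes[op] - begin);
          grid.push_back({static_cast<int32_t>(op), begin, count});
        }
      }
      return grid;
    }
    ```

    The resulting array is what would be filled into a pinned buffer and copied to the device before launch, so each block can look up its own dispatch entry.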

  • KernelParams.h — Shared host/device structs: Tensor (storage pointer + dims/strides for up to 3D), BlockInfo (per-thread-block dispatch info with op code, element count, param pointer), TorchWaveParams (kernel entry point parameter).
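
    Host/device-shared headers like this normally hold plain-layout structs that are safe to memcpy to the GPU. A sketch of what the tensor descriptor could look like; field names and the 3D cap are taken from the description, the rest is assumption:

    ```cpp
    #include <cstdint>

    constexpr int kMaxRank = 3;  // "up to 3D" per the summary

    // POD tensor descriptor shared between host and CUDA code.
    struct TensorDesc {
      void* data;                // device storage pointer
      int32_t rank;              // number of used dims, <= kMaxRank
      int64_t dims[kMaxRank];    // sizes, innermost last
      int64_t strides[kMaxRank]; // element strides matching dims
    };

    // Fill row-major (contiguous) strides for the given dims.
    inline TensorDesc makeContiguous(void* data, int32_t rank, const int64_t* dims) {
      TensorDesc t{data, rank, {}, {}};
      int64_t stride = 1;
      for (int i = rank - 1; i >= 0; --i) {
        t.dims[i] = dims[i];
        t.strides[i] = stride;
        stride *= dims[i];
      }
      return t;
    }
    ```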

  • Core.cuh, Elementwise.cuh — CUDA device code. Core.cuh provides the kernel entry macro (ENTRY) and BlockInfo loading. Elementwise.cuh implements the fused elementwise kernel body with fast-path detection for contiguous tensors and broadcast support.
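
    The fast-path test such a kernel body can make is: if every operand is dense row-major, indexing collapses to one flat loop; otherwise fall back to stride-based addressing (which is also what broadcast inputs with zero strides need). A host-side sketch of that check, with size-1 dims exempt from the stride requirement:

    ```cpp
    #include <cstdint>

    // True if (dims, strides) describe a dense row-major layout.
    bool isContiguousSketch(const int64_t* dims, const int64_t* strides, int rank) {
      int64_t expected = 1;
      for (int i = rank - 1; i >= 0; --i) {
        // A size-1 dim contributes no addressing, so its stride is free.
        if (dims[i] != 1 && strides[i] != expected) return false;
        expected *= dims[i];
      }
      return true;
    }
    ```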

  • Utils.h/.cpp — Thread-safe Pool template for reusing GPU Streams and Events.
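
    A thread-safe pool for stream/event reuse could look like the sketch below; the actual Pool interface in Utils.h may differ. Objects are created on demand and parked for reuse instead of being destroyed, which matters because CUDA stream and event creation is comparatively expensive.

    ```cpp
    #include <cstddef>
    #include <memory>
    #include <mutex>
    #include <utility>
    #include <vector>

    template <typename T>
    class Pool {
     public:
      // Hand out a pooled object, or default-construct one if empty.
      std::unique_ptr<T> acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (free_.empty()) return std::make_unique<T>();
        auto obj = std::move(free_.back());
        free_.pop_back();
        return obj;
      }
      // Park an object for later reuse.
      void release(std::unique_ptr<T> obj) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_.push_back(std::move(obj));
      }
      std::size_t freeCount() {
        std::lock_guard<std::mutex> lock(mutex_);
        return free_.size();
      }

     private:
      std::mutex mutex_;
      std::vector<std::unique_ptr<T>> free_;
    };
    ```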

  • Pt2Load.h/.cpp — Loading .pt2 archives: deserializes the nativert graph, tensor metadata, and weight paths from PyTorchStreamReader.

  • GraphView.h/.cpp — Diagnostic printing of the nativert graph and compiled WaveGraph structure.

  • NativertSerialization.cpp — Deserialization of nativert graph IR from JSON format within .pt2 archives.

  • Execute.h/.cpp — Standalone entry point for loading and running a .pt2 model through the WaveGraph executor.

  • Project.h/.cpp — ProjectNode representation for the parallelism analysis stage.

  • tests/ExecutorTest.cpp — End-to-end test: loads a .pt2 model, runs it through both nativert SerialGraphExecutor and WaveGraphExecutor, verifies outputs match eager-mode expectations.

  • tests/GraphTool.cpp — CLI tool for inspecting .pt2 graph structure and compiled WaveGraph.

  • tests/element_test.py, element_test_run_pt2.py — Python test model (elementwise arithmetic on int64 tensors) and script to export it as a .pt2 archive for the C++ tests.

Differential Revision: D95696931

@netlify

netlify bot commented Mar 22, 2026

Deploy Preview for meta-velox canceled.

Latest commit: d20d472
Latest deploy log: https://app.netlify.com/projects/meta-velox/deploys/69c0941874f048000802fc50

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 22, 2026
@meta-codesync

meta-codesync bot commented Mar 22, 2026

@oerling has exported this pull request. If you are a Meta employee, you can view the originating Diff in D95696931.

@meta-codesync meta-codesync bot changed the title TorchWave fused PyTorch nativert executor TorchWave fused PyTorch nativert executor (#16878) Mar 23, 2026
@oerling oerling requested a review from majetideepak as a code owner March 23, 2026 01:15

Labels

CLA Signed (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed), fb-exported, meta-exported
