4 | 4 | Torch Export with Cudagraphs
5 | 5 | ======================================================
6 | 6 |
7 |    | -This interactive script is intended as an overview of the process by which the Torch-TensorRT Cudagraphs integration can be used in the `ir="dynamo"` path. The functionality works similarly in the `torch.compile` path as well.
| 7 | +CUDA Graphs allow multiple GPU operations to be launched through a single CPU operation, reducing launch overheads and improving GPU utilization. Torch-TensorRT provides a simple interface to enable CUDA graphs. This feature allows users to easily leverage the performance benefits of CUDA graphs without managing the complexities of capture and replay manually. |
| 8 | +
| 9 | +.. image:: /tutorials/images/cuda_graphs.png |
| 10 | +
| 11 | +This interactive script is intended as an overview of the process by which the Torch-TensorRT Cudagraphs integration can be used in the `ir="dynamo"` path. The functionality works similarly in the |
| 12 | +`torch.compile` path as well. |
8 | 13 | """
9 | 14 |
10 | 15 | # %%
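As a point of reference for the docstring above, a minimal sketch of enabling CUDA Graphs on a module compiled through the `ir="dynamo"` path might look like the following. The model, input shape, and the context-manager form of `torch_tensorrt.runtime.enable_cudagraphs` (taking the compiled module and yielding a wrapped module) are assumptions for illustration, not part of this diff:

import torch
import torch_tensorrt

# Illustrative model and input; the tutorial defines its own sample model.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 8),
).eval().cuda()
inputs = [torch.randn(8, 64).cuda()]

# Compile through the dynamo (torch.export) path.
trt_model = torch_tensorrt.compile(model, ir="dynamo", inputs=inputs)

# Kernel launches inside the context are captured once and replayed on later calls.
with torch_tensorrt.runtime.enable_cudagraphs(trt_model) as cudagraphs_module:
    out = cudagraphs_module(*inputs)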
70 | 75 |
71 | 76 | # %%
72 | 77 | # Cuda graphs with module that contains graph breaks
73 |    | -# ----------------------------------
| 78 | +# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
74 | 79 | #
75 | 80 | # When CUDA Graphs are applied to a TensorRT model that contains graph breaks, each break introduces additional
76 | 81 | # overhead. This occurs because graph breaks prevent the entire model from being executed as a single, continuous
77 | 82 | # optimized unit. As a result, some of the performance benefits typically provided by CUDA Graphs, such as reduced
78 | 83 | # kernel launch overhead and improved execution efficiency, may be diminished.
| 84 | +# |
79 | 85 | # Using a wrapped runtime module with CUDA Graphs allows you to encapsulate sequences of operations into graphs
80 |    | -# that can be executed efficiently, even in the presence of graph breaks.
81 |    | -# If TensorRT module has graph breaks, CUDA Graph context manager returns a wrapped_module. This module captures entire
82 |    | -# execution graph, enabling efficient replay during subsequent inferences by reducing kernel launch overheads
83 |    | -# and improving performance. Note that initializing with the wrapper module involves a warm-up phase where the
| 86 | +# that can be executed efficiently, even in the presence of graph breaks. If the TensorRT module has graph breaks, the
| 87 | +# CUDA Graph context manager returns a wrapped_module. This module captures the entire execution graph, enabling
| 88 | +# efficient replay during subsequent inferences by reducing kernel launch overheads and improving performance.
| 89 | +# |
| 90 | +# Note that initializing with the wrapper module involves a warm-up phase where the |
84 | 91 | # module is executed several times. This warm-up ensures that memory allocations and initializations are not
85 | 92 | # recorded in CUDA Graphs, which helps maintain consistent execution paths and optimize performance.
| 93 | +# |
| 94 | +# .. image:: /tutorials/images/cuda_graphs_breaks.png |
| 95 | +# :scale: 60 % |
| 96 | +# :align: left |
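Before the tutorial's own SampleModel below, here is a minimal sketch of the wrapped-module behavior described in the added comments. It assumes that passing `torch_executed_ops` to `torch_tensorrt.compile` forces a graph break and that `enable_cudagraphs` yields the wrapper module; the model, op choice, and `min_block_size` setting are illustrative only:

import torch
import torch_tensorrt

model = torch.nn.Sequential(
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 8),
).eval().cuda()
inputs = [torch.randn(8, 64).cuda()]

# Keeping one op in PyTorch splits the model into separate TensorRT segments,
# i.e. introduces a graph break (illustrative choice of op); min_block_size=1
# is assumed here so the small surrounding segments still convert to TensorRT.
trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=inputs,
    torch_executed_ops={"torch.ops.aten.relu.default"},
    min_block_size=1,
)

# The context manager returns a wrapper that, after a short warm-up, records the
# whole execution (TensorRT segments plus PyTorch fallbacks) and replays it as a
# single CUDA Graph on subsequent inferences.
with torch_tensorrt.runtime.enable_cudagraphs(trt_model) as cudagraphs_module:
    out = cudagraphs_module(*inputs)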
86 | 97 |
87 | 98 |
88 | 99 | class SampleModel(torch.nn.Module):