.. _resource_management:

Resource Management
===================

Overview
--------

Efficient control of CPU and GPU memory is essential for successful model compilation,
especially when working with large models such as LLMs or diffusion models.
Uncontrolled memory growth can cause compilation failures or process termination.
This guide describes the symptoms of excessive memory usage and provides methods
to reduce both CPU and GPU memory consumption.

Memory Usage Control
--------------------

CPU Memory
^^^^^^^^^^

By default, Torch-TensorRT may consume up to **5×** the model size in CPU memory.
This can exceed system limits when compiling large models.

**Common symptoms of high CPU memory usage:**

- Program freeze
- Process terminated by the operating system

**Ways to lower CPU memory usage:**

1. **Enable memory trimming**

   Set the following environment variable:

   .. code-block:: bash

      export TRIM_CPU_MEMORY=1

   This eliminates roughly **2×** the model size in redundant model copies,
   limiting total CPU memory usage to about **3×** the model size.

2. **Disable CPU offloading**

   In the compilation settings, set:

   .. code-block:: python

      offload_module_to_cpu = False
   This removes another **1×** model copy, reducing peak CPU memory
   usage to about **2×** the model size. Both CPU-side settings are
   combined in the sketch after this list.
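
A minimal end-to-end sketch combining both CPU-side settings. The toy
``nn.Sequential`` model and the input shapes are placeholders, and the call
assumes a Torch-TensorRT build where ``offload_module_to_cpu`` is accepted as
a keyword argument by ``torch_tensorrt.compile``:

.. code-block:: python

   import os

   # Set before compilation so the trimming behavior is picked up.
   os.environ["TRIM_CPU_MEMORY"] = "1"

   import torch
   import torch_tensorrt

   # Placeholder model and inputs; substitute your real module.
   model = torch.nn.Sequential(torch.nn.Linear(64, 64)).eval().cuda()
   example_inputs = [torch.randn(8, 64, device="cuda")]

   # Skipping CPU offloading avoids one extra host-side copy of the weights.
   trt_module = torch_tensorrt.compile(
       model,
       inputs=example_inputs,
       offload_module_to_cpu=False,
   )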

GPU Memory
^^^^^^^^^^

By default, Torch-TensorRT may consume up to **2×** the model size in GPU memory.

**Common symptoms of high GPU memory usage:**

- CUDA out-of-memory errors
- TensorRT compilation errors

**Ways to lower GPU memory usage:**

1. **Enable offloading to CPU**

   In the compilation settings, set:

   .. code-block:: python

      offload_module_to_cpu = True

   This shifts one model copy from GPU to CPU memory.
   As a result, peak GPU memory usage decreases to about **1×**
   the model size, while CPU memory usage increases by roughly **1×**;
   see the sketch after this list.
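
A sketch of the same call with offloading enabled, plus a rough peak-memory
readout. The readout covers only allocations made through PyTorch's caching
allocator (TensorRT's own allocations are not tracked), and the tiny model is
again a placeholder:

.. code-block:: python

   import torch
   import torch_tensorrt

   model = torch.nn.Sequential(torch.nn.Linear(64, 64)).eval().cuda()
   example_inputs = [torch.randn(8, 64, device="cuda")]

   torch.cuda.reset_peak_memory_stats()

   # Keep only one copy of the module on the GPU during compilation;
   # the spare copy is held in CPU memory instead (GPU ~1x, CPU +1x).
   trt_module = torch_tensorrt.compile(
       model,
       inputs=example_inputs,
       offload_module_to_cpu=True,
   )

   peak_gib = torch.cuda.max_memory_allocated() / 2**30
   print(f"Peak GPU memory (PyTorch allocator): {peak_gib:.2f} GiB")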

Summary
-------

.. list-table::
   :header-rows: 1
   :widths: 32 40 28

   * - Setting
     - Effect
     - Approx. Memory Ratio
   * - Default
     - Baseline behavior
     - CPU: 5×, GPU: 2×
   * - ``export TRIM_CPU_MEMORY=1``
     - Reduces redundant CPU copies
     - CPU: ~3×
   * - ``offload_module_to_cpu=False``
     - Further reduces CPU copies
     - CPU: ~2×
   * - ``offload_module_to_cpu=True``
     - Reduces GPU usage, increases CPU usage
     - GPU: ~1×, CPU: +1×

Proper configuration ensures efficient resource use, stable compilation,
and predictable performance for large-scale models.