Feature
Add TensorRT weight streaming support to the Torch-TensorRT ExecuTorch delegate runtime. This lets a model whose weights do not all fit in GPU memory still run after it is exported to an ExecuTorch program (a .pte file).
Problem
Torch-TensorRT can already build a weight-streamable engine when you compile with enable_weight_streaming=True, and it can export that engine to ExecuTorch with torch_tensorrt.save(..., output_format="executorch").
The problem is that the ExecuTorch delegate never sets a weight streaming budget. After it deserializes the engine it goes straight to creating the execution context. With no budget set, TensorRT keeps all weights resident on the GPU, so a model that is larger than GPU memory fails to run through the ExecuTorch path, even when the engine was built for streaming.
The other Torch-TensorRT runtimes already set a budget right after deserializing the engine, which is why they can stream. The ExecuTorch delegate was missing this one step.
Solution
Set the weight streaming budget inside the ExecuTorch delegate init(), after the engine is deserialized and before the execution context is created. This matches the other runtimes and respects the TensorRT rule that the budget cannot change once a context exists.
By default the delegate applies TensorRT's automatic budget, which TensorRT computes at load time from the free memory on the actual GPU. So an engine built with enable_weight_streaming=True runs out of the box, and the budget adapts to whatever device the .pte is deployed on. This is the same thing the existing PyTorch runtimes do right after they deserialize an engine.
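The ordering described above can be sketched in a few lines of Python. This is only an illustration of the sequence (deserialize, set budget, create context), not the delegate's code: FakeEngine, init_delegate, and all of their attributes are hypothetical stand-ins, and the real work happens in C++ against the TensorRT engine.

```python
class FakeEngine:
    """Hypothetical stand-in for a deserialized TensorRT engine (not the real API)."""

    def __init__(self, streamable_bytes, automatic_budget):
        self.streamable_bytes = streamable_bytes  # 0 means "not built for streaming"
        self.automatic_budget = automatic_budget  # what TensorRT would pick at load
        self.budget = None
        self.context_created = False

    def set_budget(self, budget):
        # Models the rule that the budget cannot change once a context exists.
        if self.context_created:
            raise RuntimeError("budget cannot change after a context exists")
        self.budget = budget

    def create_context(self):
        self.context_created = True
        return object()


def init_delegate(engine, requested_budget=None):
    """The engine is already deserialized: set the budget first, then create the context."""
    if engine.streamable_bytes > 0:
        if requested_budget is None:
            budget = engine.automatic_budget  # default: automatic budget
        else:
            budget = requested_budget  # explicit budget chosen at export time
        engine.set_budget(budget)
    return engine.create_context()


engine = FakeEngine(streamable_bytes=12 * 1024**3, automatic_budget=6 * 1024**3)
context = init_delegate(engine)
print(engine.budget)
```

Calling engine.set_budget(...) after init_delegate returns raises in this sketch, which is the constraint that forces the budget step to sit before context creation.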
An optional explicit budget can be set at export time. It travels to the runtime as an ExecuTorch CompileSpec, which is a key and value that ExecuTorch already stores in the .pte and passes back to the backend at load time. This needs no changes to ExecuTorch and no change to the engine blob format.
The option
weight_streaming_budget on save(..., output_format="executorch"):
- None (the default): the delegate applies TensorRT's automatic budget at load.
- a non-negative integer: a fixed GPU budget for weights, in bytes.
This mirrors the byte budget used by the existing Torch-TensorRT runtimes. There is no string vocabulary to remember.
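As a rough sketch of how the option could be validated at export time and carried as a key and value, consider the following. Everything here is an assumption made for illustration: the CompileSpec dataclass is a local stand-in for ExecuTorch's key/bytes pair, the key name BUDGET_KEY is hypothetical, and encoding the budget as ASCII decimal text is one possible choice, not necessarily the encoding the proposal uses.

```python
from dataclasses import dataclass
from typing import List, Optional

BUDGET_KEY = "weight_streaming_budget"  # hypothetical key name


@dataclass
class CompileSpec:
    """Local stand-in for an ExecuTorch CompileSpec: a string key and a bytes value."""

    key: str
    value: bytes


def budget_to_specs(budget: Optional[int]) -> List[CompileSpec]:
    """Export side: validate the option and turn it into zero or one specs."""
    if budget is None:
        return []  # nothing stored; the runtime falls back to the automatic budget
    if isinstance(budget, bool) or not isinstance(budget, int) or budget < 0:
        raise ValueError("weight_streaming_budget must be None or a non-negative int")
    # Assumed encoding: ASCII decimal text.
    return [CompileSpec(BUDGET_KEY, str(budget).encode("ascii"))]


def budget_from_specs(specs: List[CompileSpec]) -> Optional[int]:
    """Load side: parse the budget back, rejecting malformed or negative values."""
    for spec in specs:
        if spec.key == BUDGET_KEY:
            text = spec.value.decode("ascii")  # non-ASCII bytes raise a ValueError subclass
            if not text.isdigit():  # rejects "", "-5", "8GiB", ...
                raise ValueError(f"malformed weight streaming budget: {text!r}")
            return int(text)
    return None  # no spec present: use the automatic budget


specs = budget_to_specs(8 * 1024**3)
print(budget_from_specs(specs))
```

Because the parser is a pure function over bytes, it can be unit tested on a CPU-only machine, which is the property the proposal wants from its standalone budget parser.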
Ahead of time (Python)
import torch_tensorrt

compiled = torch_tensorrt.compile(
    model, arg_inputs=example_inputs, enable_weight_streaming=True
)

# Works out of the box. The delegate applies the automatic budget at load,
# so a model whose weights exceed GPU memory runs.
torch_tensorrt.save(
    compiled, "model.pte", arg_inputs=example_inputs, output_format="executorch"
)

# Optional: pin an explicit GPU budget for weights, in bytes.
torch_tensorrt.save(
    compiled, "model.pte", arg_inputs=example_inputs,
    output_format="executorch",
    weight_streaming_budget=8 * 1024**3,  # 8 GiB
)
Runtime (C++)
No change for users. Loading and running the program applies the budget automatically:
Module module("model.pte");
auto outputs = module.forward(inputs);
Backward compatibility
This is designed so that existing programs keep working:
- The new code only runs when the engine was built for weight streaming. Engines built with the default settings report zero streamable weights and skip the new path, so they behave exactly as before. This covers every .pte produced so far, because streaming never took effect through this delegate before.
- There is no change to the .pte format or the engine blob. Old runtimes can read new files and new runtimes can read old files.
The one intended behavior change is that an engine built with enable_weight_streaming=True now applies the automatic budget when loaded through the delegate, which is what enables the large model case. This matches the existing PyTorch runtimes.
One caveat: with the automatic budget, memory use depends on the free GPU memory at load time, and the chosen budget can differ across machines and TensorRT versions. For reproducible behavior on a known device, set a fixed budget in bytes.
Edge cases
- A budget set on an engine that was not built for streaming is ignored with a log message.
- A fixed budget larger than the engine's streamable weights size is clamped to that size.
- A malformed or negative budget is rejected at export time and at load time.
- If the automatic budget cannot be applied, the runtime retries with maximum streaming before failing.
- When a model is split into more than one TensorRT engine, a fixed byte budget applies to each engine. Leave weight_streaming_budget as None (automatic) for multi-engine models, since the automatic budget sizes each engine on its own.
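The edge cases above amount to a small decision procedure, sketched here in Python. The function names, the log callback, and the boolean-returning set_budget callable are hypothetical; the sketch also assumes that a budget of 0 bytes corresponds to maximum streaming. It is an illustration of the rules, not the delegate's C++ code.

```python
from typing import Callable, Optional


def resolve_budget(
    streamable_bytes: int,
    requested: Optional[int],
    automatic_budget: int,
    log: Callable[[str], None] = print,
) -> Optional[int]:
    """Pick the budget to apply, or None when the streaming path is skipped."""
    if streamable_bytes == 0:
        # Engine was not built for streaming: ignore any budget, but say so.
        if requested is not None:
            log("engine not built for weight streaming; ignoring budget")
        return None
    if requested is None:
        return automatic_budget  # default: automatic budget
    if requested < 0:
        raise ValueError("weight streaming budget must be non-negative")
    return min(requested, streamable_bytes)  # clamp an oversized fixed budget


def apply_with_fallback(
    set_budget: Callable[[int], bool], budget: int, automatic: bool
) -> int:
    """Apply the budget; for the automatic case, retry with maximum streaming."""
    if set_budget(budget):
        return budget
    # Assumption: 0 bytes of resident weights means "stream as much as possible".
    if automatic and set_budget(0):
        return 0
    raise RuntimeError("could not apply a weight streaming budget")


# A fixed budget is resolved per engine, so two engines can each use up to 4 GiB.
engine_sizes = [10 * 1024**3, 3 * 1024**3]
per_engine = [resolve_budget(size, 4 * 1024**3, 0) for size in engine_sizes]
print(per_engine)
```

The last lines show why the automatic setting is the safer choice for multi-engine models: a fixed value is applied to every engine separately, so the total resident weight memory can exceed the single number the user had in mind.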
TensorRT APIs used
getStreamableWeightsSize, getWeightStreamingAutomaticBudget, setWeightStreamingBudgetV2, and getWeightStreamingBudgetV2. These are already used by the existing Torch-TensorRT runtime.
Files involved
- cpp/src/torch_tensorrt/executorch/TensorRTBackend.cpp: set the budget in init().
- cpp/include and cpp/src WeightStreamingBudget: a small standalone parser for the budget value, so it can be unit tested without a GPU.
- py/torch_tensorrt/_compile.py and py/torch_tensorrt/executorch/partitioner.py: accept the option and carry it as a CompileSpec.
Alternatives considered
- Store the budget inside the Torch-TensorRT engine blob instead of a CompileSpec. This would require changing the blob format and its parser, and it puts a per-deployment setting inside a portable artifact. The CompileSpec approach avoids both.
- A live API to change the budget after the program is loaded. This does not fit the current ExecuTorch backend interface well, since the budget must be set before the execution context exists, so it is left out.
Possible follow-up
A per-deployment override could be added later using ExecuTorch load-time backend options, so a deployment can choose a fixed budget without re-exporting. This is not needed for the core feature, because the automatic budget already adapts to the deploy GPU. It can be added if a concrete need appears.
Tests
- CPU unit tests for the budget value parser.
- CPU tests for the export option and the default behavior.
- GPU tests that export with a budget, load, run, and compare outputs to eager, including a model whose weights are larger than the budget to prove the main goal.