Support TensorRT weight streaming in the ExecuTorch delegate so large models can run

## Feature

Add TensorRT weight streaming support to the Torch-TensorRT ExecuTorch delegate runtime. This lets a model whose weights do not all fit in GPU memory still run after it is exported to an ExecuTorch program (a `.pte` file).

## Problem

Torch-TensorRT can already build a weight-streamable engine when you compile with `enable_weight_streaming=True`, and it can export to ExecuTorch with `torch_tensorrt.save(..., output_format="executorch")`.

The problem is that the ExecuTorch delegate never sets a weight streaming budget. After it deserializes the engine it goes straight to creating the execution context. With no budget set, TensorRT keeps all weights resident on the GPU, so a model that is larger than GPU memory fails to run through the ExecuTorch path, even when the engine was built for streaming.

The other Torch-TensorRT runtimes already set a budget right after deserializing the engine, which is why they can stream. The ExecuTorch delegate was missing this one step.

## Solution

Set the weight streaming budget inside the ExecuTorch delegate `init()`, after the engine is deserialized and before the execution context is created. This matches the other runtimes and respects the TensorRT rule that the budget cannot change once a context exists.
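The ordering constraint can be illustrated with a toy model. This is only a sketch: `ToyEngine` and `init_delegate` are hypothetical names and are not the TensorRT or ExecuTorch API. The toy engine starts with all weights resident and refuses a budget change once a context exists, which is the rule the delegate has to respect.

```python
class ToyEngine:
    """Toy stand-in for a deserialized engine (not the TensorRT API)."""

    def __init__(self, streamable_size: int) -> None:
        self.streamable_size = streamable_size
        # With no budget set, every weight stays resident on the GPU.
        self.budget = streamable_size
        self.context_created = False

    def set_budget(self, budget: int) -> None:
        if self.context_created:
            raise RuntimeError("budget cannot change once a context exists")
        self.budget = budget

    def create_context(self) -> dict:
        self.context_created = True
        return {"resident_bytes": self.budget}


def init_delegate(engine: ToyEngine, budget: int) -> dict:
    """Hypothetical init(): set the budget first, then create the context."""
    engine.set_budget(budget)
    return engine.create_context()
```

Calling `init_delegate(ToyEngine(1000), 400)` yields a context with 400 resident bytes, and a later `set_budget` call on the same engine raises.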

By default the delegate applies TensorRT's automatic budget, which TensorRT computes at load time from the free memory on the actual GPU. So an engine built with `enable_weight_streaming=True` runs out of the box, and the budget adapts to whatever device the `.pte` is deployed on. This is the same thing the existing PyTorch runtimes do right after they deserialize an engine.

An optional explicit budget can be set at export time. It travels to the runtime as an ExecuTorch CompileSpec, which is a key and value that ExecuTorch already stores in the `.pte` and passes back to the backend at load time. This needs no changes to ExecuTorch and no change to the engine blob format.
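As a rough sketch of how the budget could travel as a key and value, the snippet below round-trips it through a spec list. The `CompileSpec` dataclass here is a local stand-in with the same shape as the ExecuTorch one (a string key and a bytes value). The key name `weight_streaming_budget`, the decimal ASCII encoding, and both helper functions are assumptions made for illustration, not the actual implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CompileSpec:
    """Local stand-in: a string key and a bytes value."""

    key: str
    value: bytes


# Hypothetical key name, chosen for illustration only.
BUDGET_KEY = "weight_streaming_budget"


def budget_to_specs(budget: Optional[int]) -> List[CompileSpec]:
    """Export side: emit a spec only when an explicit budget was requested."""
    if budget is None:
        return []  # no spec means "use the automatic budget at load"
    return [CompileSpec(BUDGET_KEY, str(budget).encode("ascii"))]


def budget_from_specs(specs: List[CompileSpec]) -> Optional[int]:
    """Load side: return the explicit budget, or None for automatic."""
    for spec in specs:
        if spec.key != BUDGET_KEY:
            continue
        text = spec.value.decode("ascii")
        # isdigit() rejects empty strings, signs, and non-numeric text.
        if not text.isdigit():
            raise ValueError(f"malformed weight streaming budget: {text!r}")
        return int(text)
    return None
```

Omitting the spec for the `None` case keeps programs exported without the option free of any new key.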

### The option

`weight_streaming_budget` on `save(..., output_format="executorch")`:

* `None` (the default): the delegate applies TensorRT's automatic budget at load.
* a non-negative integer: a fixed GPU budget for weights, in bytes.

This mirrors the byte budget used by the existing Torch-TensorRT runtimes. There is no string vocabulary to remember.
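A minimal sketch of the export-time check on this option might look like the following. The function name `validate_weight_streaming_budget` is hypothetical, and the choice of exception types is an assumption.

```python
from typing import Optional


def validate_weight_streaming_budget(budget: Optional[int]) -> Optional[int]:
    """Return the budget unchanged if it is valid, otherwise raise.

    Hypothetical helper, not the actual Torch-TensorRT code.
    """
    if budget is None:
        return None  # automatic budget, resolved by the delegate at load
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(budget, bool) or not isinstance(budget, int):
        raise TypeError(
            "weight_streaming_budget must be None or an int, "
            f"got {type(budget).__name__}"
        )
    if budget < 0:
        raise ValueError(
            f"weight_streaming_budget must be non-negative, got {budget}"
        )
    return budget
```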

### Ahead of time (Python)

```python
import torch_tensorrt

compiled = torch_tensorrt.compile(
    model, arg_inputs=example_inputs, enable_weight_streaming=True
)

# Works out of the box. The delegate applies the automatic budget at load,
# so a model whose weights exceed GPU memory runs.
torch_tensorrt.save(
    compiled, "model.pte", arg_inputs=example_inputs, output_format="executorch"
)

# Optional: pin an explicit GPU budget for weights, in bytes.
torch_tensorrt.save(
    compiled, "model.pte", arg_inputs=example_inputs,
    output_format="executorch",
    weight_streaming_budget=8 * 1024**3,  # 8 GiB
)
```

### Runtime (C++)

No change for users. Loading and running the program applies the budget automatically:

```cpp
Module module("model.pte");
auto outputs = module.forward(inputs);
```

## Backward compatibility

This is designed so that existing programs keep working:

* The new code only runs when the engine was built for weight streaming. Engines built with the default settings report zero streamable weights and skip the new path, so they behave exactly as before. This covers every `.pte` produced so far, because streaming never took effect through this delegate before.
* There is no change to the `.pte` format or the engine blob. Old runtimes can read new files and new runtimes can read old files.

The one intended behavior change is that an engine built with `enable_weight_streaming=True` now applies the automatic budget when loaded through the delegate, which is what enables the large model case. This matches the existing PyTorch runtimes.

One caveat: with the automatic budget, memory use depends on the free GPU memory at load time, and the chosen budget can differ across machines and TensorRT versions. For reproducible behavior on a known device, set a fixed budget in bytes.

## Edge cases

* A budget set on an engine that was not built for streaming is ignored with a log message.
* A fixed budget larger than the streamable size is clamped.
* A malformed or negative budget is rejected at export time and at load time.
* If the automatic budget cannot be applied, the runtime retries with maximum streaming before failing.
* When a model is split into more than one TensorRT engine, a fixed byte budget is applied to each engine separately rather than shared across them. Leave `weight_streaming_budget` as `None` (automatic) for multi-engine models, since the automatic budget sizes each engine on its own.
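
The per-engine rules above can be sketched as one pure function. This is an illustration under stated assumptions, not the delegate code: `resolve_budget` and its parameters are hypothetical, an unavailable automatic budget is modeled as `None`, and "maximum streaming" is modeled as a budget of 0 bytes.

```python
from typing import Optional


def resolve_budget(
    streamable_size: int,
    requested: Optional[int],
    automatic_budget: Optional[int],
) -> Optional[int]:
    """Return the byte budget to apply, or None to skip the streaming path.

    streamable_size:  bytes of streamable weights (0 if not built for streaming)
    requested:        explicit budget from export, or None for automatic
    automatic_budget: the automatic budget, or None if it could not be obtained
    """
    if streamable_size == 0:
        # Engine was not built for streaming: any requested budget is ignored.
        return None
    if requested is not None:
        if requested < 0:
            raise ValueError(f"negative weight streaming budget: {requested}")
        # A fixed budget larger than the streamable size is clamped.
        return min(requested, streamable_size)
    if automatic_budget is not None and automatic_budget >= 0:
        return automatic_budget
    # Assumed fallback: maximum streaming, modeled as the smallest budget.
    return 0
```

For a multi-engine model, a function like this would run once per engine with that engine's own `streamable_size`.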

## TensorRT APIs used

`getStreamableWeightsSize`, `getWeightStreamingAutomaticBudget`, `setWeightStreamingBudgetV2`, and `getWeightStreamingBudgetV2`. These are already used by the existing Torch-TensorRT runtime.

## Files involved

* `cpp/src/torch_tensorrt/executorch/TensorRTBackend.cpp`: set the budget in `init()`.
* `WeightStreamingBudget` under `cpp/include` and `cpp/src`: a small standalone parser for the budget value, so it can be unit tested without a GPU.
* `py/torch_tensorrt/_compile.py` and `py/torch_tensorrt/executorch/partitioner.py`: accept the option and carry it as a CompileSpec.

## Alternatives considered

* Store the budget inside the Torch-TensorRT engine blob instead of a CompileSpec. This would require changing the blob format and its parser, and it puts a per-deployment setting inside a portable artifact. The CompileSpec approach avoids both.
* A live API to change the budget after the program is loaded. This does not fit the current ExecuTorch backend interface well, since the budget must be set before the execution context exists, so it is left out.

## Possible follow-up

A per-deployment override could be added later using ExecuTorch load-time backend options, so a deployment can choose a fixed budget without re-exporting. This is not needed for the core feature, because the automatic budget already adapts to the deploy GPU. It can be added if a concrete need appears.

## Tests

* CPU unit tests for the budget value parser.
* CPU tests for the export option and the default behavior.
* GPU tests that export with a budget, load, run, and compare outputs to eager, including a model whose weights are larger than the budget to prove the main goal.

