Run the ExecuTorch TensorRT delegate on a caller-selected CUDA stream (green-context support) #4314

Draft
shoumikhin wants to merge 1 commit into pytorch:main from shoumikhin:fix/et-trt-caller-cuda-stream

@shoumikhin
Contributor

Summary

Previously, the ExecuTorch TensorRT delegate created and owned a private CUDA stream and ran every enqueueV3() on it, so an application could not place inference on a specific CUDA stream or context, in particular a CUDA green context for SM partitioning.

This PR lets the caller select the stream, giving the libtorch-free ExecuTorch runtime the same caller-stream capability the libtorch TensorRT runtime gained in #4232.

Changes

  • Add a scoped CudaStreamGuard (mirroring c10::cuda::CUDAStreamGuard) to select, per calling thread, the CUDA stream the delegate runs TensorRT on. With no guard active the delegate runs on cudaStreamPerThread.
  • execute() runs enqueueV3() and the staging copies on the selected stream; init() no longer creates a stream (the delegate owns none).
  • Green context: scope a guard with a stream created on the green context via cuGreenCtxStreamCreate; the partition confinement travels with the stream, so the green context need not be made current. cudaStreamPerThread is invalid while a green context is current (cudaErrorInvalidResourceHandle), so a green-context caller must scope a guard.
  • cudaSetDevice() is applied only when the engine's device differs from the current device and is restored on exit, so it no longer clobbers a context the caller established.
  • Backward compatible: device-resident outputs are left enqueued (no end sync) only while a guard is active; the default path and host-staged outputs still synchronize before returning, preserving the prior "results ready on return" behavior.
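
To illustrate the guard semantics described in the list above, here is a minimal sketch of a scoped, per-thread stream selection. It is not the code from this PR: `StreamHandle`, `default_stream()`, `StreamGuardSketch`, `guard_active()` and `selected_stream()` are hypothetical names, and `StreamHandle` is a stand-in for `cudaStream_t` so the sketch compiles without the CUDA toolkit. In real code the fallback would be `cudaStreamPerThread`.

```cpp
#include <cassert>

// Stand-in for cudaStream_t (assumption: lets the sketch build without CUDA).
using StreamHandle = void*;

// Stand-in for cudaStreamPerThread: a unique sentinel handle.
inline StreamHandle default_stream() {
  static int tag = 0;
  return &tag;
}

// Per-thread selection state. A depth counter (rather than a null check)
// marks "guard active", because a null handle is itself a valid CUDA stream.
struct ThreadState {
  StreamHandle stream = nullptr;
  int depth = 0;
};

inline ThreadState& thread_state() {
  thread_local ThreadState state;
  return state;
}

// Hypothetical scoped guard: selects a stream for the calling thread and
// restores the previous selection on scope exit, so guards can nest.
class StreamGuardSketch {
 public:
  explicit StreamGuardSketch(StreamHandle stream)
      : previous_(thread_state().stream) {
    thread_state().stream = stream;
    ++thread_state().depth;
  }
  ~StreamGuardSketch() {
    thread_state().stream = previous_;
    --thread_state().depth;
  }
  StreamGuardSketch(const StreamGuardSketch&) = delete;
  StreamGuardSketch& operator=(const StreamGuardSketch&) = delete;

 private:
  StreamHandle previous_;
};

// True while at least one guard is alive on the calling thread.
inline bool guard_active() { return thread_state().depth > 0; }

// The stream the delegate would enqueue on: the guarded stream if a guard
// is active on this thread, otherwise the per-thread default.
inline StreamHandle selected_stream() {
  return guard_active() ? thread_state().stream : default_stream();
}
```

A green-context caller would, under these assumptions, create a stream with cuGreenCtxStreamCreate, construct the guard with it, and call the delegate inside that scope. Other threads are unaffected because the state is thread-local.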

Validation

Verified on an H100 (CUDA 12.8) with an %smid probe: a cuGreenCtxStreamCreate stream confines kernels to the green context's SM partition even when the primary context is current; cudaStreamPerThread errors with cudaErrorInvalidResourceHandle while a green context is current; the non-green default path uses the full device.

No dependency on the libtorch Torch-TensorRT runtime or libtorch is added.

Follow-up: a unit test for the stream selection (guarded vs. default) can be added.

…tream

The delegate created and owned a private CUDA stream in init() and ran every
enqueueV3() on it, so an application could not place inference on a specific
CUDA stream or context (for example a CUDA green context for SM partitioning).

Let the caller select the stream instead, giving the libtorch-free ExecuTorch
runtime the same caller-stream capability the libtorch TensorRT runtime has
(pytorch#4232):

- Add a scoped CudaStreamGuard (mirroring c10::cuda::CUDAStreamGuard) to select,
  per calling thread, the CUDA stream the delegate runs TensorRT on. With no
  guard active the delegate runs on cudaStreamPerThread.
- execute() runs enqueueV3() and the staging copies on the selected stream;
  init() no longer creates a stream and the delegate owns none.
- To confine inference to a CUDA green context's SM partition the caller scopes a
  guard with a stream created on that green context (cuGreenCtxStreamCreate); the
  partition confinement travels with the stream, so the green context need not be
  made current. cudaStreamPerThread is invalid while a green context is current
  (cudaErrorInvalidResourceHandle), so a green-context caller must scope a guard.
- cudaSetDevice() is applied only when the engine's device differs from the
  current device and is restored on exit, so it no longer clobbers a context the
  caller established.
- execute() leaves device-resident outputs enqueued (no end sync) only while a
  guard is active; the default path and host-staged outputs still synchronize
  before returning, preserving existing behavior. The caller synchronizes the
  selected stream when it reads device-resident results.
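
The last two bullets (conditional device switch and conditional end sync) can be sketched as follows. This is an illustrative approximation, not the delegate's implementation: `FakeDeviceApi` is a hypothetical stand-in for `cudaGetDevice()`/`cudaSetDevice()` so the logic can run without a GPU, and `ScopedDeviceSketch` and `must_sync_before_return()` are made-up names.

```cpp
#include <cassert>

// Hypothetical stand-in for cudaGetDevice()/cudaSetDevice(); counts calls
// so the "only switch when needed" behavior is observable.
struct FakeDeviceApi {
  int current = 0;
  int set_calls = 0;
  int get() const { return current; }
  void set(int device) {
    current = device;
    ++set_calls;
  }
};

// Switches to the engine's device only if it differs from the current one,
// and restores the caller's device on scope exit.
class ScopedDeviceSketch {
 public:
  ScopedDeviceSketch(FakeDeviceApi& api, int engine_device)
      : api_(api),
        previous_(api.get()),
        switched_(previous_ != engine_device) {
    if (switched_) {
      api_.set(engine_device);
    }
  }
  ~ScopedDeviceSketch() {
    if (switched_) {
      api_.set(previous_);
    }
  }
  ScopedDeviceSketch(const ScopedDeviceSketch&) = delete;
  ScopedDeviceSketch& operator=(const ScopedDeviceSketch&) = delete;

 private:
  FakeDeviceApi& api_;
  int previous_;
  bool switched_;
};

// End-of-execute policy as described above: skip the final synchronize only
// when a guard is active AND the outputs stay on the device.
inline bool must_sync_before_return(bool guard_active,
                                    bool outputs_host_staged) {
  return !guard_active || outputs_host_staged;
}
```

When `must_sync_before_return()` is false, the caller is responsible for synchronizing the selected stream (for example with cudaStreamSynchronize) before reading the device-resident results.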

No dependency on the libtorch Torch-TensorRT runtime or libtorch is added.
@meta-cla meta-cla Bot added the cla signed label May 30, 2026
@github-actions github-actions Bot added the component: api [C++] Issues re: C++ API label May 30, 2026
@github-actions github-actions Bot requested a review from narendasan May 30, 2026 14:18
