FlowRT

High-performance inference engine for flow matching models, built in C++17 and CUDA.

Flow matching models (like Diffusion Policy for robotics and FLUX.1 for image generation) generate outputs by running a neural network 50+ times in sequence. General inference engines treat each step independently. FlowRT is built around the structure of iterative generation, exploiting properties that general engines ignore.

Three Contributions

Persistent Trajectory Kernel: a single CUDA kernel runs all N denoising steps internally, keeping intermediate state L2-resident across the full trajectory. This eliminates N-1 kernel launch overheads and the global memory round trips between steps. Target: raise the L2 hit rate from ~40% to over 85%.

Speculative Flow Matching: a small draft model proposes K steps ahead, and the full model verifies all K in one batched pass. The acceptance criterion is derived from ODE trajectory deviation bounds and uses a time-dependent threshold: tighter early in the trajectory, where errors compound, and looser late, where the path is nearly linear.
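The acceptance step can be pictured with a minimal C++17 sketch. It assumes the drafted and verified states are already available as flat float vectors. The linear threshold schedule, its two constants, the L2 distance metric, and the names `acceptance_threshold`, `l2_deviation`, and `count_accepted` are all hypothetical placeholders; the bound FlowRT actually derives is not given here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical threshold schedule: tight at t = 0, looser as t approaches 1.
// The linear form and both constants are placeholders, not FlowRT's bound.
inline double acceptance_threshold(double t, double eps_early = 1e-3,
                                   double eps_late = 1e-1) {
    assert(t >= 0.0 && t <= 1.0);
    return eps_early + (eps_late - eps_early) * t;
}

// L2 distance between a drafted state and the verified state (assumed metric).
inline double l2_deviation(const std::vector<float>& a,
                           const std::vector<float>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = static_cast<double>(a[i]) - static_cast<double>(b[i]);
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Returns how many of the K drafted steps are accepted: the longest prefix
// whose deviation stays within the threshold at each step's time.
inline std::size_t count_accepted(const std::vector<std::vector<float>>& draft,
                                  const std::vector<std::vector<float>>& verified,
                                  const std::vector<double>& times) {
    assert(draft.size() == verified.size() && draft.size() == times.size());
    std::size_t accepted = 0;
    while (accepted < draft.size() &&
           l2_deviation(draft[accepted], verified[accepted]) <=
               acceptance_threshold(times[accepted])) {
        ++accepted;
    }
    return accepted;
}
```

With this schedule, a deviation of 0.01 is rejected at t = 0 but accepted at t = 0.5, which is the time-dependent behavior described above. Stopping at the first rejected step is one possible policy; others may fit better.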

Time-Conditioned INT8 Quantization: activation distributions shift significantly across timesteps, so FlowRT calibrates 10 scale factors per layer (one per timestep bin) instead of one global scale. This recovers most of the quality lost by naive INT8 quantization.
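A minimal C++17 sketch of the per-bin scale lookup follows. Only the count of 10 bins per layer comes from the description above. Uniform bins over t in [0, 1], symmetric quantization clamped to [-127, 127], and the names `LayerScales`, `time_bin`, `quantize`, and `dequantize` are assumptions made for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int kNumTimeBins = 10;  // one scale per timestep bin, per layer

// Per-layer calibration table: one activation scale per timestep bin.
struct LayerScales {
    std::array<float, kNumTimeBins> scale;
};

// Maps t in [0, 1] to a bin index. Uniform bin width is an assumption.
inline int time_bin(float t) {
    assert(t >= 0.0f && t <= 1.0f);
    return std::min(static_cast<int>(t * kNumTimeBins), kNumTimeBins - 1);
}

// Symmetric INT8 quantization using the scale calibrated for this timestep.
inline std::int8_t quantize(float x, const LayerScales& s, float t) {
    const float q = std::round(x / s.scale[time_bin(t)]);
    return static_cast<std::int8_t>(std::clamp(q, -127.0f, 127.0f));
}

// Inverse mapping back to float, using the same per-bin scale.
inline float dequantize(std::int8_t q, const LayerScales& s, float t) {
    return static_cast<float>(q) * s.scale[time_bin(t)];
}
```

A single global scale would have to cover the widest bin, which spends INT8 range on values that never occur at the other timesteps. Indexing the scale by bin keeps the table small (10 floats per layer) while tracking the shift.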

Target

Under 10ms per sample on RTX 4090, enabling 100Hz real-time robot control. Projected 17x latency reduction over a PyTorch baseline on Diffusion Policy.
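The arithmetic behind the target can be written out as a small C++17 sketch. It assumes the 50-step loop listed for Diffusion Policy and an even split of the sample budget across steps, which is a simplification. The function names are hypothetical.

```cpp
#include <cassert>

// Control period in milliseconds for a given control rate in Hz.
constexpr double period_ms(double rate_hz) {
    assert(rate_hz > 0.0);
    return 1000.0 / rate_hz;
}

// Average time available per denoising step, in microseconds, if the whole
// sample budget is split evenly across the steps.
constexpr double per_step_budget_us(double sample_budget_ms, int num_steps) {
    assert(num_steps > 0);
    return sample_budget_ms * 1000.0 / num_steps;
}

static_assert(period_ms(100.0) == 10.0,
              "100 Hz control leaves 10 ms per sample");
static_assert(per_step_budget_us(10.0, 50) == 200.0,
              "50 steps in 10 ms is 200 us per step");
```

Under these assumptions each step gets roughly 200 microseconds on average, so fixed per-step costs such as kernel launches take a meaningful share of the budget.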

Stack

C++17 · CUDA 12.x · CUTLASS 3.x · ONNX Runtime · pybind11 · TensorRT · CMake

Supported Models

Diffusion Policy

Flow matching visuomotor policy. Observations and actions are concatenated into a single input to an 8-layer transformer, which is integrated with a 50-step Euler denoising loop.

unifolm-vla DiT Action Head

Flow matching action head from unifolm-vla. Cross-attention DiT conditioned on Qwen2.5-VL backbone features. Supports the same VLM + flow matching DiT architecture as GR00T N1.5 and Pi0.

Export: python export_unifolm_dit.py

The export resolves four blockers that prevent naive ONNX export of the original unifolm-vla codebase:

Denoising loop unrolling: the loop is moved outside the model (SingleStepActionHead)
torch.randn inside the inference path: noise is passed in as an external input argument
torch.autocast("cuda") in the forward pass: removed; precision is set at compile time
BatchFeature API boundary: plain tensor I/O is used throughout

The benchmark shows that even with a compiled ONNX model, the Python dispatch loop between steps adds per-step overhead that scales with N. Moving the loop to C++ — the core contribution of the OpenVINO RobotActionPipeline — eliminates this on Intel iGPU the same way FlowRT's persistent trajectory kernel eliminates it on NVIDIA.
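To illustrate what moving the loop out of Python looks like, here is a minimal host-side C++17 sketch of an Euler loop that drives a single-step model. It is not the persistent trajectory kernel. `VelocityFn` stands in for a compiled single-step model and its signature is hypothetical, as are the names `State` and `run_euler`. Integrating from t = 0 to t = 1 with a uniform step is an assumed convention.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

using State = std::vector<float>;

// One network evaluation: predicts the velocity at state x and time t.
// Stands in for a compiled single-step model; the signature is hypothetical.
using VelocityFn = std::function<State(const State& x, float t)>;

// Fixed-step Euler integration from t = 0 to t = 1 (assumed convention).
// `noise` is supplied by the caller, mirroring the export's external noise
// input, so the loop itself contains no random number generation.
inline State run_euler(const VelocityFn& velocity, State noise, int num_steps) {
    assert(num_steps > 0);
    const float dt = 1.0f / static_cast<float>(num_steps);
    State x = std::move(noise);
    for (int step = 0; step < num_steps; ++step) {
        const float t = static_cast<float>(step) * dt;
        const State v = velocity(x, t);
        assert(v.size() == x.size());
        for (std::size_t i = 0; i < x.size(); ++i) {
            x[i] += dt * v[i];  // Euler update: x <- x + dt * v(x, t)
        }
    }
    return x;
}
```

A caller would wrap its own model invocation in a `VelocityFn` and call `run_euler(model_step, noise, 50)`. The per-step dispatch then happens in compiled code rather than in the Python interpreter.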

