Hi, thank you for the great work on TensorRT-Edge-LLM!
Detailed description of the requested feature
I'd like to request support for the Gemma 4 model family, particularly Gemma 4 E2B (2B parameters).
We currently run Qwen3-VL-2B via Edge-LLM on Jetson Orin NX. Edge-LLM's TensorRT optimization and vocabulary reduction work well with Qwen3-VL, but we find that Gemma 4 E2B produces better-quality output at the same 2B scale. For now we run Gemma 4 E2B through llama.cpp as a workaround, and would love to leverage TensorRT optimization for it.
Vocabulary reduction support would be especially valuable: Gemma 4's 262k vocab makes decoding heavily memory-bandwidth-bound on Orin NX. The vocab reduction feature was a significant speedup for Qwen3-VL, and we'd expect even larger gains for Gemma 4 given its much larger default vocab.
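For context, here is a rough back-of-envelope sketch of why the large vocab hurts decode on bandwidth-limited hardware. All numbers are illustrative assumptions (hidden size, dtype, and the reduced-vocab size are guesses, not measured values):

```python
# Rough estimate of LM-head weight traffic per decode step.
# Assumptions (not measured): hidden size ~2048 for a ~2B model,
# fp16 weights, and a hypothetical reduced vocabulary of 32k tokens.

BYTES_FP16 = 2
hidden_size = 2048          # assumed hidden dimension
full_vocab = 262_144        # Gemma-family default vocabulary size
reduced_vocab = 32_000      # hypothetical reduced vocabulary

def lm_head_bytes(vocab: int) -> int:
    """Weight bytes the final projection reads for one decode step."""
    return vocab * hidden_size * BYTES_FP16

full = lm_head_bytes(full_vocab)        # ~1.07 GB per step
reduced = lm_head_bytes(reduced_vocab)  # ~0.13 GB per step
print(f"full vocab:    {full / 1e9:.2f} GB/step")
print(f"reduced vocab: {reduced / 1e9:.2f} GB/step")
print(f"reduction:     {full / reduced:.1f}x less LM-head weight traffic")
```

Under these assumptions the LM head alone reads on the order of a gigabyte of weights per generated token, so cutting the vocabulary should translate fairly directly into decode throughput on Orin NX.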
I understand Gemma 3 support is currently the priority (#36). I also note that TensorRT-LLM (server-side) has begun merging Gemma 4 text support (NVIDIA/TensorRT-LLM#12808). Filing this to register edge-side interest for future planning.
Timeline
Nice to have. We have a working llama.cpp path in the meantime.
Describe alternatives you've considered
- llama.cpp with GGUF Q4_K_XL — currently in use. Works, but misses TensorRT optimization and vocabulary reduction.
- Qwen3-VL-2B via Edge-LLM — currently in use with good performance, but Gemma 4 E2B produces better quality output at the same parameter count.
Target hardware/use case
Jetson Orin NX 16GB, real-time VLM inference with image input.