Feature request: Gemma 4 model support #72

@KyosukeIchikawa

Description

Hi, thank you for the great work on TensorRT-Edge-LLM!

Detailed description of the requested feature

I'd like to request support for the Gemma 4 model family, particularly Gemma 4 E2B (2B parameters).

We're currently running Qwen3-VL-2B via Edge-LLM on a Jetson Orin NX. Edge-LLM's TensorRT optimization and vocabulary reduction work well with Qwen3-VL, but we find that Gemma 4 E2B produces better-quality output at the same 2B scale. For now we run Gemma 4 E2B through llama.cpp as a workaround, and we'd love to bring it under TensorRT optimization instead.

Vocabulary reduction support would be especially valuable: Gemma 4's 262k vocabulary makes decoding heavily memory-bandwidth-bound on Orin NX, since every decode step has to read the full LM-head weight matrix, and that read scales linearly with vocabulary size. Vocabulary reduction was a significant speedup for Qwen3-VL, and we'd expect even larger gains for Gemma 4 given the 8x larger default vocabulary.
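To make that concrete, here's a minimal PyTorch sketch of the idea as I understand it (illustrative only, not Edge-LLM's actual API; the hidden size, kept-vocabulary size, and kept token ids are all assumptions):

```python
import torch

# Minimal sketch of vocabulary reduction (hypothetical, not Edge-LLM's API).
# The LM head projects each decoded hidden state to one logit per token, so
# its weight read per step scales with vocabulary size. Keeping only the
# token ids the deployment actually needs shrinks that read proportionally.

hidden_size = 2048        # assumed hidden dim for a ~2B model
full_vocab = 262_144      # Gemma's 262k vocabulary
kept_vocab = 32_000       # illustrative reduced vocabulary size

full_head = torch.nn.Linear(hidden_size, full_vocab, bias=False)

# Placeholder id set; in practice these would come from corpus/token statistics.
kept_ids = torch.arange(kept_vocab)

reduced_head = torch.nn.Linear(hidden_size, kept_vocab, bias=False)
with torch.no_grad():
    # Each row of the LM-head weight corresponds to one vocabulary token.
    reduced_head.weight.copy_(full_head.weight[kept_ids])

# Weight bytes read per decode step at fp16 (2 bytes/element).
bytes_full = full_vocab * hidden_size * 2
bytes_kept = kept_vocab * hidden_size * 2
print(f"LM-head read per token: {bytes_full / 2**20:.0f} MiB "
      f"-> {bytes_kept / 2**20:.0f} MiB")
```

Under these assumed sizes that's roughly 1 GiB vs. 125 MiB of LM-head weight traffic per generated token, which is why we'd expect the gains to exceed what we saw with Qwen3-VL's smaller vocabulary.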

I understand that Gemma 3 support is currently the priority (#36). I also note that TensorRT-LLM (server-side) has begun merging Gemma 4 text support (NVIDIA/TensorRT-LLM#12808). I'm filing this issue to register edge-side interest for future planning.

Timeline

Nice to have. We have a working llama.cpp path in the meantime.

Describe alternatives you've considered

  • llama.cpp with GGUF Q4_K_XL: currently in use. Works, but misses TensorRT optimization and vocabulary reduction.
  • Qwen3-VL-2B via Edge-LLM: currently in use with good performance, but Gemma 4 E2B produces better-quality output at the same parameter count.

Target hardware/use case

Jetson Orin NX 16GB, real-time VLM inference with image input.
