Thanks for the great work!
According to Table 3 in the LLaVA-UHD v3 paper, adding WTC and RPE to siglip2-so400m-16 reduces TTFT by 73 ms. I ran a small experiment and found that adding RPE and WTC actually makes the original ViT model slower. Is there something wrong with my setup?
Test details:
- GPU: L40S
- model: siglip2-so400m-384-patch16
- dtype: bf16
- attention impl: FlashAttention 2
- protocol: 10 inference runs after warm-up, average latency reported. The WTC in layer 27 (the pixel-unshuffle layer) is omitted.
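For reproducibility, here is a simplified sketch of the measurement loop (the lambda is just a placeholder for the model forward pass). One caveat worth double-checking on GPU: CUDA launches are asynchronous, so without a `torch.cuda.synchronize()` before each clock read the measured times can reflect only kernel-launch overhead.

```python
import time

def benchmark(fn, warmup=3, iters=10):
    """Average latency of fn() in ms over `iters` runs, after `warmup` runs."""
    for _ in range(warmup):
        fn()
    # On GPU, call torch.cuda.synchronize() here (and after the loop)
    # so the clock brackets the actual kernel execution.
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1000.0

# placeholder workload standing in for the ViT forward pass
avg_ms = benchmark(lambda: sum(i * i for i in range(10000)))
```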
ViT-UHD implementation: patch size changed from 16 to 8, with two average-pooling layers added before layer indices 4 and 18.

Results:

| input image size | model | avg. latency | visual tokens |
|---|---|---|---|
| 512×512 | original siglip2 | 12 ms | 1024 |
| 512×512 | ViT-UHD | 15 ms | 256 |
| 1024×1024 | original siglip2 | 38 ms | 4096 |
| 1024×1024 | ViT-UHD | 68 ms | 1024 |
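For reference, the token counts above follow from this arithmetic (function names are mine, not from the paper). Note that with patch size 8 the stem runs on 4× the baseline token count until the first pooling layer, so the early layers may dominate latency even though the final token count is 75% smaller.

```python
def vit_tokens(img_size: int, patch: int) -> int:
    """Token count for a plain ViT: (img_size / patch)^2."""
    return (img_size // patch) ** 2

def vit_uhd_tokens(img_size: int, patch: int = 8, num_pools: int = 2) -> int:
    """ViT-UHD-style count: patch-8 stem, then two 2x2 average pools,
    each cutting the token count by 4x."""
    return vit_tokens(img_size, patch) // (4 ** num_pools)

print(vit_tokens(512, 16), vit_uhd_tokens(512))    # 1024 256
print(vit_tokens(1024, 16), vit_uhd_tokens(1024))  # 4096 1024
```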
In my implementation, ViT-UHD is clearly slower than the siglip2 baseline. Is the TTFT advantage reported in the paper mainly due to the 75% reduction in visual tokens?