
Where does the reduction of TTFT come from in LLaVa-UHD v3? Is it from ViT-UHD inference speed or visual token reduction in LLM prefilling? #37

@lawsonxwl

Description


Thanks for the great work!

According to Table 3 in the LLaVa-UHD v3 paper, adding WTC and RPE to siglip2-so400m-16 reduces TTFT by 73 ms. I ran a small experiment and found that adding RPE and WTC actually makes the original ViT model slower. Is there something wrong with my experiment?

Test details:
gpu: L40S
model: siglip2-so400m-384-patch16
dtype: bf16
attention impl: FlashAttention-2
Run inference 10 times after warm-up and report the average latency. The WTC in layer 27 (the pixel-unshuffle layer) is omitted.
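For reference, the warm-up-then-average timing above looks roughly like this (a minimal sketch; the lambda is a toy stand-in for the actual siglip2 forward pass, and on GPU you would additionally call `torch.cuda.synchronize()` before each clock read so the measurement covers kernel execution, not just the launch):

```python
import time

def benchmark(fn, warmup=3, iters=10):
    """Average latency of fn() in milliseconds: run a few warm-up
    iterations first, then time `iters` runs and average.
    NOTE: for CUDA models, insert torch.cuda.synchronize() before
    reading the clock, since GPU execution is asynchronous."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1e3

# Toy stand-in for a model forward pass.
avg_ms = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"{avg_ms:.3f} ms")
```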

input image shape: 512*512
original siglip2 model:
speed: 12 ms
tokens: 1024

ViT-UHD implementation: patch size changed from 16 to 8, with two avg-pooling layers added before layer indices 4 and 18
speed: 15 ms
tokens: 256

input image shape: 1024*1024
original siglip2 model:
speed: 38ms
tokens: 4096

ViT-UHD implementation
speed: 68ms
tokens: 1024
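The token counts above follow directly from the patch size and pooling; a quick sanity check (assuming square inputs and 2×2 average pools, each cutting the count 4×):

```python
def vit_tokens(image_side, patch, pools=0):
    """Visual tokens for a square image: (image_side / patch)^2,
    divided by 4 for each 2x2 average-pooling layer."""
    side = image_side // patch
    return side * side // (4 ** pools)

# Baseline siglip2: patch 16, no pooling.
print(vit_tokens(512, 16))           # 1024
print(vit_tokens(1024, 16))          # 4096
# ViT-UHD setting: patch 8, two 2x2 avg pools -> 16x reduction.
print(vit_tokens(512, 8, pools=2))   # 256
print(vit_tokens(1024, 8, pools=2))  # 1024
```

So in both image sizes the ViT-UHD configuration emits 4× fewer tokens than the baseline, even though it starts from 4× more patches.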

ViT-UHD is clearly slower than the siglip2 baseline in my implementation. Is it true that the TTFT advantage comes from the 75% visual token reduction?
