Thanks for the great work!
According to Table 3 in the LLaVA-UHD v3 paper, adding WTC and RPE to siglip2-so400m-16 reduces TTFT by 73 ms. I ran a small experiment and found that adding RPE and WTC actually makes the original ViT model slower. Is there something wrong with my setup?
Test details:
- GPU: L40S
- model: siglip2-so400m-384-patch16
- dtype: bf16
- attention impl: FlashAttention 2
- protocol: 10 inference runs after warm-up, average latency reported. The WTC in layer 27 (the pixel-unshuffle layer) is omitted.
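For reproducibility, here is a simplified sketch of the measurement loop (the lambda is just a placeholder for the model forward pass). One caveat worth double-checking on GPU: CUDA launches are asynchronous, so without a `torch.cuda.synchronize()` before each clock read the measured times can reflect only kernel-launch overhead.

```python
import time

def benchmark(fn, warmup=3, iters=10):
    """Average latency of fn() in ms over `iters` runs, after `warmup` runs."""
    for _ in range(warmup):
        fn()
    # On GPU, call torch.cuda.synchronize() here (and after the loop)
    # so the clock brackets the actual kernel execution.
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1000.0

# placeholder workload standing in for the ViT forward pass
avg_ms = benchmark(lambda: sum(i * i for i in range(10000)))
```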
ViT-UHD implementation: patch size changed from 16 to 8, with two average-pooling layers added before layer indices 4 and 18.

Results:

| input image size | model | avg. latency | visual tokens |
|---|---|---|---|
| 512×512 | original siglip2 | 12 ms | 1024 |
| 512×512 | ViT-UHD | 15 ms | 256 |
| 1024×1024 | original siglip2 | 38 ms | 4096 |
| 1024×1024 | ViT-UHD | 68 ms | 1024 |
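For reference, the token counts above follow from this arithmetic (function names are mine, not from the paper). Note that with patch size 8 the stem runs on 4× the baseline token count until the first pooling layer, so the early layers may dominate latency even though the final token count is 75% smaller.

```python
def vit_tokens(img_size: int, patch: int) -> int:
    """Token count for a plain ViT: (img_size / patch)^2."""
    return (img_size // patch) ** 2

def vit_uhd_tokens(img_size: int, patch: int = 8, num_pools: int = 2) -> int:
    """ViT-UHD-style count: patch-8 stem, then two 2x2 average pools,
    each cutting the token count by 4x."""
    return vit_tokens(img_size, patch) // (4 ** num_pools)

print(vit_tokens(512, 16), vit_uhd_tokens(512))    # 1024 256
print(vit_tokens(1024, 16), vit_uhd_tokens(1024))  # 4096 1024
```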
In my implementation, ViT-UHD is clearly slower than the siglip2 baseline. Is the TTFT advantage reported in the paper mainly due to the 75% reduction in visual tokens?