<img width="1508" height="781" alt="Image" src="https://github.com/user-attachments/assets/5c121a87-bc69-4238-9154-2a7693244544" />

<img width="1501" height="365" alt="Image" src="https://github.com/user-attachments/assets/8496f35a-93c4-4175-833c-1c9918a64f83" />

According to the simulation, for DeepSeek V3 671B with dp=ep=128, MoE dispatch and combine account for over 90% of the total latency, which seems unreasonable.
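
For a rough sanity check on that share, here is a back-of-envelope sketch of the all-to-all cost per MoE layer. It is not the simulator's model. The hidden size (7168) and top-k (8) are taken from the published DeepSeek V3 config. Everything else is an assumption for illustration: `tokens_per_rank`, `link_bandwidth_bytes_per_s`, `other_s`, FP8 payloads for dispatch and BF16 for combine, and the function names themselves. It also assumes every routed token copy crosses the network, so it is an upper bound that ignores any deduplication or local experts.

```python
# Back-of-envelope estimate of MoE dispatch/combine time per layer.
# All names and default values below are illustrative assumptions,
# not values read from the simulator.

HIDDEN_SIZE = 7168  # DeepSeek V3 hidden size (published config)
TOP_K = 8           # routed experts per token (published config)


def moe_all_to_all_bytes(tokens_per_rank, hidden_size, top_k, bytes_per_elem):
    """Bytes one rank sends if every routed token copy goes to a remote rank."""
    return tokens_per_rank * top_k * hidden_size * bytes_per_elem


def transfer_time_s(num_bytes, link_bandwidth_bytes_per_s):
    """Pure bandwidth term; latency and congestion are ignored."""
    return num_bytes / link_bandwidth_bytes_per_s


def comm_fraction(dispatch_s, combine_s, other_s):
    """Share of total layer time spent in dispatch plus combine."""
    comm = dispatch_s + combine_s
    return comm / (comm + other_s)


if __name__ == "__main__":
    tokens_per_rank = 128              # hypothetical decode batch per rank
    link_bandwidth_bytes_per_s = 50e9  # hypothetical per-rank network bandwidth
    other_s = 200e-6                   # hypothetical attention + expert compute time

    dispatch_bytes = moe_all_to_all_bytes(tokens_per_rank, HIDDEN_SIZE, TOP_K, 1)  # FP8 assumed
    combine_bytes = moe_all_to_all_bytes(tokens_per_rank, HIDDEN_SIZE, TOP_K, 2)   # BF16 assumed

    dispatch_s = transfer_time_s(dispatch_bytes, link_bandwidth_bytes_per_s)
    combine_s = transfer_time_s(combine_bytes, link_bandwidth_bytes_per_s)

    print(f"dispatch: {dispatch_s * 1e6:.1f} us, combine: {combine_s * 1e6:.1f} us")
    print(f"comm share: {comm_fraction(dispatch_s, combine_s, other_s):.1%}")
```

Plugging in the batch size and bandwidth the simulation actually uses should show whether a share above 90% comes from the bandwidth term alone or from something else in the model, such as per-message latency or a missing overlap with compute.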