Currently, the waiting queue is implemented inside the DPScheduler. However, this implementation sends requests to the scheduler one by one, which may reduce the effective batch size at the first attention layer.
Moving the waiting queue into the tokenizer and gathering the tokenized requests into a full batch before dispatching them may ease this issue.
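
For illustration, a minimal sketch of such a tokenizer-side waiting queue is shown below. It accumulates tokenized requests and flushes them as one batch, either when the batch is full or after a short wait. The names `TokenizedRequest`, `send_to_scheduler`, `max_batch_size`, and `max_wait_s` are hypothetical and not part of the existing code.

```python
import asyncio
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TokenizedRequest:
    rid: str
    input_ids: List[int]


class BatchingWaitQueue:
    """Hypothetical tokenizer-side waiting queue that batches requests
    before forwarding them to the scheduler."""

    def __init__(
        self,
        send_to_scheduler: Callable[[List[TokenizedRequest]], None],
        max_batch_size: int = 32,
        max_wait_s: float = 0.005,
    ):
        self._send = send_to_scheduler          # callback that forwards one batch
        self._max_batch_size = max_batch_size   # flush when this many requests queue up
        self._max_wait_s = max_wait_s           # or after the flush loop's wait interval
        self._pending: List[TokenizedRequest] = []
        self._lock = asyncio.Lock()

    async def put(self, req: TokenizedRequest):
        # Called by the tokenizer after tokenizing each incoming request.
        async with self._lock:
            self._pending.append(req)
            if len(self._pending) >= self._max_batch_size:
                self._flush_locked()

    async def flush_loop(self):
        # Periodically flush so a partial batch is not held back indefinitely.
        while True:
            await asyncio.sleep(self._max_wait_s)
            async with self._lock:
                if self._pending:
                    self._flush_locked()

    def _flush_locked(self):
        # Swap out the pending list and send it as a single batch, so the
        # scheduler sees a full batch at the first attention layer instead
        # of one request at a time.
        batch, self._pending = self._pending, []
        self._send(batch)
```

The key design choice in this sketch is that requests are forwarded in one message per batch rather than one message per request; the `max_wait_s` interval bounds the extra queueing latency paid to build a fuller batch.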