
Conversation

@jasonqinzhou
Contributor

Overview:

The decode warmup step takes too long when there are too many batch sizes to warm up, and it can also crash the deployment by exceeding the memory limit.
Reduce batch_sizes to an exponential sequence and turn on enable_padding so that intermediate batch sizes are handled by padding, as sketched below.
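A minimal sketch of the idea, not the actual code in this PR (the helper name and exact behavior are assumptions): only powers of two up to the maximum batch size are warmed up, and with padding enabled a runtime batch size that falls between two warmed-up sizes is padded up to the next one, so the intermediate sizes never need their own warmup pass.

```python
# Hypothetical sketch: build an exponential list of decode warmup batch sizes
# instead of warming up every batch size from 1 to max_batch_size.

def exponential_batch_sizes(max_batch_size: int) -> list[int]:
    """Powers of two up to max_batch_size, always including max_batch_size."""
    sizes = []
    bs = 1
    while bs < max_batch_size:
        sizes.append(bs)
        bs *= 2
    sizes.append(max_batch_size)
    return sizes

# Example: max_batch_size=256 -> [1, 2, 4, 8, 16, 32, 64, 128, 256]
print(exponential_batch_sizes(256))
```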

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

@tianhaox
Contributor

tianhaox commented Jan 15, 2026

This will hugely impact decode perf; we need to be smarter with this setting. Do you see any problem with the current setting? Maybe we can prune some batch sizes when bs is larger than a specific value, say, 128.
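For illustration only (hypothetical helper, not code from this PR): a less aggressive option along these lines would keep every batch size up to the threshold, so small-batch decode perf is untouched, and only thin out the sizes above it.

```python
# Hypothetical sketch: warm up all batch sizes up to `threshold`, then only
# powers of two (plus max_batch_size) beyond it.

def pruned_batch_sizes(max_batch_size: int, threshold: int = 128) -> list[int]:
    sizes = list(range(1, min(threshold, max_batch_size) + 1))
    bs = threshold * 2
    while bs < max_batch_size:
        sizes.append(bs)
        bs *= 2
    if max_batch_size > threshold:
        sizes.append(max_batch_size)
    return sizes

# Example: max_batch_size=512 -> [1, 2, ..., 128, 256, 512]
```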

@jasonqinzhou
Contributor Author

> This will hugely impact decode perf; we need to be smarter with this setting. Do you see any problem with the current setting? Maybe we can prune some batch sizes when bs is larger than a specific value, say, 128.

The problem is that this process takes a very long time for large batch sizes, and it could also crash the deployment with OOM.
Sure, we can try a less aggressive approach.

