Description
I’m currently facing an issue handling requests whose batch size is greater than the `max_batch_size` of the model hosted in Triton. The PyTriton chunking guide suggests this can be addressed, but I’m not sure how to implement it using the Triton client (`tritonclient`).
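To illustrate, this is the kind of client-side chunking I have in mind (an untested sketch; `my_model`, `INPUT_0`/`OUTPUT_0`, and `MAX_BATCH_SIZE = 32` are placeholder assumptions, not taken from any docs):

```python
import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype

MAX_BATCH_SIZE = 32  # the model's max_batch_size from its config (placeholder)


def infer_in_chunks(client: httpclient.InferenceServerClient, data: np.ndarray) -> np.ndarray:
    """Split an oversized batch along axis 0 and send the pieces sequentially."""
    outputs = []
    for start in range(0, data.shape[0], MAX_BATCH_SIZE):
        chunk = data[start:start + MAX_BATCH_SIZE]
        inp = httpclient.InferInput("INPUT_0", list(chunk.shape), np_to_triton_dtype(chunk.dtype))
        inp.set_data_from_numpy(chunk)
        out = httpclient.InferRequestedOutput("OUTPUT_0")
        result = client.infer(model_name="my_model", inputs=[inp], outputs=[out])
        outputs.append(result.as_numpy("OUTPUT_0"))
    # reassemble the per-chunk results in the original order
    return np.concatenate(outputs, axis=0)


client = httpclient.InferenceServerClient(url="localhost:8000")
big_batch = np.random.rand(100, 3).astype(np.float32)  # 100 > max_batch_size
print(infer_in_chunks(client, big_batch).shape)
```

Sequential chunking like this seems straightforward, but the sub-batches are sent one at a time, which is why I’m asking about a recommended concurrent/async pattern below.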
Related Open Issues
- Installing `pytriton` pulls in the Triton binaries, which I don’t need for client-side operations. I found this issue where others have mentioned the lack of a lightweight `pytriton.client` package. Any updates on this?
- There’s an ongoing discussion in Triton server issue #4547 about handling large requests, but there haven’t been updates there either.
Questions
- How can I handle requests where the batch size exceeds the model’s `max_batch_size`? Specifically, I’d like to know how to split these large requests efficiently and send them to Triton in smaller batches.
- Could you provide a minimal working example using TritonClient?
  - I’ve seen the PyTriton example, which includes asynchronous support, but I’m looking for something similar with TritonClient.
  - If possible, an example using `concurrent.futures` or async functionality would be very helpful (roughly along the lines of the sketch after this list).
- Is there a plan to release a standalone `pytriton.client` package to avoid installing the full `pytriton`? Alternatively, is there a plan to include this batch-splitting logic in Triton server itself?
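In case it helps clarify what I’m after, here is a rough `concurrent.futures` variant (again an untested sketch with placeholder model/tensor names; it creates one HTTP client per sub-batch simply to sidestep any thread-safety questions):

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype

URL = "localhost:8000"
MODEL = "my_model"    # placeholder
MAX_BATCH_SIZE = 32   # the model's max_batch_size (placeholder)


def _infer_chunk(chunk: np.ndarray) -> np.ndarray:
    # one client per sub-batch keeps the example simple and thread-safe
    client = httpclient.InferenceServerClient(url=URL)
    inp = httpclient.InferInput("INPUT_0", list(chunk.shape), np_to_triton_dtype(chunk.dtype))
    inp.set_data_from_numpy(chunk)
    out = httpclient.InferRequestedOutput("OUTPUT_0")
    result = client.infer(model_name=MODEL, inputs=[inp], outputs=[out])
    return result.as_numpy("OUTPUT_0")


def infer_large_batch(data: np.ndarray, max_workers: int = 4) -> np.ndarray:
    # split the oversized batch along axis 0 into sub-batches of at most MAX_BATCH_SIZE
    chunks = [data[i:i + MAX_BATCH_SIZE] for i in range(0, data.shape[0], MAX_BATCH_SIZE)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves chunk order, so concatenation restores the original batch
        results = list(pool.map(_infer_chunk, chunks))
    return np.concatenate(results, axis=0)


if __name__ == "__main__":
    big_batch = np.random.rand(100, 3).astype(np.float32)  # 100 > max_batch_size
    print(infer_large_batch(big_batch).shape)
```

If there is an officially recommended pattern (for example, reusing a single client with its async API), I’d much prefer that over this ad-hoc approach.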
Thanks in advance!