
[Question] Help with Client-Side Batching for Large Requests in Triton #818

@harsh-boloai

Description

I’m currently facing an issue with handling requests whose batch size is greater than the max_batch_size of the model hosted in Triton. The chunking guide for PyTriton suggests this can be addressed, but I’m not sure how to implement it using the Triton client.

Related Open Issues

  • Installing pytriton pulls in the Triton binaries, which I don’t need for client-side operations. I found this issue where others have mentioned the lack of a lightweight pytriton.client package. Any updates on this?

  • There’s an ongoing discussion in Triton server issue #4547 about handling large requests, but there haven’t been updates there either.

Questions

  1. How can I handle requests where the batch size exceeds the model’s max_batch_size? Specifically, I’d like to know how to split these large requests efficiently and send them to Triton in smaller batches.

  2. Could you provide a minimal working example using the tritonclient library?

    • I’ve seen the PyTriton example, which includes asynchronous support, but I’m looking for something similar with tritonclient.
    • If possible, an example using concurrent.futures or async functionality would be very helpful (I’ve sketched what I have in mind after this list).
  3. Is there a plan to release a standalone pytriton.client package to avoid installing the full pytriton? Alternatively, is there a plan to include this batch splitting logic in Triton server itself?
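
For reference, here is the kind of client-side chunking I have in mind. It is a minimal, untested sketch using tritonclient’s gRPC client together with concurrent.futures; the server URL, the model name "my_model", the tensor names "INPUT__0"/"OUTPUT__0", the FP32 dtype, and the max_batch_size of 8 are all placeholders for illustration, not values from any real deployment.

```python
import numpy as np
import tritonclient.grpc as grpcclient
from concurrent.futures import ThreadPoolExecutor

MODEL_NAME = "my_model"    # placeholder: your model's name in the repository
INPUT_NAME = "INPUT__0"    # placeholder: input tensor name from config.pbtxt
OUTPUT_NAME = "OUTPUT__0"  # placeholder: output tensor name from config.pbtxt
MAX_BATCH_SIZE = 8         # must match max_batch_size in config.pbtxt

# gRPC channels are safe to share across threads, so one client suffices.
client = grpcclient.InferenceServerClient(url="localhost:8001")

def infer_chunk(chunk: np.ndarray) -> np.ndarray:
    """Send a single sub-batch (at most MAX_BATCH_SIZE rows) to Triton."""
    inp = grpcclient.InferInput(INPUT_NAME, list(chunk.shape), "FP32")
    inp.set_data_from_numpy(chunk)
    out = grpcclient.InferRequestedOutput(OUTPUT_NAME)
    result = client.infer(MODEL_NAME, inputs=[inp], outputs=[out])
    return result.as_numpy(OUTPUT_NAME)

def infer_large_batch(data: np.ndarray, workers: int = 4) -> np.ndarray:
    """Split a large batch along axis 0, run the chunks in parallel,
    and reassemble the outputs in the original order."""
    chunks = [data[i:i + MAX_BATCH_SIZE]
              for i in range(0, len(data), MAX_BATCH_SIZE)]
    # executor.map preserves input order, so concatenation lines up
    # with the original rows.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outputs = list(pool.map(infer_chunk, chunks))
    return np.concatenate(outputs, axis=0)

if __name__ == "__main__":
    big_batch = np.random.rand(100, 3).astype(np.float32)  # 100 > MAX_BATCH_SIZE
    print(infer_large_batch(big_batch).shape)
```

If an asyncio-based version is the better fit, my understanding is that recent tritonclient releases also ship tritonclient.grpc.aio and tritonclient.http.aio with an awaitable infer(), though I haven’t tried them myself. Is something along these lines the recommended approach, or is there a built-in way to do this?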

Thanks in advance!
