
Inference Throughput vs. Batch Size #1593

@quic-nmorillo

Description

I'm running a BERT-base ONNX model in the TensorRT Docker container. The model has a dynamic batch size.

docker pull nvcr.io/nvidia/tensorrt:21.09-py3
docker run --gpus 1 -it --rm -v $PWD:/workspace nvcr.io/nvidia/tensorrt:21.09-py3

Run trtexec:

trtexec --onnx=./data/bert/bert-base-128.onnx --useCudaGraph --iterations=1000 --workspace=10000 --fp16 --optShapes=input_mask:1x128,segment_ids:1x128,input_ids:1x128 --verbose
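For reference, a single engine can also be built to cover the whole batch range and then be benchmarked at a specific batch size, roughly like the sketch below; the min/opt/max values here are only an illustrative choice, not exactly what produced the numbers further down:

trtexec --onnx=./data/bert/bert-base-128.onnx --fp16 --workspace=10000 \
    --minShapes=input_ids:1x128,input_mask:1x128,segment_ids:1x128 \
    --optShapes=input_ids:8x128,input_mask:8x128,segment_ids:8x128 \
    --maxShapes=input_ids:32x128,input_mask:32x128,segment_ids:32x128 \
    --shapes=input_ids:32x128,input_mask:32x128,segment_ids:32x128 \
    --useCudaGraph --iterations=1000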

Here is my table of throughput and latency measurements:

bert-base:

| Input shape | Throughput (qps) | Avg. E2E latency (ms) |
|-------------|------------------|-----------------------|
| 1x128       | 1089.81          | 1.75901               |
| 2x128       | 921.662          | 2.09433               |
| 4x128       | 734.285          | 2.64538               |
| 8x128       | 480.935          | 4.04696               |
| 32x128      | 149.114          | 13.1884               |

Question: How should I interpret these results? Why does throughput decrease as the batch size increases?
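My working assumption, which I'd like confirmed, is that the reported qps counts each batched inference as one query, so per-sample throughput would be qps × batch size. A quick sketch of that conversion using the numbers from the table above, under that assumption:

# Per-sample throughput = qps * batch, assuming one query = one full batch
for row in "1 1089.81" "2 921.662" "4 734.285" "8 480.935" "32 149.114"; do
  set -- $row
  awk -v b="$1" -v q="$2" 'BEGIN { printf "batch %2d: %8.1f samples/s\n", b, q * b }'
done

Under that reading the per-sample numbers would actually increase with batch size (about 1090 samples/s at batch 1 vs. roughly 4770 at batch 32), so I want to make sure I'm not misreading the qps column.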
