
Inference Throughput vs. Batch Size #1593

@quic-nmorillo

Description

I'm running a BERT-base ONNX model in the TensorRT Docker container. The model has a dynamic batch size.

docker pull nvcr.io/nvidia/tensorrt:21.09-py3
docker run --gpus 1 -it --rm -v $PWD:/workspace nvcr.io/nvidia/tensorrt:21.09-py3

Run trtexec:

trtexec --onnx=./data/bert/bert-base-128.onnx --useCudaGraph --iterations=1000 --workspace=10000 --fp16 --optShapes=input_mask:1x128,segment_ids:1x128,input_ids:1x128 --verbose
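For reference, a single engine can also be built to cover the whole batch range and then be benchmarked at a specific batch size, roughly like the sketch below; the min/opt/max values here are only an illustrative choice, not exactly what produced the numbers further down:

trtexec --onnx=./data/bert/bert-base-128.onnx --fp16 --workspace=10000 \
    --minShapes=input_ids:1x128,input_mask:1x128,segment_ids:1x128 \
    --optShapes=input_ids:8x128,input_mask:8x128,segment_ids:8x128 \
    --maxShapes=input_ids:32x128,input_mask:32x128,segment_ids:32x128 \
    --shapes=input_ids:32x128,input_mask:32x128,segment_ids:32x128 \
    --useCudaGraph --iterations=1000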

Here is my table of throughput and latency measurements:

bert-base:

| Input shape | Throughput (qps) | Avg. E2E latency (ms) |
|-------------|------------------|-----------------------|
| 1x128       | 1089.81          | 1.75901               |
| 2x128       | 921.662          | 2.09433               |
| 4x128       | 734.285          | 2.64538               |
| 8x128       | 480.935          | 4.04696               |
| 32x128      | 149.114          | 13.1884               |

Question: How should I interpret these results? Why does throughput decrease as the batch size increases?
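My working assumption, which I'd like confirmed, is that the reported qps counts each batched inference as one query, so per-sample throughput would be qps × batch size. A quick sketch of that conversion using the numbers from the table above, under that assumption:

# Per-sample throughput = qps * batch, assuming one query = one full batch
for row in "1 1089.81" "2 921.662" "4 734.285" "8 480.935" "32 149.114"; do
  set -- $row
  awk -v b="$1" -v q="$2" 'BEGIN { printf "batch %2d: %8.1f samples/s\n", b, q * b }'
done

Under that reading the per-sample numbers would actually increase with batch size (about 1090 samples/s at batch 1 vs. roughly 4770 at batch 32), so I want to make sure I'm not misreading the qps column.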
