Marginal Improvement Between INT8 and FP16

I have quantized a BERT model for binary text classification and am only getting a marginal improvement in speed over FP16.

Tested on both an A4000 and A100 GPU.

A4000: TensorRT INT8: 34.48 ms, TensorRT FP16: 38.72 ms
A100: TensorRT INT8: 11.53 ms, TensorRT FP16: 11.75 ms
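
To put the gap in relative terms, here is a minimal sketch that turns the latencies above into a speedup factor and a percentage latency reduction. The dictionary layout and the helper names (`speedup`, `latency_reduction_pct`) are only for illustration and are not part of transformer-deploy or TensorRT.

```python
# Latencies in milliseconds, copied from the measurements above.
latencies_ms = {
    "A4000": {"fp16": 38.72, "int8": 34.48},
    "A100": {"fp16": 11.75, "int8": 11.53},
}


def speedup(fp16_ms: float, int8_ms: float) -> float:
    """How many times faster INT8 is than FP16 (1.0 means no gain)."""
    return fp16_ms / int8_ms


def latency_reduction_pct(fp16_ms: float, int8_ms: float) -> float:
    """Percentage of FP16 latency saved by switching to INT8."""
    return 100.0 * (fp16_ms - int8_ms) / fp16_ms


for gpu, t in latencies_ms.items():
    s = speedup(t["fp16"], t["int8"])
    r = latency_reduction_pct(t["fp16"], t["int8"])
    print(f"{gpu}: {s:.2f}x speedup, {r:.1f}% lower latency")
```

That works out to roughly a 1.12x speedup on the A4000 and about 1.02x on the A100.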

These are the quantizers that were disabled:

disable bert.encoder.layer.1.intermediate.dense._input_quantizer
disable bert.encoder.layer.2.attention.output.layernorm_quantizer_0
disable bert.encoder.layer.2.attention.output.layernorm_quantizer_1
disable bert.encoder.layer.2.output.layernorm_quantizer_0
disable bert.encoder.layer.2.output.layernorm_quantizer_1
disable bert.encoder.layer.3.attention.output.dense._input_quantizer
disable bert.encoder.layer.10.attention.self.key._input_quantizer
disable bert.encoder.layer.11.attention.output.dense._input_quantizer
disable bert.encoder.layer.11.output.dense._input_quantizer

The debug logs from the A4000 run are attached here: 

[trt_logs_int8_quantization.txt](https://github.com/ELS-RD/transformer-deploy/files/11123739/trt_logs_int8_quantization.txt)

Also, it looks like there is no option to quantize the embeddings. Is there a particular reason not to quantize them?

Any insight into these results is greatly appreciated. Thanks.

Versions:
Python: 3.10.9
transformer-deploy: 0.5.4
TensorRT: 8.4.1.5
ONNX Runtime (GPU): 1.12.0
CUDA: 11.7
