I have quantized a BERT model for binary text classification and am only getting a marginal improvement in speed over FP16.
Tested on both an A4000 and A100 GPU.
A4000: TensorRT INT8 34.48 ms, TensorRT FP16 38.72 ms
A100: TensorRT INT8 11.53 ms, TensorRT FP16 11.75 ms
These are the quantizers that were disabled:
disable bert.encoder.layer.1.intermediate.dense._input_quantizer
disable bert.encoder.layer.2.attention.output.layernorm_quantizer_0
disable bert.encoder.layer.2.attention.output.layernorm_quantizer_1
disable bert.encoder.layer.2.output.layernorm_quantizer_0
disable bert.encoder.layer.2.output.layernorm_quantizer_1
disable bert.encoder.layer.3.attention.output.dense._input_quantizer
disable bert.encoder.layer.10.attention.self.key._input_quantizer
disable bert.encoder.layer.11.attention.output.dense._input_quantizer
disable bert.encoder.layer.11.output.dense._input_quantizer
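To make it easier to see which encoder layers are affected, here is a small sketch that groups the disabled quantizers above by layer index. It only parses the names as plain strings; `group_by_layer` is a hypothetical helper, not something provided by the quantization tooling.

```python
import re
from collections import defaultdict

# Quantizer names copied from the list above (without the "disable" prefix).
disabled = [
    "bert.encoder.layer.1.intermediate.dense._input_quantizer",
    "bert.encoder.layer.2.attention.output.layernorm_quantizer_0",
    "bert.encoder.layer.2.attention.output.layernorm_quantizer_1",
    "bert.encoder.layer.2.output.layernorm_quantizer_0",
    "bert.encoder.layer.2.output.layernorm_quantizer_1",
    "bert.encoder.layer.3.attention.output.dense._input_quantizer",
    "bert.encoder.layer.10.attention.self.key._input_quantizer",
    "bert.encoder.layer.11.attention.output.dense._input_quantizer",
    "bert.encoder.layer.11.output.dense._input_quantizer",
]


def group_by_layer(names):
    """Map encoder layer index -> list of quantizer paths inside that layer."""
    groups = defaultdict(list)
    for name in names:
        match = re.search(r"encoder\.layer\.(\d+)\.(.+)", name)
        if match:
            groups[int(match.group(1))].append(match.group(2))
    # Sort numerically so layer 10 comes after layer 3, not after layer 1.
    return dict(sorted(groups.items()))


groups = group_by_layer(disabled)
for layer, quantizers in groups.items():
    print(f"layer {layer}: {len(quantizers)} disabled -> {quantizers}")
```

So the 9 disabled quantizers are spread over 5 of the encoder layers, with layer 2 accounting for 4 of them.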
The debug logs from the A4000 run are attached here:
trt_logs_int8_quantization.txt
Also, it looks like there is no option to quantize the embeddings. Is there a particular reason not to quantize them?
Any insight into these results is greatly appreciated. Thanks.
Versions:
Python: 3.10.9
transformer-deploy: 0.5.4
TensorRT: 8.4.1.5
ONNX Runtime (GPU): 1.12.0
CUDA: 11.7