
Marginal Improvement Between INT8 and FP16 #168


Description

@alexriggio

I have quantized a BERT model for binary text classification with INT8 and am seeing only a marginal speed improvement over FP16.
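
For context, a simplified sketch of the kind of post-training calibration flow involved is below, using NVIDIA's pytorch-quantization (which transformer-deploy builds on). The checkpoint name and the dummy calibration tensors are placeholders, not my actual data or script:

```python
import torch
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import quant_modules
from transformers import AutoModelForSequenceClassification

# Patch torch.nn layers (Linear, etc.) with quantized counterparts
# before the model is instantiated.
quant_modules.initialize()

# Hypothetical checkpoint; the real fine-tuned model differs.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
).cuda().eval()

# 1) Put every TensorQuantizer into calibration mode.
for module in model.modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        module.disable_quant()
        module.enable_calib()

# 2) Run representative data to collect activation statistics
#    (dummy tensors here; real calibration uses a dataloader).
dummy = torch.ones((8, 128), dtype=torch.long, device="cuda")
with torch.no_grad():
    for _ in range(16):
        model(input_ids=dummy, attention_mask=torch.ones_like(dummy))

# 3) Compute amax/scales and switch back to (fake-)quantized inference.
for module in model.modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        module.load_calib_amax()
        module.enable_quant()
        module.disable_calib()
```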

Tested on both an A4000 and an A100 GPU.

A4000: TensorRT INT8 34.48 ms vs. TensorRT FP16 38.72 ms
A100: TensorRT INT8 11.53 ms vs. TensorRT FP16 11.75 ms
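
For reference, a minimal sketch of how per-inference latency can be measured with ONNX Runtime on GPU is below; the file name, input names, and shapes are placeholders, and this is only meant to show the kind of warm-up plus timing loop behind numbers like the ones above, not the exact benchmark:

```python
import time

import numpy as np
import onnxruntime as ort

# "model.onnx" and the input names/shapes below are placeholders;
# adjust them to match the actual export.
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider"],
)

batch, seq_len = 1, 128  # example shape, not necessarily what produced the numbers above
feeds = {
    "input_ids": np.ones((batch, seq_len), dtype=np.int64),
    "attention_mask": np.ones((batch, seq_len), dtype=np.int64),
    "token_type_ids": np.zeros((batch, seq_len), dtype=np.int64),
}

# Warm-up so one-time setup and autotuning do not pollute the measurement.
for _ in range(30):
    session.run(None, feeds)

latencies = []
for _ in range(100):
    start = time.perf_counter()
    session.run(None, feeds)
    latencies.append((time.perf_counter() - start) * 1000)

print(f"median latency: {np.median(latencies):.2f} ms")
```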

These are the quantizers that were disabled (a sketch of how they can be disabled programmatically follows the list):

disable bert.encoder.layer.1.intermediate.dense._input_quantizer
disable bert.encoder.layer.2.attention.output.layernorm_quantizer_0
disable bert.encoder.layer.2.attention.output.layernorm_quantizer_1
disable bert.encoder.layer.2.output.layernorm_quantizer_0
disable bert.encoder.layer.2.output.layernorm_quantizer_1
disable bert.encoder.layer.3.attention.output.dense._input_quantizer
disable bert.encoder.layer.10.attention.self.key._input_quantizer
disable bert.encoder.layer.11.attention.output.dense._input_quantizer
disable bert.encoder.layer.11.output.dense._input_quantizer
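
For reference, individual quantizers can be disabled by name along these lines with NVIDIA's pytorch-quantization (assuming that toolkit; `disable_quantizers` is just an illustrative helper and the name set below is an abbreviated copy of the list above):

```python
from pytorch_quantization.nn import TensorQuantizer

# Abbreviated set of names copied from the list above.
DISABLED_QUANTIZERS = {
    "bert.encoder.layer.1.intermediate.dense._input_quantizer",
    "bert.encoder.layer.3.attention.output.dense._input_quantizer",
    "bert.encoder.layer.10.attention.self.key._input_quantizer",
    "bert.encoder.layer.11.attention.output.dense._input_quantizer",
    "bert.encoder.layer.11.output.dense._input_quantizer",
}

def disable_quantizers(model, names):
    """Turn off fake quantization for the named TensorQuantizer modules."""
    for name, module in model.named_modules():
        if isinstance(module, TensorQuantizer) and name in names:
            module.disable()  # this tensor stays in FP16/FP32
            print(f"disable {name}")
```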

The debug logs from the A4000 run are attached here:

trt_logs_int8_quantization.txt

Also, it looks like there is no option to quantize the embeddings. Is there a particular reason not to quantize them?

Any insight into these results is greatly appreciated. Thanks.

Versions:
Python: 3.10.9
transformer-deploy: 0.5.4
TensorRT: 8.4.1.5
ONNX Runtime (GPU): 1.12.0
CUDA: 11.7
