Description
I have a model that I want to statically quantize for all the benefits that brings. However, the quality suffered when I did so. First I excluded sensitive layers, but that wasn't enough. Then it was suggested that I keep float input/output and exclude the first and last x layers.
I have been trying to do this, but I cannot get rid of a Quantize op at the end of the graph, which then causes a crash at runtime when allocating output buffers.
My recipe excludes all of the final layers (plus more, not shown):
```python
# Exclude output/final layers (exact tensor names)
rp_manager.add_quantization_config(
    regex='.*Linear_projector;1',
    operation_name=qtyping.TFLOperationName.ALL_SUPPORTED,
    algorithm_key=algorithm_manager.AlgorithmName.NO_QUANTIZE,
)
rp_manager.add_quantization_config(
    regex='.*WavLMForSequenceClassification;1',
    operation_name=qtyping.TFLOperationName.ALL_SUPPORTED,
    algorithm_key=algorithm_manager.AlgorithmName.NO_QUANTIZE,
)
rp_manager.add_quantization_config(
    regex='.StatefulPartitionedCall.',
    operation_name=qtyping.TFLOperationName.ALL_SUPPORTED,
    algorithm_key=algorithm_manager.AlgorithmName.NO_QUANTIZE,
)
rp_manager.add_quantization_config(
    regex='.logits.',
    operation_name=qtyping.TFLOperationName.ALL_SUPPORTED,
    algorithm_key=algorithm_manager.AlgorithmName.NO_QUANTIZE,
)
```
But somehow this Quantize op is still there. What am I missing?
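As a sanity check, it may be worth confirming that these patterns actually match the tensor names in the model, and whether the quantizer matches them with `re.search` or `re.fullmatch` (an assumption worth verifying against the library's source). A leading or trailing `.` in a pattern requires one arbitrary character in that position, so `'.logits.'` does not match a tensor named exactly `logits`. A minimal sketch with hypothetical tensor names:

```python
import re

# Hypothetical tensor names, for illustration only -- substitute the
# real names dumped from your model.
names = [
    'StatefulPartitionedCall:0',
    'model/WavLMForSequenceClassification;1',
    'logits',
]

patterns = [
    '.StatefulPartitionedCall.',
    '.logits.',
    '.*WavLMForSequenceClassification;1',
]

for pattern in patterns:
    for name in names:
        # re.search finds the pattern anywhere inside the name;
        # re.fullmatch requires the pattern to cover the whole name.
        print(f'{pattern!r} vs {name!r}: '
              f'search={bool(re.search(pattern, name))} '
              f'fullmatch={bool(re.fullmatch(pattern, name))}')
```

Note that `'.StatefulPartitionedCall.'` fails even `re.search` against `'StatefulPartitionedCall:0'`, because the leading `.` demands a character before the literal text and the name starts with it.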
Before all my exclusions, my recipe starts with this:
```python
# First: Quantize ALL supported ops to static int8 by default
rp_manager.add_static_config(
    regex='.*',
    operation_name=qtyping.TFLOperationName.ALL_SUPPORTED,
    activation_num_bits=8,
    weight_num_bits=8,
    algorithm_key=algorithm_manager.AlgorithmName.MIN_MAX_UNIFORM_QUANT,
)
```
What am I doing wrong, or is my aim not possible?