PositTune Final Report

Yimin Gao, Zhenghong Chen, Shan He


Introduction

PositTune is a project that explores the design space of mixed-configuration posit quantization for ML models, using the QPyTorch+ (qtorch_plus) implementation as our platform. Since posit is a relatively new concept, this project does not include a system-level hardware implementation; instead, we hope that promising results on mixed-configuration posit quantization will motivate and justify the development of a reconfigurable/versatile posit arithmetic unit in future work. A GPU is needed to speed up the quantization process (converting values from floating point to posit) or to train a model entirely in posit; an NVIDIA 2080 Super GPU was used for this project.

QPyTorch+ is a PyTorch-based implementation of posit quantization, so we can simulate the accuracy of a posit-quantized ML model without needing real posit hardware. The numbers (weights/activations) are still stored as floating point in hardware after quantization, but they take on the same (or very similar) values that posit arithmetic would produce according to posit's encoding. One limitation of this emulation is that a format such as posit<32,0> cannot be faithfully simulated, since it offers better accuracy than 32-bit floating point near 1. However, since we only use low-bitwidth posit quantization such as 8-bit or 4-bit, this is not a concern for this project.

Posit is a dynamic numerical representation that offers its best accuracy near 1 (2^0) and a wider range than floating point of the same bitwidth. As shown in the figure below, floating point maintains roughly the same level of accuracy across a wide range of numbers. This uniformity is usually redundant for applications such as ML inference, where most of the weights/activations lie in a relatively small region. Posit, on the other hand, offers a more efficient design: better accuracy around 1 and a wider overall range than floating point. There is a sweet spot around 1 where posit outperforms floating point in accuracy, and this is usually the region we care about most for ML applications. But what if the weights/activations of certain ML models fall outside this region? Our solution is to dynamically shift the weights of each layer into the region where posit is most accurate.
Accuracy distribution of various data formats
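For reference, the standard decoding of a posit<n, es> value (general posit background, not specific to this project) is:

```latex
x = (-1)^{s} \cdot \mathrm{useed}^{\,k} \cdot 2^{e} \cdot (1 + f),
\qquad \mathrm{useed} = 2^{2^{es}}
```

where s is the sign bit, k is the regime value, e is the exponent, and f is the fraction. Because the regime field has variable length, values near 1 keep more fraction bits (higher accuracy), while extreme values trade fraction bits for range.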

As an example, the first figure below shows the weight distribution of the first layer of the GPT2 model. The center bin of the distribution sits at 2^(-2.72). To benefit the most from posit arithmetic, we can shift this distribution (on a log2 scale) to the right so that it is centered around 1, where posit is most accurate, as shown in the second figure.

Distribution of the first layer in GPT2 model

Distribution of the first layer in GPT2 model after adaptive scaling


PositTune Tutorial

Here is the code we used to benchmark three models with layer-wise adaptive scaling during posit quantization:

While the detailed implementation of these posit quantization methods can be found in the original QPyTorch paper, the same quantization method is applied in all three benchmarks: the weights and activations of each layer of the selected model are quantized to posit arithmetic using a predefined quantization function from the qtorch_plus library, posit_quantize(). The function allows you to choose the total bitwidth (nsize), exponent bitwidth (es), and scaling factor (scale) for the quantization process. In each of the three benchmark notebooks, there is a step where the program iterates through all the linear layers of the selected ML model and quantizes the weights and activations using the defined linear_weight and forward-pre-hook functions (the latter registered with register_forward_pre_hook).
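As an illustration, below is a minimal sketch of this step. The argument names follow the qtorch_plus description above, and the stand-in model and helper bodies are assumptions; the exact code is in the linked notebooks.

```python
import torch
import torch.nn as nn
from qtorch_plus.quant import posit_quantize

# Stand-in network; in the notebooks this is GPT2, YOLOv5s, or ALBERT.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 10))

def linear_weight(module, nsize=8, es=1, scale=1.0):
    # Replace the stored FP32 weights with their posit-quantized values.
    module.weight.data = posit_quantize(module.weight.data, nsize=nsize, es=es, scale=scale)

def forward_pre_hook(module, inputs):
    # Quantize the activations entering the layer on every forward pass.
    return tuple(posit_quantize(x, nsize=8, es=1, scale=1.0) for x in inputs)

for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        linear_weight(module)                                  # quantize weights once
        module.register_forward_pre_hook(forward_pre_hook)     # quantize activations at runtime
```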

This is where we made our changes to apply layer-wise adaptive scaling to the weights during posit quantization. Taking GPT2 as an example, below is the revised version of the code, where we first find the center bin of the weight distribution for each layer (`x_with_max_frequency`) and apply the corresponding scaling factor (2^(-x_with_max_frequency)) in the posit quantization process.

Adaptive scaling in the code
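A minimal sketch of that adaptive-scaling step is shown below. The figure above contains the actual notebook code; the histogram bin count and the helper name here are illustrative.

```python
import torch
from qtorch_plus.quant import posit_quantize

def adaptive_posit_quantize(weight, nsize=8, es=1, bins=100):
    # Histogram the weight magnitudes on a log2 scale and find the center of the
    # most populated bin (x_with_max_frequency), e.g. about -2.72 for GPT2's first layer.
    log2_w = torch.log2(weight.abs()[weight != 0])
    hist = torch.histc(log2_w, bins=bins)
    edges = torch.linspace(log2_w.min().item(), log2_w.max().item(), steps=bins + 1)
    idx = torch.argmax(hist)
    x_with_max_frequency = (edges[idx] + edges[idx + 1]).item() / 2

    # Scaling by 2^(-x_with_max_frequency) re-centers the distribution around 2^0,
    # where posit offers its best accuracy.
    scale = 2.0 ** (-x_with_max_frequency)
    return posit_quantize(weight, nsize=nsize, es=es, scale=scale)
```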


Evaluation Results

GPT2 (Generative Pre-trained Transformer 2)

GPT2 is a large language model from OpenAI that generates coherent text by predicting the next word in a sequence. The Wikitext dataset is used to evaluate the perplexity of the quantized model, and detailed perplexity results for the various data types can be found in the table below. As shown in the table, even without adaptive scaling, posit(8,1) already outperforms 8-bit floating point in accuracy. With layer-wise adaptive scaling, the perplexity of the quantized model improves (i.e., decreases) for every posit configuration. The improvement is most significant for configurations with a narrow accuracy distribution, such as posit(6,0), because the scaling aligns the data with the region where such a narrow posit configuration is most accurate.
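For context, a rough sketch of how such a Wikitext perplexity measurement can be set up is shown below. The dataset config, window size, and model loading here are assumptions; the exact setup is in the GPT2 notebook.

```python
import torch
from datasets import load_dataset
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
# (apply the posit weight/activation quantization from the tutorial section here)

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids.to(device)

window, nll_sum, n_tokens = 1024, 0.0, 0
with torch.no_grad():
    for start in range(0, ids.size(1), window):
        chunk = ids[:, start:start + window]
        if chunk.size(1) < 2:
            break
        loss = model(chunk, labels=chunk).loss        # mean NLL over chunk.size(1) - 1 targets
        nll_sum += loss.item() * (chunk.size(1) - 1)
        n_tokens += chunk.size(1) - 1

print(f"perplexity: {torch.exp(torch.tensor(nll_sum / n_tokens)).item():.2f}")
```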

YOLO (You Only Look Once)

YOLOv5s is a deep neural network for real-time object detection, widely recognized in computer vision for its speed and efficiency: it detects objects in images or videos while predicting both their class and location. We chose the v5s variant, which is only 14 MB, making it suitable for edge computing devices. Generally speaking, as model size increases, the latency for processing each image also increases, but average precision (AP) on Common Objects in Context (COCO) improves as well. YOLOv5s is designed specifically for speed and low resource consumption: it consists of 213 layers and approximately 7.2 million parameters, and with a computational complexity of only 16.4 GFLOPs it strikes a good balance between accuracy and speed in many real-world detection tasks, making it highly suitable for edge computing and real-time applications.

In this project, we determined confidence values by comparing the results of running YOLOv5s against those of the larger YOLOv5x model and against 32-bit floating-point computation; these confidence values serve as a direct indicator of the model's accuracy. We also compared the results of processing a single image under different quantization schemes. The model originally uses 16-bit floating-point precision, which we quantized into two different posit schemes: 8-bit and 6-bit. By first adjusting the scaling manually and then employing adaptive scaling, the algorithm automatically determined the optimal scale value, which improved the results and refined the conclusions.
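A hedged sketch of how such a comparison can be run is shown below. Loading via torch.hub and the posit(8,1) configuration are assumptions here; the notebook contains the exact pipeline.

```python
import torch
import torch.nn as nn
from qtorch_plus.quant import posit_quantize

# Load a pretrained YOLOv5s and quantize its convolution/linear weights to posit(8,1).
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
with torch.no_grad():
    for _, module in model.named_modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            module.weight.data = posit_quantize(module.weight.data, nsize=8, es=1, scale=1.0)

# Run detection on one image; per-box confidences (column 4 of results.xyxy[0]) can then
# be compared against the FP32 baseline and the larger YOLOv5x model.
results = model("https://ultralytics.com/images/zidane.jpg")
results.print()
```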

ALBERT (A Lite BERT)

ALBERT is a smaller, faster version of BERT that reduces model size and improves efficiency by sharing parameters across layers and using factorized embedding techniques. The task we use to evaluate the quantization performance of ALBERT is answering arbitrary questions from a given context. The result is reported as an F1 score, which measures the overlap between the predicted and ground-truth answers.
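For reference, this token-overlap F1 follows the standard SQuAD-style definition:

```latex
\mathrm{precision} = \frac{|\mathrm{pred} \cap \mathrm{gold}|}{|\mathrm{pred}|}, \qquad
\mathrm{recall} = \frac{|\mathrm{pred} \cap \mathrm{gold}|}{|\mathrm{gold}|}, \qquad
F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
```

where pred and gold are the tokens of the predicted and ground-truth answers, with overlap counted token by token.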

We quantized the linear weights with 32-bit floating point, 8-bit floating point, posit<6,1> without scaling, and posit<6,1> with the adaptive scaling method. As shown in the following table, the highest F1 score comes from 32-bit floating point; the trade-off is that more resources (32-bit operations) are needed. When we quantize with 8-bit floating point, the overall F1 score drops slightly from 86.19% to 85.36%, and the F1 score for questions with answers drops from 86.32% to 82.86%. The F1 score for posit<6,1> without scaling drops further. However, when adaptive scaling is applied to posit<6,1>, the F1 score exceeds that of 8-bit floating point quantization. In conclusion, posit quantization with adaptive scaling achieves strong performance even at a small bitwidth.


Conclusion and Lessons Learned

Throughout this project, we have learned that posit arithmetic, even without adaptive scaling, offers a highly promising alternative for edge inference that balances efficiency and precision. Moreover, incorporating layer-wise adaptive scaling in posit inference enables achieving the same accuracy with reduced bitwidth or higher accuracy at the same bitwidth. We hope this motivates further research into developing a versatile posit arithmetic unit with runtime-configurable scaling to further enhance energy efficiency in edge AI training and inference.


Milestone Report on 11/26:

PositTune Milestone

Final Report:

Final Report

Here are the Jupyter notebooks we used to benchmark three models with layer-wise adaptive scaling during posit quantization:

All the code we used for this project
