# Fast Arbitrary Precision Floating Point on FPGA

A detailed description of the approach implemented in this repository can be
found in our [FCCM'22
paper](https://spcl.inf.ethz.ch/Publications/.pdf/apfp.pdf) [1].

## Introduction

This repository implements an arbitrary precision floating point multiplier and
adder using Vitis HLS targeting XRT-enabled Xilinx FPGAs, exposing them through
a matrix multiplication primitive that allows running them at full throughput
without becoming memory bound. The design is _fully pipelined_, yielding a MAC
throughput equivalent to the frequency times the number of compute units
instantiated.
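For example, a hypothetical instantiation with four compute units clocked at
300 MHz would sustain 4 × 300 MHz = 1.2 GMAC/s (numbers chosen purely for
illustration).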

Instantiations of the design on an Alveo U250 accelerator were shown to yield
2.0 GMAC/s of 512-bit matrix-matrix multiplication, an order of magnitude
higher than a 36-core dual-socket Xeon node, corresponding to 375 CPU cores'
worth of throughput [1].

## Configuration

The hardware design is configured using CMake. The target Xilinx XRT-enabled
platform must be specified with the `APFP_PLATFORM` parameter. The most
important configuration parameters are listed below (an example invocation
combining them follows the list):
- The width used for the floating point representation is fixed at compile time
  using the `APFP_BITS` CMake parameter, out of which 63 bits will be used for
  the exponent, 1 bit will be used for the sign, and the remaining bits will be
  used for the mantissa. The value is currently expected to be a multiple of 512
  to stay aligned to the memory interface width.
- To scale the design beyond a single pipelined multiplier, the
  `APFP_COMPUTE_UNITS` parameter can be used to replicate the full kernel. Each
  instantiation will run a fully independent matrix multiplication unit. These
  can be used to collaborate on a single matrix multiplication operation (see
  `host/TestMatrixMultiplication.cpp` for an example).
- The floating point multiplier uses Karatsuba decomposition to reduce the
  overall resource usage of the design. The decomposition bottoms out at
  `APFP_MULT_BASE_BITS`, after which it falls back on naive multiplication using
  DSPs as generated by the HLS tool. Similarly, `APFP_ADD_BASE_BITS` configures
  the number of bits to dispatch to the HLS tool's addition implementation;
  above this threshold, the addition is manually pipelined into multiple stages.
- To avoid being memory bound, the matrix multiplication implementation is
  tiled using the approach described in our [FPGA'20
  paper](https://spcl.inf.ethz.ch/Publications/.pdf/gemm-fpga.pdf) [2]. The
  tile sizes are exposed through the `APFP_TILE_SIZE_N` and `APFP_TILE_SIZE_M`
  parameters. The highest arithmetic intensity is achieved when these two
  quantities are equal and maximized, but relatively small tile sizes (e.g.,
  32x32) are sufficient to overcome the memory bottleneck. Larger tile sizes
  increase arithmetic intensity at the cost of BRAM usage, and add potential
  overhead when the input matrix size is not a multiple of the tile size.
- `APFP_FREQUENCY` can be used to change the maximum frequency targeted by the
  design. If unspecified, the default of the target platform will be used.
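
As an illustration, a configure step combining these parameters might look as
follows. All values, including the platform string, are placeholders and should
be adjusted to the installed XRT platform and the desired precision:

```bash
# Hypothetical configuration: 1024-bit operands, two compute units, 32x32 tiles.
# Replace the platform string with the XRT platform installed on your system.
cmake .. \
  -DAPFP_PLATFORM=xilinx_u250_gen3x16_xdma_4_1_202210_1 \
  -DAPFP_BITS=1024 \
  -DAPFP_COMPUTE_UNITS=2 \
  -DAPFP_TILE_SIZE_N=32 \
  -DAPFP_TILE_SIZE_M=32
```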

For more details on how to configure the project to achieve high throughput,
see our paper [1].

## Compilation

The minimum commands necessary to configure and build the code are:

```bash
mkdir build
cd build
cmake .. # Default parameters
make # Builds software components
make hw # Builds hardware accelerator
```

However, the accelerator should always be configured to match the target system
using the parameters described in the previous section and in our paper [1].
The CMake configuration flow uses
[hlslib](https://github.com/definelicht/hlslib) [3] to locate the Xilinx tools
and expose hardware build targets.

The project requires Vitis, GMP, and MPFR to configure successfully.
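
On Debian-based systems, the GMP and MPFR development packages can typically be
installed from the distribution repositories (the package names below assume
Debian/Ubuntu; Vitis and XRT must be installed separately from Xilinx):

```bash
# Debian/Ubuntu development packages for GMP and MPFR
sudo apt-get install libgmp-dev libmpfr-dev
```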

## Running the code

We provide example host code that runs the matrix multiplication accelerator
on a randomized input in `host/TestMatrixMultiplication.cpp`. See the
executable for usage. An example invocation could be:

```bash
./TestMatrixMultiplicationHardware hw 256 256 256
```

## Installation

To install the project, including both the software interface components and
the hardware accelerator itself (built with `make hw`), simply run
`make install`. The installation location is configured with the
`CMAKE_INSTALL_PREFIX` parameter.
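
A minimal sketch of the full install flow, assuming an illustrative prefix and
platform (substitute your own values):

```bash
# Configure with the desired parameters (see the Configuration section),
# then build and install. Prefix and platform are placeholders.
cmake .. -DAPFP_PLATFORM=<your_xrt_platform> -DCMAKE_INSTALL_PREFIX=/opt/apfp
make          # Software components
make hw       # Hardware accelerator (long-running synthesis)
make install  # Installs to the configured prefix
```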

## References

[1] Johannes de Fine Licht, Christopher A. Pattison, Alexandros Nikolaos
Ziogas, David Simmons-Duffin, and Torsten Hoefler, _"Fast Arbitrary Precision
Floating Point on FPGA"_, in Proceedings of the 2022 IEEE 30th Annual
International Symposium on Field-Programmable Custom Computing Machines
(FCCM'22). [🔗](https://spcl.inf.ethz.ch/Publications/.pdf/apfp.pdf)

[2] Johannes de Fine Licht, Grzegorz Kwasniewski, and Torsten Hoefler,
_"Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level
Synthesis"_, in Proceedings of the 28th ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays (FPGA'20).
[🔗](https://spcl.inf.ethz.ch/Publications/.pdf/gemm-fpga.pdf)

[3] Johannes de Fine Licht and Torsten Hoefler, _"hlslib: Software Engineering
for Hardware Design"_, presented at the Fifth International Workshop on
Heterogeneous High-Performance Reconfigurable Computing (H2RC'19).
[🔗](https://spcl.inf.ethz.ch/Publications/.pdf/hlslib.pdf)