Commit 43d8a88 (parent 8c5d03f) — Added README

1 file changed: README.md (+112, −0)

# Fast Arbitrary Precision Floating Point on FPGA

A detailed description of the approach implemented in this repository can be found in our [FCCM'22 paper](https://spcl.inf.ethz.ch/Publications/.pdf/apfp.pdf) [1].

## Introduction

This repository implements an arbitrary precision floating point multiplier and adder using Vitis HLS targeting XRT-enabled Xilinx FPGAs, exposing them through a matrix multiplication primitive that allows running them at full throughput without becoming memory bound. The design is _fully pipelined_, yielding a MAC throughput equivalent to the frequency times the number of compute units instantiated.
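As a purely illustrative calculation (the numbers are hypothetical, not measured figures from the repository), a build clocked at 300 MHz with 4 compute units would correspond to 0.3 GHz × 4 = 1.2 GMAC/s.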

Instantiations of the design on an Alveo U250 accelerator were shown to yield 2.0 GMAC/s of 512-bit matrix-matrix multiplication, an order of magnitude higher than a 36-core dual-socket Xeon node and corresponding to the throughput of 375 CPU cores [1].

## Configuration

The hardware design is configured using CMake. The target Xilinx XRT-enabled platform must be specified with the `APFP_PLATFORM` parameter. The most important configuration parameters include (see the example invocation after this list):

- The width used for the floating point representation is fixed at compile time using the `APFP_BITS` CMake parameter, out of which 63 bits will be used for the exponent, 1 bit will be used for the sign, and the remaining bits will be used for the mantissa. The value is currently expected to be a multiple of 512 for the sake of being aligned to the memory interface width.
- To scale the design beyond a single pipelined multiplier, the `APFP_COMPUTE_UNITS` parameter can be used to replicate the full kernel. Each instantiation will run a fully independent matrix multiplication unit. These can be used to collaborate on a single matrix multiplication operation (see `host/TestMatrixMultiplication.cpp` for an example).
- The floating point multiplier uses Karatsuba decomposition to reduce the overall resource usage of the design. The decomposition bottoms out at `APFP_MULT_BASE_BITS`, after which it falls back on naive multiplication using DSPs as generated by the HLS tool. Similarly, the `APFP_ADD_BASE_BITS` parameter configures the number of bits to dispatch to the HLS tool's addition implementation; above this threshold, the addition is manually pipelined into multiple stages.
- To avoid being memory bound, the matrix multiplication implementation is tiled using the approach described in our [FPGA'20 paper](https://spcl.inf.ethz.ch/Publications/.pdf/gemm-fpga.pdf) [2]. The tile sizes are exposed through the `APFP_TILE_SIZE_N` and `APFP_TILE_SIZE_M` parameters. The highest arithmetic intensity is achieved when these two quantities are equal and maximized, but relatively small tile sizes (e.g., 32x32) are sufficient to overcome the memory bottleneck. Larger tile sizes increase arithmetic intensity at the cost of BRAM usage and potential overhead when the input matrix dimensions are not a multiple of the tile size.
- `APFP_FREQUENCY` can be used to change the maximum frequency targeted by the design. If unspecified, the default of the target platform will be used.
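For instance, with `APFP_BITS=512`, 63 bits hold the exponent, 1 bit holds the sign, and the remaining 448 bits hold the mantissa. The sketch below shows how a configure step might combine the parameters above; the platform string and values are placeholders rather than recommendations from this repository, and must be adapted to your system:

```bash
# Hypothetical configuration: substitute your own XRT platform name and parameter values.
cmake .. \
  -DAPFP_PLATFORM=xilinx_u250_gen3x16_xdma_4_1_202210_1 \
  -DAPFP_BITS=512 \
  -DAPFP_COMPUTE_UNITS=1 \
  -DAPFP_TILE_SIZE_N=32 \
  -DAPFP_TILE_SIZE_M=32
```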

For more details on how to configure the project to achieve high throughput, see our paper [1].

## Compilation

The minimum commands necessary to configure and build the code are:

```bash
mkdir build
cd build
cmake .. # Default parameters
make     # Builds software components
make hw  # Builds hardware accelerator
```

However, the accelerator should always be configured to match the target system using the parameters described in the previous section and in our paper [1]. The CMake configuration flow uses [hlslib](https://github.com/definelicht/hlslib) [3] to locate the Xilinx tools and expose hardware build targets.

The project depends on Vitis, GMP, and MPFR to successfully configure.

## Running the code

We provide an example host code that runs the matrix multiplication accelerator on a randomized input in `host/TestMatrixMultiplication.cpp`. See the executable for usage. An example invocation could be:

```bash
./TestMatrixMultiplicationHardware hw 256 256 256
```

## Installation

To install the project, including both the software interface components and the hardware accelerator itself (built with `make hw`), simply run `make install`. The installation location is configured with the `CMAKE_INSTALL_PREFIX` parameter.
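As a minimal sketch (the install prefix below is an arbitrary example, not a path mandated by the project), the full install flow might look like:

```bash
# Configure with a custom install location, then build and install.
cmake .. -DCMAKE_INSTALL_PREFIX=/opt/apfp
make          # Software components
make hw       # Hardware accelerator
make install  # Copies the built artifacts to the install prefix
```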

## References

[1] Johannes de Fine Licht, Christopher A. Pattison, Alexandros Nikolaos Ziogas, David Simmons-Duffin, Torsten Hoefler, _"Fast Arbitrary Precision Floating Point on FPGA"_, in Proceedings of the 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM'22). [🔗](https://spcl.inf.ethz.ch/Publications/.pdf/apfp.pdf)

[2] Johannes de Fine Licht, Grzegorz Kwasniewski, and Torsten Hoefler, _"Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis"_, in Proceedings of the 28th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA'20). [🔗](https://spcl.inf.ethz.ch/Publications/.pdf/gemm-fpga.pdf)

[3] Johannes de Fine Licht and Torsten Hoefler, _"hlslib: Software Engineering for Hardware Design"_, presented at the Fifth International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC'19). [🔗](https://spcl.inf.ethz.ch/Publications/.pdf/hlslib.pdf)