Welcome to the open-sourced repository for our FPGA-based MDLM (Masked Diffusion Language Model) Accelerator built within the Allo framework!
This ongoing project focuses on developing an efficient end-to-end accelerator for diffusion language models on FPGAs using the Allo framework and High-Level Synthesis (HLS). Allo is an accelerator design language (ADL) for efficient spatial accelerator design. The specific diffusion language model is based on *Simple and Effective Masked Diffusion Language Models*.
For more detailed information about the background and preliminaries of the diffusion mechanism, diffusion language model, and related hardware accelerators, please refer to the following document:
For profiling results, the roofline model, the MDLM accelerator implementation details, and suggested features for Allo, please refer to the MDLM Accelerator Document:
Before using this repository, please ensure you have the following prerequisites satisfied:
- Python and the PyTorch framework
- Toolchain: Xilinx Vitis v2022.1
- Platform: Xilinx Alveo U280
- Compiler Framework: Allo - Install here
This project includes Allo as a Git submodule. We are using Allo at commit b1f6772.
After cloning this repository, ensure you have the correct version by running:
git submodule update --init --recursive
This ensures that you are using the exact version of Allo required to reproduce this project.
MDLM/
│── Baseline/ # Auto-generated baseline implementation (Allo)
│ ├── Allo_DDitBlock.prj # Baseline HLS project
│── Optimized/ # Optimized FPGA implementation
│── allo_code/ # Allo implementation of DDitBlock and software verification
│── configs/ # MDLM Pytorch model configuration files
│── documentation/ # Project documentation
│ ├── background.md
│ ├── mdlmaccelerator.md
│── pytorch_code/ # PyTorch implementation of MDLM & DDitBlock
│── LICENSE # License information
│── readme.md # Project readme
The PyTorch model can be found under `pytorch_code/`. It contains the full MDLM model and its core component, the DDitBlock. To satisfy FPGA deployment constraints with limited on-chip resources, we use a tiny model in which the DiT block operates on tensors of shape [1024, 512].
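To make the data layout concrete, below is a minimal PyTorch sketch of a DiT-style block applied to a [1024, 512] activation (sequence length 1024, hidden dimension 512). The class and hyperparameters (`TinyDiTBlock`, `num_heads`, `mlp_ratio`) are illustrative assumptions, and the timestep conditioning (adaLN modulation) of the real block is omitted; see `pytorch_code/` for the actual DDitBlock definition.

```python
import torch
import torch.nn as nn

class TinyDiTBlock(nn.Module):
    """Illustrative DiT-style block: norm -> self-attention -> norm -> MLP."""
    def __init__(self, hidden_dim=512, num_heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden_dim)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, mlp_ratio * hidden_dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * hidden_dim, hidden_dim),
        )

    def forward(self, x):
        # x: [batch, seq_len, hidden_dim] = [1, 1024, 512]
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)   # self-attention
        x = x + attn_out                   # residual connection
        x = x + self.mlp(self.norm2(x))    # MLP with residual connection
        return x

block = TinyDiTBlock()
x = torch.randn(1, 1024, 512)   # tiny-model activation shape used in this project
print(block(x).shape)           # torch.Size([1, 1024, 512])
```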
The baseline implementation (auto-generated by Allo) is under `Baseline/`. However, it does not fit within the available on-chip storage.
We also provide the optimized version under `Optimized/`, with improvements in both latency and memory efficiency.
Clone this project and switch to the project root:
git clone https://github.com/silvenachen/FPGA-based-Accelerator-for-Diffusion-Language-Models.git
cd FPGA-based-Accelerator-for-Diffusion-Language-Models
To generate HLS code, you need Allo installed in your local environment. Then replace `./allo/library/nn.py` with `./allo_lib/nn.py` from our project, which is an updated library file with specialized DiT operators.
Next, run `python DDitBlock_Allo_Kernel.py`, which will automatically generate the HLS project. Alternatively, a pre-built project is available at `./Baseline/Allo_DDitBlock.prj`.
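For reference, HLS project generation in Allo follows the pattern sketched below. The kernel here is a placeholder GEMM (with dimensions chosen to match the tiny-model activations) standing in for the DiT operators; the actual kernel and schedule live in `DDitBlock_Allo_Kernel.py`, so treat this as a minimal sketch rather than the project's code.

```python
import allo
from allo.ir.types import float32

M, K, N = 1024, 512, 512  # placeholder dimensions matching the tiny-model activations

def gemm(A: float32[M, K], B: float32[K, N]) -> float32[M, N]:
    C: float32[M, N] = 0.0
    # Triple loop nest; Allo lowers this to an HLS kernel
    for i, j, k in allo.grid(M, N, K):
        C[i, j] += A[i, k] * B[k, j]
    return C

s = allo.customize(gemm)
# mode="csyn" runs C synthesis and produces resource/latency reports;
# mode="csim" instead builds a C-simulation executable for functional checks.
mod = s.build(target="vitis_hls", mode="csyn", project="gemm.prj")
mod()  # launches the Vitis HLS flow inside gemm.prj
```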
For the optimized version, please refer to `./Optimized/`, where you can run simulations and experiment with the kernel deployment.
For our hardware-side tests, all experiments are conducted on the AMD Alveo U280 FPGA using Vitis v2022.1, currently with a target frequency of 100 MHz. The U280 FPGA is equipped with 4032 BRAM 18K blocks, 9024 DSP slices, 2.6M flip-flops, 1.3M LUTs, and 960 URAM blocks.
We present a comparison of latency and resource usage between the baseline and optimized implementations. The table below summarizes the utilization of each resource along with the latency improvement (currently obtained from HLS synthesis; cycle-accurate results are expected from co-simulation and RTL synthesis).
| Version | Latency (ns) | BRAM | DSP | FF | LUT | URAM |
|---|---|---|---|---|---|---|
| Baseline | 5.1E9 | 36693 (910%) | 2791 (30%) | 273231 (11%) | 369770 (28%) | - |
| Optimized | 4.127E9 | 1532 (37%) | 1186 (13%) | 126044 (4%) | 170984 (13%) | 768 (80%) |
- The optimized version shows significant reductions in resource consumption and improvements in performance, particularly in BRAM and DSP usage, thanks to our memory-copy and resource-reuse techniques. For more implementation details, please refer to the MDLM Accelerator Documentation.
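To give a flavor of what such schedule-level optimizations look like in Allo (this is an illustrative sketch over the placeholder GEMM above, not the project's actual schedule), customization primitives such as loop reordering, on-chip buffering, and pipelining are applied to the kernel before building:

```python
# Continuing the placeholder GEMM from the code-generation sketch above;
# the real MDLM schedule applies analogous transformations to the DiT operators.
s = allo.customize(gemm)
s.reorder("k", "j")         # make the j loop innermost
s.buffer_at(s.C, axis="i")  # accumulate each output row in an on-chip buffer
s.pipeline("j")             # pipeline the (now innermost) j loop
mod = s.build(target="vitis_hls", mode="csyn", project="gemm_opt.prj")
```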
For detailed optimization methods and Allo feature suggestions, please check Optimization Techniques and Allo Feature Suggestions.
This project is currently in progress and is developed by Shuyang Li ([email protected]) under the guidance of Professor Zhiru Zhang and Ph.D. student Yixiao Du, during an internship at Cornell University.
For questions or collaborations, feel free to contact Shuyang via email!