Welcome to the open-sourced repository for our FPGA-based MDLM (Masked Diffusion Language Model) Accelerator built within the Allo framework!
This ongoing project focuses on developing an efficient end-to-end accelerator for diffusion language models on FPGAs using the Allo framework and High-Level Synthesis (HLS). Allo is an accelerator design language (ADL) for efficient spatial accelerator design. The specific diffusion language model is based on *Simple and Effective Masked Diffusion Language Models*.
For more detailed information about the background and preliminaries of the diffusion mechanism, diffusion language model, and related hardware accelerators, please refer to the following document:
For profiling results, the roofline model, the MDLM accelerator implementation details, and suggested features for Allo, please refer to the MDLM Accelerator Document:
Before using this repository, please ensure you have the following prerequisites satisfied:
- Python and the PyTorch framework
- Toolchain: Xilinx Vitis v2022.1
- Platform: Xilinx Alveo U280
- Compiler Framework: Allo - Install here
This project includes Allo as a Git submodule. We are using Allo at commit b1f6772.
After cloning this repository, ensure you have the correct version by running:
git submodule update --init --recursive
This ensures that you are using the exact version of Allo required to reproduce this project.
MDLM/
│── Baseline/ # Auto-generated baseline implementation (Allo)
│ ├── Allo_DDitBlock.prj # Baseline HLS project
│── Optimized/ # Optimized FPGA implementation
│── allo_code/ # Allo implementation of DDitBlock and software verification
│── configs/ # MDLM Pytorch model configuration files
│── documentation/ # Project documentation
│ ├── background.md
│ ├── mdlmaccelerator.md
│── pytorch_code/ # PyTorch implementation of MDLM & DDitBlock
│── LICENSE # License information
│── readme.md # Project readme
The PyTorch model can be found under `pytorch_code/`. It contains the full MDLM model and its core component, the DDitBlock. To satisfy FPGA deployment constraints with limited on-chip resources, we use a tiny model in which the DiT block operates on tensors of shape [1024, 512].
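To make the data layout concrete, below is a minimal PyTorch sketch of a DiT-style block applied to a [1024, 512] activation (sequence length 1024, hidden dimension 512). The class and hyperparameters (`TinyDiTBlock`, `num_heads`, `mlp_ratio`) are illustrative assumptions, and the timestep conditioning (adaLN modulation) of the real block is omitted; see `pytorch_code/` for the actual DDitBlock definition.

```python
import torch
import torch.nn as nn

class TinyDiTBlock(nn.Module):
    """Illustrative DiT-style block: norm -> self-attention -> norm -> MLP."""
    def __init__(self, hidden_dim=512, num_heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden_dim)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, mlp_ratio * hidden_dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * hidden_dim, hidden_dim),
        )

    def forward(self, x):
        # x: [batch, seq_len, hidden_dim] = [1, 1024, 512]
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)   # self-attention
        x = x + attn_out                   # residual connection
        x = x + self.mlp(self.norm2(x))    # MLP with residual connection
        return x

block = TinyDiTBlock()
x = torch.randn(1, 1024, 512)   # tiny-model activation shape used in this project
print(block(x).shape)           # torch.Size([1, 1024, 512])
```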
The baseline implementation (auto-generated by Allo) is under `Baseline/`. However, it does not fit within the available on-chip storage.
We also provide the optimized version under `Optimized/`, with improvements in both latency and memory efficiency.
Clone this project and switch to the project root:
git clone https://github.com/silvenachen/FPGA-based-Accelerator-for-Diffusion-Language-Models.git
cd FPGA-based-Accelerator-for-Diffusion-Language-Models
To generate HLS code, you need Allo installed in your local environment. Then replace `./allo/library/nn.py` with `./allo_lib/nn.py` from our project, which is an updated library file with specialized DiT operators.
Next, run `python DDitBlock_Allo_Kernel.py`, which will automatically generate the HLS project. Alternatively, a pre-built project is available at `./Baseline/Allo_DDitBlock.prj`.
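For reference, HLS project generation in Allo follows the pattern sketched below. The kernel here is a placeholder GEMM (with dimensions chosen to match the tiny-model activations) standing in for the DiT operators; the actual kernel and schedule live in `DDitBlock_Allo_Kernel.py`, so treat this as a minimal sketch rather than the project's code.

```python
import allo
from allo.ir.types import float32

M, K, N = 1024, 512, 512  # placeholder dimensions matching the tiny-model activations

def gemm(A: float32[M, K], B: float32[K, N]) -> float32[M, N]:
    C: float32[M, N] = 0.0
    # Triple loop nest; Allo lowers this to an HLS kernel
    for i, j, k in allo.grid(M, N, K):
        C[i, j] += A[i, k] * B[k, j]
    return C

s = allo.customize(gemm)
# mode="csyn" runs C synthesis and produces resource/latency reports;
# mode="csim" instead builds a C-simulation executable for functional checks.
mod = s.build(target="vitis_hls", mode="csyn", project="gemm.prj")
mod()  # launches the Vitis HLS flow inside gemm.prj
```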
For the optimized version, please refer to `./Optimized/`, where you can run simulations and experiment with the kernel deployment.
For our hardware-side tests, all experiments are conducted on the AMD Alveo U280 FPGA using Vitis v2022.1, currently with a target frequency of 100 MHz. The U280 FPGA is equipped with 4032 BRAM 18K blocks, 9024 DSP slices, 2.6M flip-flops, 1.3M LUTs, and 960 URAM blocks.
We present a comparison of latency and resource usage between the baseline and optimized implementations. The table below summarizes the utilization of each resource along with the latency improvement (currently obtained from HLS synthesis; cycle-accurate results are expected from co-simulation and RTL synthesis).
| Version | Latency (ns) | BRAM | DSP | FF | LUT | URAM |
|---|---|---|---|---|---|---|
| Baseline | 5.1E9 | 36693 (910%) | 2791 (30%) | 273231 (11%) | 369770 (28%) | - |
| Optimized | 4.127E9 | 1532 (37%) | 1186 (13%) | 126044 (4%) | 170984 (13%) | 768 (80%) |
- The optimized version shows significant reductions in resource consumption and improvements in performance, particularly in BRAM and DSP usage, thanks to our memory-copy and resource-reuse techniques. For more implementation details, please refer to the MDLM Accelerator Documentation.
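To give a flavor of what such schedule-level optimizations look like in Allo (this is an illustrative sketch over the placeholder GEMM above, not the project's actual schedule), customization primitives such as loop reordering, on-chip buffering, and pipelining are applied to the kernel before building:

```python
# Continuing the placeholder GEMM from the code-generation sketch above;
# the real MDLM schedule applies analogous transformations to the DiT operators.
s = allo.customize(gemm)
s.reorder("k", "j")         # make the j loop innermost
s.buffer_at(s.C, axis="i")  # accumulate each output row in an on-chip buffer
s.pipeline("j")             # pipeline the (now innermost) j loop
mod = s.build(target="vitis_hls", mode="csyn", project="gemm_opt.prj")
```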
For detailed optimization methods and Allo feature suggestions, please check Optimization Techniques and Allo Feature Suggestions.
This project is currently in progress and is developed by Shuyang Li ([email protected]) under the guidance of Professor Zhiru Zhang and Ph.D. student Yixiao Du, during an internship at Cornell University.
For questions or collaborations, feel free to contact Shuyang via email!