CDA FPGA LLM Accelerator — Matrix Multiplication Demo

A complete FPGA-based matrix-multiplication accelerator targeting the Tang Primer 20K (GW2A-18, Gowin toolchain) with a PC host Python stack.

Repository Layout

fpga_matmul/
├── rtl/
│   ├── uart_rx.v        — UART receiver (8N1, parameterised baud)
│   ├── uart_tx.v        — UART transmitter
│   ├── matmul_core.v    — Parameterised sequential MAC engine (N=4/8/16)
│   ├── uart_ctrl.v      — Protocol controller (glues UART ↔ matmul_core)
│   └── top.v            — Chip top-level (clock, LED heartbeat)
├── sim/
│   ├── tb_matmul_core.v — Self-checking testbench for matmul_core
│   └── tb_uart.v        — UART TX→RX loopback testbench
├── constraints/
│   ├── top.cst          — Physical pin constraints (Gowin)
│   └── top.sdc          — Timing constraints
└── host/
    ├── cpu_baseline.py  — Measure CPU-only matmul time
    ├── fpga_host.py     — Send matrices to FPGA, verify, benchmark
    ├── fpga_sim.py      — Software FPGA simulator (no hardware needed)
    └── analyze_results.py — Tables + plots from benchmark data

Quick Start

1. Simulate (no hardware)

# Install Icarus Verilog
sudo apt install iverilog   # Ubuntu / WSL

# Compile and run matmul testbench
cd sim
iverilog -o sim_matmul tb_matmul_core.v ../rtl/matmul_core.v
vvp sim_matmul

# Compile and run UART loopback testbench
iverilog -o sim_uart tb_uart.v ../rtl/uart_rx.v ../rtl/uart_tx.v
vvp sim_uart

Expected output:

--- Test 1: I x B ---   cycles=84
--- Test 2: 1s x 1s --- cycles=84
ALL TESTS PASSED

2. Synthesise (Gowin IDE)

see INSTRUCTIONS.md
To change matrix size: open top.v and change parameter N = 4 to 8 or 16. (this is specified in the command line for fpga_host.py right?)

Check the synthesis report for:

LUT count
DSP block count
Max frequency (should be well above 27 MHz for N=4/8)

3. Run CPU Baseline

cd host
pip install numpy
python cpu_baseline.py

4. Run with Real FPGA

pip install pyserial numpy
python fpga_host.py --port /dev/ttyUSB0 --N 4 --iters 20

Common ports:

Linux: /dev/ttyUSB0 or /dev/ttyACM0
macOS: /dev/cu.usbserial-XXXX
Windows: COM3 (check Device Manager)

5. Run with Software Simulator (no hardware)

# Linux/macOS – create virtual serial pair
sudo apt install socat
socat -d -d PTY,raw,echo=0,link=/tmp/fpga_sim PTY,raw,echo=0,link=/tmp/fpga_host &

# Terminal 1
python fpga_sim.py --port /tmp/fpga_sim --N 4

# Terminal 2
python fpga_host.py --port /tmp/fpga_host --N 4 --iters 5

6. Analyze and Plot Results

python analyze_results.py --demo          # synthetic data
python analyze_results.py --results my_results.json  # your data

Design Notes

FPGA Architecture

PC ──UART──► [uart_rx] ──bytes──► [uart_ctrl FSM]
                                        │
                              ┌─────────┴──────────┐
                              │   matmul_core       │
                              │  (sequential MAC)   │
                              │  Latency ≈ N³ clks  │
                              └─────────┬──────────┘
                                        │
PC ◄──UART── [uart_tx] ◄──bytes── [uart_ctrl FSM]

matmul_core iterates three nested counters (i, j, k) and feeds one A[i][k] * B[k][j] into a 32-bit accumulator each clock cycle. Total compute cycles ≈ N³ + N² (write-back overhead).

Protocol

Direction	Content	Size
PC → FPGA	Command byte `0x01`	1 B
PC → FPGA	Matrix A (int16 LE, row-major)	N×N×2 B
PC → FPGA	Matrix B (int16 LE, row-major)	N×N×2 B
FPGA → PC	Matrix C (int32 LE, row-major)	N×N×4 B
FPGA → PC	Cycle count (uint32 LE)	4 B

Expected Performance (27 MHz clock, 115200 baud)

N	FPGA cycles	FPGA time	UART overhead	CPU naive
4	~84	~3.1 µs	~3.0 ms	~4 µs
8	~584	~21.6 µs	~9.4 ms	~42 µs
16	~4368	~162 µs	~36 ms	~580 µs

Note: UART dominates end-to-end time. The compute-only speedup (from the cycle counter) is what shows the FPGA advantage. A higher baud rate (e.g. 921600) or SPI/parallel bus would make end-to-end competitive too.

Scaling to N=16

N=16 requires 256 MAC units or 4096 sequential cycles.
With a sequential design the resource usage stays minimal (~100 LUTs, a few BRAMs, 1 DSP). Synthesis at 27 MHz should close timing easily.

For the DSP-parallel design (16 MACs in parallel), you'd need 16–64 DSP18 blocks – the GW2A-18 has 48, so N=4 parallel fits comfortably.

Troubleshooting

Symptom	Likely cause	Fix
No RX data from FPGA	Wrong port / baud	Check Device Manager / `dmesg`
Mismatched results	Byte-order bug	Verify LE packing in host script
Simulation hangs	Off-by-one in FSM	Check `N-1` comparisons in matmul_core
Synthesis fails timing	Clock too high	Lower to 27 MHz (already default)
LED not blinking	Wrong pin in .cst	Check board schematic

Dependencies

Hardware: Tang Primer 20K, USB cable, PC with Gowin IDE.

Python: numpy, pyserial, matplotlib (optional for plots).

pip install numpy pyserial matplotlib

Simulation: Icarus Verilog (free, cross-platform).

Reference

TangPrimer-20K-example

Useful VS Code Extensions

.cst file syntax highlighting: wxhenry.fpga-file-support
.v file syntax highlighting: mshr-h.veriloghdl

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
Documentation		Documentation
constraints		constraints
host		host
rtl		rtl
sim		sim
INSTRUCTIONS.md		INSTRUCTIONS.md
README.md		README.md
TROUBLESHOOTING.md		TROUBLESHOOTING.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CDA FPGA LLM Accelerator — Matrix Multiplication Demo

Repository Layout

Quick Start

1. Simulate (no hardware)

2. Synthesise (Gowin IDE)

3. Run CPU Baseline

4. Run with Real FPGA

5. Run with Software Simulator (no hardware)

6. Analyze and Plot Results

Design Notes

FPGA Architecture

Protocol

Expected Performance (27 MHz clock, 115200 baud)

Scaling to N=16

Troubleshooting

Dependencies

Reference

Useful VS Code Extensions

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

CDA FPGA LLM Accelerator — Matrix Multiplication Demo

Repository Layout

Quick Start

1. Simulate (no hardware)

2. Synthesise (Gowin IDE)

3. Run CPU Baseline

4. Run with Real FPGA

5. Run with Software Simulator (no hardware)

6. Analyze and Plot Results

Design Notes

FPGA Architecture

Protocol

Expected Performance (27 MHz clock, 115200 baud)

Scaling to N=16

Troubleshooting

Dependencies

Reference

Useful VS Code Extensions

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages