Course: ECE 327 – Digital Hardware Systems (Dr. Nachiket Kapre)
Target: Xilinx PYNQ-Z1 FPGA
Language: SystemVerilog (RTL)
Theme: Integer-friendly transformer primitives and a minimal end-to-end token generator on FPGA
Huge thank you to Dr. Nachiket Kapre for making this course possible at the University of Waterloo!
- Clock: positive edge.
- Reset (`rst`): synchronous, active-high.
- Initialize (`initialize`/`init`): starts a fresh accumulation/transaction without discarding the current input.
- Streaming: ready/valid (AXI-Stream-like: `tdata`, `tvalid`, `tready`, `tlast`).
- Latency accounting: all pipelines document stage depth; integration aligns via shift registers/FIFOs.
- `ACC` (Accumulator): sums a stream `in_data` into `result` each clock; synchronous reset clears to 0. When `initialize=1`, it seeds `result` with the current sample (start fresh without dropping it). Latency: 1 cycle.
- `MAC` (Multiply-Accumulate): computes a dot product by accumulating `a*b` each cycle. `initialize=1` seeds the sum with `a*b`; reset clears to 0.
- `MAX` (Running Maximum): tracks the largest input seen so far; updates only when a new max arrives. `initialize` seeds with the current input; reset clears to 0.
- `ARRAY` (N-lane MACs): instantiates N independent `mac` lanes (`a[k]`, `b[k]`, `initialize[k]`) to produce parallel dot-product partial sums as a vector.
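The ACC/MAC/MAX semantics above can be sketched as a cycle-level Python reference model (one `step` call stands in for one clock edge; signal names follow the bullets, but the class structure is mine, not the course RTL):

```python
class Acc:
    """1-cycle-latency accumulator: result register updates each 'clock'."""
    def __init__(self):
        self.result = 0

    def step(self, in_data, initialize=False, rst=False):
        if rst:                       # synchronous reset clears to 0
            self.result = 0
        elif initialize:              # seed with current sample (don't drop it)
            self.result = in_data
        else:
            self.result += in_data
        return self.result


class Mac(Acc):
    """MAC = ACC fed by a product: accumulates a*b; initialize seeds with a*b."""
    def step(self, a, b, initialize=False, rst=False):
        return super().step(a * b, initialize, rst)


class Max:
    """Running maximum: updates only when a larger input arrives."""
    def __init__(self):
        self.result = 0

    def step(self, in_data, initialize=False, rst=False):
        if rst:
            self.result = 0
        elif initialize:
            self.result = in_data
        else:
            self.result = max(self.result, in_data)
        return self.result
```

An N-lane `ARRAY` is then just a list of independent `Mac` instances stepped with `a[k]`, `b[k]`, `initialize[k]`.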
Context: Integer math + polynomial approximations for FPGA efficiency. EXP/GELU are deep pipelines; DIV is FSM (multi-cycle).
- Goal: `quotient = floor(dividend / divisor)`.
- FSM: `INIT → COMP`.
  - INIT: wait for `in_valid`; latch dividend/divisor; clear quotient.
  - COMP: across cycles, compute LOPD (leading-one position detection) on remainder and divisor to choose the optimal shift; subtract the aligned divisor; update quotient; repeat.
- Done: when remainder < divisor → assert `out_valid`, return to `INIT`.
- Why FSM: small area, predictable latency; avoids large combinational divider.
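A Python sketch of the COMP loop, where each iteration models one FSM cycle (the function names and exact back-off check are mine; the RTL's LOPD is a priority encoder over the bit vector):

```python
def lopd(x: int) -> int:
    """Leading-one position detector: bit index of the MSB set bit."""
    return x.bit_length() - 1


def divide(dividend: int, divisor: int):
    """Shift-and-subtract division, one alignment per 'cycle':
    align divisor under the remainder's leading one, subtract,
    set the matching quotient bit; stop when remainder < divisor
    (the point where the RTL asserts out_valid)."""
    assert divisor > 0 and dividend >= 0
    quotient, rem = 0, dividend
    while rem >= divisor:                   # one loop pass ~ one COMP cycle
        shift = lopd(rem) - lopd(divisor)   # optimal alignment from LOPDs
        if (divisor << shift) > rem:        # LOPD overshoot: back off one bit
            shift -= 1
        rem -= divisor << shift
        quotient |= 1 << shift              # shifts strictly decrease, so OR is safe
    return quotient, rem
```

The shift strictly decreases each pass, so the loop finishes in at most `lopd(dividend)+1` iterations: predictable multi-cycle latency with tiny area, versus a huge combinational divider.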
- Approach: 2nd-order polynomial approximation of `exp(x)` with integer/fixed-point scaling.
- Pipeline: multiple stages for full throughput (new input every cycle).
- Trade-off: small accuracy loss vs. large area/speed savings.
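One way such an integer `exp` can look, sketched in Python with an i-BERT-style range reduction and 2nd-order polynomial (the Q16 format and these particular coefficients are my assumptions, not necessarily the course's values):

```python
import math

FRAC = 16                       # assumed Q16 fixed point
ONE = 1 << FRAC
LN2 = int(round(math.log(2) * ONE))

def iexp(x_q16: int) -> int:
    """Integer exp(x) for x <= 0 (inputs are post max-subtraction).
    Range reduction: x = -q*ln2 + r with r in (-ln2, 0], so
    exp(x) = 2**-q * exp(r), and exp(r) is fit by the 2nd-order
    polynomial 0.3585*(r + 1.353)**2 + 0.344 (i-BERT's i-exp fit)."""
    assert x_q16 <= 0
    q = (-x_q16) // LN2               # integer multiples of ln2
    r = x_q16 + q * LN2               # remainder in (-LN2, 0], Q16
    a = int(round(0.3585 * ONE))
    b = int(round(1.3530 * ONE))
    c = int(round(0.3440 * ONE))
    t = r + b                               # (r + 1.353)
    poly = (a * ((t * t) >> FRAC)) >> FRAC  # 0.3585 * (r + 1.353)^2
    return (poly + c) >> q                  # * 2**-q is just a right shift
```

Each arithmetic line maps naturally onto one pipeline stage in the RTL, which is what buys full throughput at a small accuracy cost.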
- Approach: integer-friendly polynomial/tanh-style approximation.
- Pipeline: full throughput—after the pipeline fills, it can accept a new input every clock and produce a result every clock.
- Why: smoother than ReLU; works well under quantization (iBERT-style).
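A float sketch of the i-BERT-style GELU approximation for reference (shown in float for clarity; the RTL version would use the same fixed-point machinery as `exp`, and the constants below are i-BERT's published erf fit, assumed rather than taken from the course):

```python
import math

def igelu(x: float) -> float:
    """GELU(x) = x * 0.5 * (1 + erf(x / sqrt(2))), with erf replaced by a
    2nd-order polynomial: erf(t) ~= sgn(t) * (a*(min(|t|, -b) + b)**2 + 1),
    a = -0.2888, b = -1.769. Integer-friendly: one clamp, one square."""
    a, b = -0.2888, -1.769
    t = x / math.sqrt(2)
    s = math.copysign(1.0, t)
    p = min(abs(t), -b)            # clamp so the parabola stays monotone
    erf_approx = s * (a * (p + b) ** 2 + 1.0)
    return x * 0.5 * (1.0 + erf_approx)
```

The clamp at `|t| = 1.769` makes the approximation saturate exactly where erf does, so large inputs pass through as `x` (positive) or 0 (negative), matching GELU's tails.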
- Computation: each PE cell computes a multiply-accumulate: `out_data = in_data + in_a * in_b`.
- Chaining: can start fresh (`init=1`) or add to a prior partial sum (`in_data`).
- Pipelining: inputs are registered each cycle; once filled, the PE accepts a new input every clock and streams results every clock.
- Latency: one cycle per stage (steady state produces outputs every clock).
Suggested diagram: single-PE datapath (multiplier → adder → register) with pipeline registers.
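Functionally (ignoring the pipeline registers), a PE and a chained row of PEs reduce to this Python model (function names are mine):

```python
def pe(in_data, in_a, in_b, init=False):
    """One PE cell: out_data = in_data + in_a * in_b,
    with init=1 discarding the incoming partial sum."""
    return (0 if init else in_data) + in_a * in_b


def dot(a, b):
    """A row of chained PEs computing a dot product: the partial sum
    hops one PE per cycle, seeded fresh at the leftmost cell."""
    psum = 0
    for k, (x, y) in enumerate(zip(a, b)):
        psum = pe(psum, x, y, init=(k == 0))
    return psum
```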
- Dataflow: a grid of PEs; `A` flows left→right, `B` flows top→down; partial sums move across rows one hop per cycle.
- Output: row results emerge from the rightmost PEs.
- Goal: after the array is filled, sustain one new computation per PE per clock with outputs streaming each cycle.
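Stripping out the cycle timing, the grid computes a plain matrix product; a minimal functional golden model (the function name is mine) for checking the RTL against:

```python
def systolic_rows(A, B):
    """Functional (not cycle-accurate) model of the PE grid: PE(i, k)
    contributes A[i][k] * B[k][j] to row i's travelling partial sum,
    and row i of A @ B emerges from the rightmost PE of row i."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]
```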
- Row-interleaved banking: distribute rows across N1 banks (0..N1-1), then wrap.
- Row 0 → bank 0, Row 1 → bank 1, …, Row N1-1 → bank N1-1, Row N1 → bank 0, etc.
- Ring counter: cycles 0..N1-1 to select bank per new row.
- Base address bump: after a full stripe of N1 rows, increase base by M2 so the next row for bank 0 appends correctly.
- Traversal: `col` increments each cycle; on wrap (M2-1 → 0), `row++`.
| Aspect | Shift Register (SREG) | FIFO (BRAM/Register) |
|---|---|---|
| Structure | Chain of FFs (1 FF/bit/stage) | Memory with read/write pointers |
| Resource | Linear in depth × width (FF heavy) | Uses BRAM for large depth; small ones in regs |
| Logic | Minimal | Pointer logic + flags (full/empty) |
| Best use | Small, fixed delays (latency alignment) | Deep buffers, rate matching, back-pressure absorption |
- Shift Registers (SREG): great for small, fixed delays; cheap control; poor for stalls or variable latency.
- FIFOs: handle back-pressure, rate mismatches, and variable latencies; map to BRAM for deeper queues → much more area-efficient.
- Choice: I switched to FIFOs to guarantee synchronized data flow across branches (especially around `exp`/`acc`/`div`) and to propagate stalls cleanly.
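The behavioral difference in the table can be sketched with two small Python models (class names are mine; a real RTL FIFO adds the full/empty flags to the ready/valid handshake):

```python
from collections import deque

class Sreg:
    """Fixed-delay shift register: output at cycle t is the input from
    cycle t - depth. Cheap, but the delay is fixed; it cannot stall."""
    def __init__(self, depth):
        self.q = deque([None] * depth)   # None models the empty pipe
    def step(self, x):
        out = self.q.popleft()
        self.q.append(x)
        return out


class Fifo:
    """Elastic buffer: push/pop are independent, so it absorbs stalls
    and rate mismatches; full/empty drive tready/tvalid in the RTL."""
    def __init__(self, depth):
        self.depth, self.q = depth, deque()
    def full(self):
        return len(self.q) == self.depth
    def empty(self):
        return len(self.q) == 0
    def push(self, x):
        assert not self.full()
        self.q.append(x)
    def pop(self):
        assert not self.empty()
        return self.q.popleft()
```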
- Flow: `max` (per row/vec) → subtract max → `exp` (pipelined poly) → quantize/shift → accumulate sum → divide to get scaling factor → multiply inputs by factor → optional output shift/quantize.
- Buffering strategy:
  - FIFO after `exp` to align with the slower sum/divide branch.
  - FIFO on the factor path to rejoin with the original data stream for the final multiply.
  - Frame with `tlast` so reductions reset correctly.
- Numerics: staged quantization after `exp` and after normalization; document rounding/saturation.
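End to end, the flow reduces to this integer reference model (self-contained sketch: `math.exp` on dequantized values stands in for the pipelined polynomial, and the Q16 format is my assumption):

```python
import math

def int_softmax(xs, frac=16):
    """Reference model of the streaming softmax: running max (framed by
    tlast) → subtract max → exp → accumulate sum → divide per element.
    Inputs and outputs are Q(frac) fixed-point integers."""
    one = 1 << frac
    m = max(xs)                                        # MAX block
    exps = [int(round(math.exp((x - m) / one) * one))  # EXP (poly in RTL)
            for x in xs]
    total = sum(exps)                                  # ACC block
    return [(e * one) // total for e in exps]          # DIV block
```

Subtracting the max first keeps every `exp` argument non-positive, which is exactly what lets the integer `exp` pipeline avoid overflow.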
- 11 Stages: accumulate mean/variance → affine scale/shift.
- Reuse: ACC/MAC/MAX/DIV/EXP building blocks.
- Flow: accumulate sum and sum of squares → compute mean and variance → derive scale (1/std) → apply `(x − mean) * scale * gamma + beta`.
- Buffering strategy:
  - FIFO on the raw `x` stream to wait for the longer mean/variance path.
  - Optional FIFO after scale to absorb occasional stalls.
  - Reset/flush on `tlast` to partition rows/feature windows.
- Numerics: fixed-point mean/var; epsilon/shift to stabilize std; quantize after affine step if downstream precision is lower.
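The flow above, as a float reference model (float for clarity; the RTL does this in fixed point with a shift-based epsilon, and the epsilon value here is a placeholder I chose):

```python
import math

def layernorm_ref(xs, gamma, beta, eps=1e-5):
    """Reference for the LayerNorm pipeline: a single pass accumulates
    sum (ACC) and sum of squares (MAC); mean/variance follow, then
    1/std (DIV-style step), then the affine apply to the buffered xs."""
    n = len(xs)
    s = sum(xs)                       # ACC: running sum
    sq = sum(x * x for x in xs)       # MAC: running sum of squares
    mean = s / n
    var = sq / n - mean * mean        # E[x^2] - E[x]^2, one pass over xs
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(x - mean) * inv_std * gamma + beta for x in xs]
```

Computing variance as `E[x^2] - E[x]^2` is what makes the single streaming pass possible: both reductions finish together, and the FIFO'd `x` stream is replayed only once for the affine step.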
- Compute: Q, K, V projections via systolic GEMM → scaled dot-product (`Q·Kᵀ`) → softmax → apply to `V`.
- Interfaces: AXI-Stream shims on every block for clean composition.
- Top-level: systolic array + requant + layernorm + GELU + attention; minimal token emission path.
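A compact float golden model of the attention chain for verifying the composed blocks (pure-Python, function name mine; rows of `K` are key vectors, so the inner product gives `Q·Kᵀ`):

```python
import math

def attention_ref(Q, K, V):
    """Scaled dot-product attention: scores = Q·Kᵀ / sqrt(d) (systolic
    GEMM in hardware), row-wise softmax, then a second GEMM with V."""
    d = len(Q[0])
    scores = [[sum(q[k] * kv[k] for k in range(d)) / math.sqrt(d)
               for kv in K] for q in Q]

    def softmax(row):
        m = max(row)                       # max-subtract, as in the RTL
        e = [math.exp(x - m) for x in row]
        s = sum(e)
        return [x / s for x in e]

    P = [softmax(r) for r in scores]
    return [[sum(p[j] * V[j][c] for j in range(len(V)))
             for c in range(len(V[0]))] for p in P]
```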
| Term | What It Measures | Typical Trade-offs / Notes |
|---|---|---|
| LUTs | Combinational logic | Wide muxes/big arithmetic grow LUTs; excessive LUTs = inefficient mapping. |
| Slice Registers (FFs) | 1-bit storage (pipelines/FSM) | More FFs → higher Fclk via pipelining; over-pipelining costs area. |
| DSPs | Hard MACs / multipliers | Prefer for big math; falling back to LUTs explodes LUT usage. |
| Slices | Physical grouping of LUTs+FFs | Poor packing if FF/LUT use is unbalanced; hurts P&R. |
| BRAM | On-chip memory | Use for large buffers/FIFOs; avoid building RAM from LUTs/FFs. |
- Pipelining: +FFs, often −LUT depth per stage → higher Fclk.
- Use DSPs for multipliers/adders; ensure inference isn’t blocked by bit-width or synthesis pragmas.
- Use BRAM for any meaningful buffer/table depth; leave regs for latency shims.
- Minimize unique control sets by sharing resets/enables where possible.
- Balance: watch Slice packing; skewed FF/LUT ratios reduce density.
| Metric | Value / Notes |
|---|---|
| Fclk (post-route) | XXX MHz |
| LUTs | XX,XXX |
| FFs | XX,XXX |
| BRAM (36Kb eq.) | XX |
| DSP48 | XX |
| Softmax throughput | N tokens/s (tile size T) |
| Attention latency | X cycles (Q/K/V in → out ready) |
| Power (est.) | X.X W @ YYY MHz |