Commit 18ba443

committed
new docs article
1 parent e41569f commit 18ba443

24 files changed

+3825
-14
lines changed

docs/.gitignore

+1
@@ -0,0 +1 @@
1+
node_modules/

docs/content/index.md

+3-6
@@ -6,20 +6,17 @@ This site hosts documentation on [punt-engine](https://github.com/raquentin/punt
66

77
## Objective
88

9-
As a team of undergraduates on a $200 FPGA with a bankroll stimmed largely by microstakes poker winnings, we can safely attribute any profits generated from this infrastructure to good fortune. A better goal here is to build some of the first well-documented open source software in this genre.
9+
This project is maintained by undergraduates. We use a $200 FPGA and have no bankroll. The main goal here is to build some of the first well-documented open source software in this genre.
1010

1111
## Overview
12-
13-
![[images/punt-engine-systems-design.png]]
14-
15-
Understand more of this image in [[notes/the-punt-engine-system|The Punt Engine system]].
12+
See [[notes/the-punt-engine-system|The Punt Engine system]].
1613

1714
## Contributing
1815

1916
All contributors are welcome. Begin by reading some of this site and looking through [open issues tagged "help wanted"](https://github.com/raquentin/punt-engine/issues?q=is%3Aissue+is%3Aopen+label%3A%22help+wanted%22).
2017

2118
Many modules within our monorepo have dedicated pages here listing dependencies and local dev instructions, find them using the search bar.
2219

23-
There will inevitably be holes in the docs. If you have questions or want guidance, we hold office hours in [our Discord](https://discord.gg/2A2dpwfxBF) on Tuesday, Thursday, and Saturday from 9:30am to 12:00pm EDT/UTC-4.
20+
There will inevitably be holes in the docs. If you have questions or want guidance, we hold office hours in [our Discord](https://discord.gg/2A2dpwfxBF).
2421

2522
You can direct other questions to [email protected].
@@ -0,0 +1,343 @@
1+
---
2+
title: Building a moving average accumulator core
3+
date: 02-24-2025
4+
---
5+
6+
This post guides readers from a blank folder to a working and tested moving average accumulator gateware circuit. I'll first describe the intuition for the algorithm, then implement it in SystemVerilog, then write a Verilator testbench in C++.
7+
8+
What we're building is a clocked circuit that takes in a price and returns the average price over the last $k$ samples. We'll parameterize $k$ and the bit width of the price, $n$, using the SystemVerilog `parameter` keyword.
9+
10+
## Setup
11+
12+
This post is a coding exercise. Create a new folder with nothing in it.
13+
14+
You'll need these dependencies:
15+
- Verilator
16+
- A C++ compiler
17+
- Make
18+
19+
Our project uses Nix for this. We recommend using our [Nix Flake developer environment](https://github.com/raquentin/punt-engine/blob/main/flake.nix) to get the exact versions our project builds with. You can delete all the Haskell/Clash stuff; we're just working with Verilog today. Otherwise, just download the deps yourself.
20+
21+
Create these files:
22+
- `Makefile`: We'll use Make to wrap the longer Verilator and `make` commands.
23+
- `moving_average_accumulator.sv`: This is the moving average accumulator. We'll implement it in SystemVerilog.
24+
- `tb.cpp`: The testbench. We compare the output of a software model to that of our accumulator core to ensure it works.
25+
26+
Now you're set up. Let's begin.
27+
28+
## Intuition
29+
30+
Consider the simplest interface to an accumulator circuit. Like any clocked circuit, we'll have an `input wire clk`. Likewise, we'll have an `input wire reset`.
31+
32+
That's the foundation, but what is the real i/o for this circuit? We take in a stream of prices. What we're calling a price is a value represented by a bus of $n$ wires. Together, these wires can represent $2^n$ distinct values, from $0$ to $2^n - 1$. In our case, these values are prices. We'll have an `input wire unsigned d_in`, which holds the price that we are to sample on the next rising clock edge. We don't want to lock users of this circuit to only a certain $n$ value; our core should be agnostic to the bit depth of the price of the asset that it is accumulating. So we'll abstract away this $n$ using a parameter called `DataWidth`.
33+
34+
With these inputs, we perform some logic and produce an output. This output is going to be a price, so it will be the same type as the `d_in` wire.
35+
36+
Here's what that looks like in `moving_average_accumulator.sv`:
37+
```v
38+
module mov_avg_acc #(
39+
parameter integer Exponent /*verilator public*/ = 3,
40+
parameter integer DataWidth /*verilator public*/ = 16
41+
) (
42+
input wire clk,
43+
input wire reset,
44+
input wire unsigned [DataWidth-1:0] d_in,
45+
output reg unsigned [DataWidth-1:0] d_out
46+
);
47+
```
48+
49+
Let's break down this syntax a bit:
50+
- `module mov_avg_acc` defines a reusable hardware component called mov_avg_acc. Other modules can embed this module, linking their wires to the inputs and outputs of our core.
51+
- Next is a list bound by `#(...)`. This is the parameter list, where constants like Exponent ($\log_2(k)$) and DataWidth ($n$) go. When synthesizing our mov_avg_acc core, we can pass these parameters in without having to go in and change the internal RTL of our circuit.
52+
- Finally, there is the list bound by `(...)`. Within this list are all i/o. In our case, we have the three inputs and one output described above.
53+
54+
### Circular buffers
55+
56+
Let's ignore the interface described above and generalize the moving average problem a bit. We have a stream of prices over time, and want to maintain the average over the last $k$ of them. Naively, computing an average involves summing up all samples in a set and dividing that sum by the size of the set.
57+
58+
Consider an implementation of the circuit that maintains a stack-like array structure of infinite size. On each clock edge, we push the current price (the value of `d_in`) on top of this stack. Then we sum up the top $k$ elements on the stack, divide that sum by $k$, and put that result into the register `d_out`.
59+
60+
The problem with this is that we don't have infinite memory. So instead of an infinitely large stack, we'll maintain an array of size $k$ and the index of the oldest sample in that array, called `oldest_index`. Since we're only concerned with the $k$ most recent samples, on a given clock edge, the price in the array at `oldest_index` was sampled $k$ samples ago and hence does not impact the value we push to `d_out`. So on this new edge, we can just overwrite `array[oldest_index]` with `d_in`, fixing the infinite memory problem.
61+
62+
So now we have an array in our circuit holding the $k$ most recent samples. On each cycle, we could sum up the array in $O(k)$ time, then divide by $k$ and be done. That's a bit naive, though. We can update the sum in $O(1)$ time by realizing that on a given cycle it increases by `d_in` and decreases by `array[oldest_index]`, the price from $k$ samples ago. So instead of summing up the whole array, we just subtract the oldest sample and add the newest one, `d_in`. Then we divide by $k$.
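
As a rough software analogue of the circular buffer and the constant-time sum update, here is a minimal C++ sketch. The class `MovingSum` and its members are hypothetical names made up for illustration; they are not part of the core, and the 16-bit sample and 32-bit sum types are assumptions chosen to mirror the defaults used later in this post.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical software model of the scheme described above.
// Names are illustrative and do not come from the project.
class MovingSum {
public:
    explicit MovingSum(std::size_t k) : buffer_(k, 0) { assert(k > 0); }

    // Overwrite the oldest sample with `sample` and update the running sum
    // in O(1): subtract the value being evicted, then add the new one.
    std::uint32_t push(std::uint16_t sample) {
        sum_ -= buffer_[oldest_index_];
        sum_ += sample;
        buffer_[oldest_index_] = sample;
        oldest_index_ = (oldest_index_ + 1) % buffer_.size();
        return sum_;
    }

    // O(k) reference: re-sum the whole buffer. Only useful as a cross-check.
    std::uint32_t naive_sum() const {
        std::uint32_t total = 0;
        for (std::uint16_t s : buffer_) total += s;
        return total;
    }

private:
    std::vector<std::uint16_t> buffer_;
    std::size_t oldest_index_ = 0;
    std::uint32_t sum_ = 0;
};
```

Calling `push` repeatedly and comparing its return value against `naive_sum()` is a quick way to convince yourself that the two approaches agree.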
63+
64+
### An optimization for $k = 2^n$.
65+
66+
There's one last optimization. Imagine that in the circuit above $k$ is 10; we're calculating the average over the last 10 samples. Dividing by 10 on computers is slow and difficult. We either use a floating point unit and deal with fractions of the price, or use fast division algorithms. Look up hardware division algorithms if you care to learn more, but the point here is that they are slow and we want to avoid them when possible.
67+
68+
When $k = 2^n$ (for some $n \in \mathbb{N}$, not the $n$ from before related to bit depth), we can use the right shift operation to divide the sum by $k$ in less than one clock cycle. Instead of performing some series of operations to implement a general division algorithm, we essentially just ignore the least significant $n = \log_2(k)$ bits of the sum.
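
To make the shift trick concrete, here is a small C++ sketch. The function name `average_pow2` is made up for this example, and `exponent` stands in for $\log_2(k)$; for unsigned values, the shift gives the same result as a truncating division.

```cpp
#include <cstdint>

// Divide an unsigned sum by k = 2^exponent by dropping its low `exponent` bits.
// `average_pow2` is a hypothetical helper, not something from the project.
constexpr std::uint32_t average_pow2(std::uint32_t sum, unsigned exponent) {
    return sum >> exponent;
}

// For unsigned values the shift and the truncating division agree.
static_assert(average_pow2(100, 3) == 100 / 8, "shifting by 3 divides by 8");
static_assert(average_pow2(7, 3) == 0, "the fractional part is truncated");
```

Note that the result is truncated toward zero, so the fractional part of the average is lost.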
69+
70+
## Implementation
71+
72+
You've learned the algorithm. Let's implement it. There's a few things to note first before we get into the full code:
73+
74+
### The bit width of the sum of `array` $\neq n$
75+
76+
Recall that the `array` consists of $k$ elements of size $n$ (`[DataWidth-1:0]`). In the calculation of `d_out`, we can't simply store their sum in $n$ bits because it can overflow: the sum of $k$ values that are each at most $2^n - 1$ can be as large as $k(2^n - 1)$. To prevent overflow, the register holding the sum must be $n + \log_2(k)$ bits wide. So in our circuit we have:
77+
```v
78+
reg unsigned [AccWidth-1:0] acc;
79+
```
80+
Note that this is not a `reg` in the circuit's i/o list; it's a local/intermediate `reg`.
81+
82+
### `localparam`
83+
84+
For convenience, we have:
85+
```v
86+
localparam integer N = 1 << Exponent;
87+
localparam integer AccWidth = DataWidth + Exponent;
88+
```
89+
90+
`N` is the window size $k$, computed as `1 << Exponent` (the full listing below calls it `K`). `AccWidth` is the bit width of the `acc` reg, which is the numerator in the average calculation, i.e. the sum of the elements of `array`. It is $n + \log_2(k)$ bits wide, which is `DataWidth + Exponent`.
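
As a sanity check on that width, here is a short C++ sketch that computes the worst-case sum and tests whether it fits in a given number of bits. The helpers `max_sum` and `fits_in` are hypothetical names, and the sketch assumes all widths stay below 64 bits.

```cpp
#include <cstdint>

// Largest possible sum of k = 2^exponent samples, each data_width bits wide:
// (2^data_width - 1) * 2^exponent. Assumes data_width + exponent < 64.
constexpr std::uint64_t max_sum(unsigned data_width, unsigned exponent) {
    return ((std::uint64_t{1} << data_width) - 1) << exponent;
}

// True when `value` is representable in `bits` bits (bits < 64).
constexpr bool fits_in(std::uint64_t value, unsigned bits) {
    return (value >> bits) == 0;
}

// With the defaults (DataWidth = 16, Exponent = 3), 19 bits are enough
// and 18 bits are not.
static_assert(fits_in(max_sum(16, 3), 16 + 3), "DataWidth + Exponent suffices");
static_assert(!fits_in(max_sum(16, 3), 16 + 2), "one bit fewer can overflow");
```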
91+
92+
The `localparam` keyword is a bit like C's `#define`. We set up these aliases to avoid pasting those rvalues all over the place.
93+
94+
### `for` loops in hardware description languages
95+
96+
In the code below, you'll see a loop that resets the elements of the `array` of sample history, here named `sample_buffer`:
97+
```v
98+
for (i = 0; i < N; i++) begin
99+
sample_buffer[i] <= '0;
100+
end
101+
```
102+
103+
Remember that we're describing hardware, not writing software. `for` here is not the imperative `for` loop that it is in software. Here, it is essentially a compile-time `for` that expands into an assignment of each element of the sample buffer to `0`, without having to copy-paste that assignment for each hardcoded `i` up to `N`. This loop does NOT perform some type of iterative operation at runtime.
104+
105+
106+
### Sign extension
107+
108+
Recall the operation where we subtract out the oldest sample and add in the newest. Here's the code for that:
109+
```verilog
110+
acc <= acc
111+
- {
112+
{(AccWidth - DataWidth){sample_buffer[oldest_index][DataWidth-1]}},
113+
sample_buffer[oldest_index]
114+
}
115+
+ {{(AccWidth - DataWidth){d_in[DataWidth-1]}}, d_in};
116+
```
117+
118+
In short, we're setting `acc <= acc - subtrahend + addend`.
119+
120+
In the subtrahend, `{(AccWidth - DataWidth){sample_buffer[oldest_index][DataWidth-1]}}` performs a sign extension: it takes the most significant bit of the oldest sample, replicates it `AccWidth - DataWidth` times, and concatenates that with the oldest sample itself, widening it to the size of `acc` so the two can be subtracted. That's a bit confusing, since all these values are unsigned. The pattern only preserves an unsigned value while the sample's top bit is 0, that is, for prices below $2^{n-1}$; for full-range unsigned inputs the padding bits should be zeros instead.
121+
122+
Similarly, `{{(AccWidth - DataWidth){d_in[DataWidth-1]}}, d_in}` expands `d_in` to the width of `acc`.
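
To see how the two kinds of widening differ, here is a small C++ sketch that mimics the replication pattern with the default widths. The constants and function names are assumptions made for this example (a 16-bit sample widened to a 19-bit accumulator); they are not taken from the core.

```cpp
#include <cstdint>

// Widths matching the post's defaults (DataWidth = 16, Exponent = 3).
const unsigned kDataWidth = 16;
const unsigned kAccWidth = 19;
const std::uint32_t kAccMask = (std::uint32_t{1} << kAccWidth) - 1;

// Mimics {{(AccWidth-DataWidth){x[DataWidth-1]}}, x}:
// the upper bits are filled with copies of the sample's MSB.
inline std::uint32_t sign_extend(std::uint16_t x) {
    const std::uint32_t msb = (x >> (kDataWidth - 1)) & 1u;
    const std::uint32_t pad = msb ? (kAccMask & ~std::uint32_t{0xFFFF}) : 0u;
    return pad | x;
}

// Fills the upper bits with zeros, which preserves the unsigned value.
inline std::uint32_t zero_extend(std::uint16_t x) {
    return x;
}
```

For a sample such as `0x1234`, both functions return `0x1234`. For `0x8000`, `zero_extend` returns `0x8000` while `sign_extend` returns `0x78000`, which is no longer the same unsigned number.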
123+
124+
## The entire SystemVerilog core
125+
126+
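With the intuition and notes out of the way, we've almost accidentally built the whole circuit. It's pasted below with comments inline. Note that the full module is named `moving_average_accumulator`, the name the testbench and Makefile expect, rather than the shorter `mov_avg_acc` used in the snippet above.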
127+
128+
```verilog
129+
module moving_average_accumulator #(
130+
// we use Exponent to enforce that k = 2^n
131+
parameter integer Exponent /*verilator public*/ = 3,
132+
parameter integer DataWidth /*verilator public*/ = 16 // n
133+
) (
134+
input wire clk,
135+
input wire reset,
136+
input wire unsigned [DataWidth-1:0] d_in,
137+
output reg unsigned [DataWidth-1:0] d_out
138+
);
139+
140+
localparam integer K = 1 << Exponent;
141+
localparam integer AccWidth = DataWidth + Exponent;
142+
143+
// holds sum of last `K` samples
144+
reg unsigned [AccWidth-1:0] acc;
145+
146+
// circular buffer storing last k samples
147+
reg unsigned [DataWidth-1:0] sample_buffer[K];
148+
149+
// index of oldest element in `sample_buffer`
150+
integer oldest_index;
151+
152+
// loop
153+
integer i;
154+
155+
always_ff @(posedge clk or posedge reset) begin
156+
if (reset) begin
157+
acc <= '0;
158+
d_out <= '0;
159+
160+
// reset sample buffer
161+
for (i = 0; i < K; i++) begin
162+
sample_buffer[i] <= '0;
163+
end
164+
165+
oldest_index <= 0;
166+
end else begin
167+
// subtract oldest, add newest
168+
acc <= acc
169+
- {
170+
{(AccWidth - DataWidth){sample_buffer[oldest_index][DataWidth-1]}},
171+
sample_buffer[oldest_index]
172+
}
173+
+ {{(AccWidth - DataWidth){d_in[DataWidth-1]}}, d_in};
174+
175+
// overwrite oldest sample
176+
sample_buffer[oldest_index] <= d_in;
177+
178+
// inc oldest_index. wrap around using modulo K
179+
oldest_index <= (oldest_index + 1) % K;
180+
181+
// compute moving average by dividing by 2^exponent
182+
d_out <= acc[AccWidth-1:Exponent]; // like a right shift
183+
end
184+
end
185+
186+
endmodule
187+
```
188+
189+
## Testbench
190+
191+
We have the gateware core completed; now let's test it. If we can assume some knowledge of C++, I can just explain the testbench in comments:
192+
193+
```cpp
194+
// verilator compiles our moving_average_accumulator.sv to cpp, that's the file below
195+
#include "Vmoving_average_accumulator.h"
196+
// we'll instantiate it in our function, run some inputs through it, and compare
197+
198+
#include "verilated.h"
199+
#include "verilated_vcd_c.h"
200+
#include <cstdint>
#include <cstdlib>
201+
#include <iostream>
202+
#include <vector>
203+
204+
#define SIM_TIME 20 // Simulation time in clock cycles
205+
206+
int main(int argc, char **argv) {
207+
// setup
208+
Verilated::commandArgs(argc, argv);
209+
// top is a C++ version of the SystemVerilog core we wrote
210+
Vmoving_average_accumulator *top = new Vmoving_average_accumulator;
211+
VerilatedVcdC *tfp = nullptr;
212+
vluint64_t sim_time = 0;
213+
214+
// these are the input wires we had in our circuit.
215+
// we're setting them up
216+
// reset->1 sets up the sample_buffer
217+
top->clk = 0;
218+
top->reset = 1;
219+
top->d_in = 0;
220+
221+
// what we're really doing from here down is setting up the "expected" values
222+
// that we'll compare to the output of our verilog circuit.
223+
// to get these expected values, we need to rewrite our circuit's logic in C++
224+
225+
// same constants from sv
226+
const int Exponent = 3;
227+
const int N = 1 << Exponent;
228+
const int DataWidth = 16;
229+
230+
// our C++ version doesn't naturally respond to the clock
231+
// and hence doesn't have a delay in its output
232+
// we store the values like this to mock this delay
233+
uint16_t one_ago = 0;
234+
uint16_t two_ago = 0;
235+
uint16_t expected_out = 0;
236+
237+
std::vector<uint16_t> sample_buffer(N, 0);
238+
uint32_t acc = 0; // using 32 bits to prevent overflow
239+
240+
// current index pointing to the oldest sample
241+
int oldest_index = 0;
242+
243+
// simulation loop
244+
while (sim_time < SIM_TIME) {
245+
// invert clock
246+
top->clk = !top->clk;
247+
248+
// allow time for reset
249+
if (sim_time > 4)
250+
top->reset = 0;
251+
252+
// the real logic for the rising edge of the clock
253+
if (top->clk) {
254+
// pseudorandom input price
255+
uint16_t input_data = (3 * sim_time + 2) % 8031;
256+
top->d_in = input_data; // assert d_in with that price
257+
258+
// remake the logic in C++
259+
if (!top->reset) {
260+
acc -= sample_buffer[oldest_index];
261+
acc += input_data;
262+
sample_buffer[oldest_index] = input_data;
263+
264+
oldest_index = (oldest_index + 1) % N;
265+
266+
// mock the delay
267+
two_ago = one_ago;
268+
one_ago = expected_out;
269+
expected_out = acc >> Exponent;
270+
271+
// compare and err if bad
272+
// if the sim ends without error, we passed
273+
if (top->d_out != two_ago) {
274+
std::cerr << "Mismatch on simtime " << sim_time << ": expected "
275+
<< two_ago << ", got " << top->d_out << std::endl;
276+
return EXIT_FAILURE;
277+
}
278+
}
279+
}
280+
281+
top->eval();
282+
283+
284+
sim_time++;
285+
}
286+
287+
top->final();
288+
289+
// cleanup
290+
delete top;
291+
292+
// test passed
293+
std::cout << "Simulation completed successfully." << std::endl;
294+
return EXIT_SUCCESS;
295+
}
296+
```
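
One way to see where the delay being mocked comes from is to model the core's registers directly. The sketch below is a hypothetical cycle-level model in C++ (the `CoreModel` struct is made up for this example, with `Exponent` fixed at 3): every right-hand side reads the state from before the edge, the way non-blocking assignments do, so `d_out` is computed from the previous value of `acc`.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical cycle-level model of the core with Exponent = 3.
// Names are illustrative and not part of the project.
struct CoreModel {
    static constexpr unsigned kExponent = 3;
    static constexpr std::size_t kWindow = std::size_t{1} << kExponent;

    std::array<std::uint16_t, kWindow> sample_buffer{};
    std::uint32_t acc = 0;
    std::uint16_t d_out = 0;
    std::size_t oldest_index = 0;

    // One rising clock edge. All reads see the pre-edge state.
    void posedge(std::uint16_t d_in) {
        const std::uint32_t next_acc = acc - sample_buffer[oldest_index] + d_in;
        d_out = static_cast<std::uint16_t>(acc >> kExponent);  // uses the old acc
        sample_buffer[oldest_index] = d_in;
        oldest_index = (oldest_index + 1) % kWindow;
        acc = next_acc;
    }
};
```

After feeding eight samples of value 16, `acc` in this model holds 128, but `d_out` still reflects the sum of the first seven samples; it only reaches 16 on the following edge.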
297+
298+
## Makefile
299+
300+
In order to run this locally and in CI, we'll use this `Makefile`:
301+
```make
302+
# Top-level module name
303+
TOP_MODULE = moving_average_accumulator
304+
305+
# Verilog source files
306+
VERILOG_SOURCES = $(TOP_MODULE).sv
307+
308+
# C++ testbench file
309+
TESTBENCH = tb.cpp
310+
311+
# Output directory
312+
OBJ_DIR = obj_dir
313+
314+
# Simulation executable
315+
SIM_EXE = $(OBJ_DIR)/V$(TOP_MODULE)
316+
317+
# Default target
318+
.PHONY: all
319+
all: run_simulation
320+
321+
# Compile Verilog and C++ sources
322+
$(SIM_EXE): $(VERILOG_SOURCES) $(TESTBENCH)
323+
	verilator --cc $(VERILOG_SOURCES) --exe $(TESTBENCH) --trace
324+
	$(MAKE) -j -C $(OBJ_DIR) -f V$(TOP_MODULE).mk V$(TOP_MODULE)
325+
326+
# Run the simulation
327+
.PHONY: run_simulation
328+
run_simulation: $(SIM_EXE)
329+
	./$(SIM_EXE)
330+
331+
# Clean up generated files
332+
.PHONY: clean
333+
clean:
334+
	rm -rf $(OBJ_DIR) waveform.vcd
335+
336+
# Phony target for CI testing (exits with non-zero status on failure)
337+
.PHONY: test
338+
test: run_simulation
339+
```
340+
341+
## Running it
342+
343+
That's all the code. Let's run it. If you have the dependencies set up, run `make test`.
