| 1 | +--- |
| 2 | +title: Building a moving average accumulator core |
| 3 | +date: 02-24-2025 |
| 4 | +--- |
| 5 | + |
| 6 | +This post guides readers from a blank folder to a working and tested moving average accumulator gateware circuit. I'll first describe the intuition for the algorithm, then implement it in SystemVerilog, then write a Verilator testbench in C++. |
| 7 | + |
| 8 | +Exactly what we're building is a clocked circuit that takes in a price and returns the average price over the last $k$ samples. We'll parameterize $k$ and the bit width of the price, $n$, using the SystemVerilog `parameter` keyword. |
| 9 | + |
| 10 | +## Setup |
| 11 | + |
| 12 | +This post is a coding exercise. Create a new folder with nothing in it. |
| 13 | + |
| 14 | +You'll need these dependencies: |
| 15 | +- Verilator |
| 16 | +- A C++ compiler |
| 17 | +- Make |
| 18 | + |
| 19 | +Our project uses Nix for this. We recommend our [[https://github.com/raquentin/punt-engine/blob/main/flake.nix| Nix Flake developer environment]], which gives you the exact versions our project builds with. You can delete all the Haskell/Clash stuff; we're just working with SystemVerilog today. Otherwise, just install the dependencies yourself. |
| 20 | + |
| 21 | +Create these files: |
| 22 | +- `Makefile`: We'll use `make` to wrap the more complex Verilator commands. |
| 23 | +- `mov_avg_acc.sv`: This is the moving average accumulator. We'll implement it in SystemVerilog. |
| 24 | +- `testbench.cpp`: We compare the output of a software model to that of our accumulator core to ensure it works. |
| 25 | + |
| 26 | +Now you're set up. Let's begin. |
| 27 | + |
| 28 | +## Intuition |
| 29 | + |
| 30 | +Consider the simplest interface to an accumulator circuit. Like any clocked circuit, we'll have an `input wire clk`. Likewise, we'll have an `input wire reset`. |
| 31 | + |
| 32 | +That's the foundation, but what is the real i/o for this circuit? We take in a stream of prices. What we're calling a price is a value represented by a bus of $n$ wires. These wires together allow the representation of $2^n$ distinct values, from $0$ to $2^n - 1$. In our case, these values are prices. We'll have an `input wire unsigned d_in`, which holds the price that we are to sample on the next rising clock edge. We don't want to lock users of this circuit to only a certain $n$ value; our core should be agnostic to the bit depth of the price of the asset that it is accumulating. So we'll abstract away this $n$ using a parameter called `DataWidth`. |
| 33 | + |
| 34 | +With these inputs, we perform some logic and produce an output. This output is going to be a price, so it will be the same type as the `d_in` wire. |
| 35 | + |
| 36 | +Here's what that looks like in `mov_avg_acc.sv`: |
| 37 | +```v |
| 38 | +module moving_average_accumulator #( |
| 39 | + parameter integer Exponent /*verilator public*/ = 3, |
| 40 | + parameter integer DataWidth /*verilator public*/ = 16 |
| 41 | +) ( |
| 42 | + input wire clk, |
| 43 | + input wire reset, |
| 44 | + input wire unsigned [DataWidth-1:0] d_in, |
| 45 | + output reg unsigned [DataWidth-1:0] d_out |
| 46 | +); |
| 47 | +``` |
| 48 | + |
| 49 | +Let's break down this syntax a bit: |
| 50 | +- `module moving_average_accumulator` defines a reusable hardware component called `moving_average_accumulator`. Other modules can embed this module, linking their wires to the inputs and outputs of our core. |
| 51 | +- Next is a list bound by `#(...)`. This is the parameter list, where constants like `Exponent` ($\log_2(k)$) and `DataWidth` ($n$) go. When instantiating our core, we can pass these parameters in without having to go in and change the internal RTL of our circuit. |
| 52 | +- Last is the list bound by `(...)`. Within this list are all i/o. In our case, we have the three inputs and one output described above. |
| 53 | + |
| 54 | +### Circular buffers |
| 55 | + |
| 56 | +Let's ignore the interface described above and generalize the moving average problem a bit. We have a stream of prices over time, and want to maintain the average over the last $k$ of them. Naively, computing an average involves summing up all samples in a set and dividing that sum by the size of the set. |
| 57 | + |
| 58 | +Consider an implementation of the circuit that maintains a stack-like array structure of infinite size. On each clock edge, we push the current price (the value of `d_in`) on top of this stack. Then we sum up the top $k$ elements on the stack, divide that sum by $k$, and put that result into the register `d_out`. |
| 59 | + |
| 60 | +The problem with this is that we don't have infinite memory. So instead of an infinitely large stack, we'll maintain an array of size $k$ and an index of the oldest element in that array, called `oldest_index`. Since we're only concerned with the $k$ most recent samples, on a given clock edge the price in the array at `oldest_index` was sampled $k$ samples ago and hence does not impact the value we push to `d_out`. So on this new edge, we can just overwrite `array[oldest_index]` with `d_in`, fixing the infinite memory problem. |
| 61 | + |
| 62 | +So now we have an array in our circuit with the $k$ most recent samples. On each cycle, we could sum up the array in $O(k)$ time, then divide by $k$ and be done. That's a bit naive, though. We can update the sum in $O(1)$ time by realizing that on a given cycle, the sum increases by `d_in` and decreases by `array[oldest_index]`, AKA the price from $k$ samples ago. So instead of summing up the whole array, we just subtract the oldest sample and add the newest one, `d_in`. Then we divide by $k$. |
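
Before moving to hardware, the running-sum idea can be sketched as a small software model. This is only an illustration: the class name `MovingAverage` and its `push` method are made up for this sketch, and it divides with `/` rather than the shift discussed next.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Software model of the circular-buffer moving average (illustrative names).
class MovingAverage {
public:
    explicit MovingAverage(std::size_t k) : buffer_(k, 0) { assert(k > 0); }

    // Push one sample and return the average over the last k samples.
    // Slots that have not been written yet count as zero.
    std::uint32_t push(std::uint16_t sample) {
        sum_ -= buffer_[oldest_index_];   // drop the sample from k pushes ago
        sum_ += sample;                   // add the newest sample
        buffer_[oldest_index_] = sample;  // overwrite the oldest slot
        oldest_index_ = (oldest_index_ + 1) % buffer_.size();
        return static_cast<std::uint32_t>(sum_ / buffer_.size());
    }

private:
    std::vector<std::uint16_t> buffer_;  // the k most recent samples
    std::size_t oldest_index_ = 0;       // slot that will be overwritten next
    std::uint64_t sum_ = 0;              // running sum of buffer_
};
```

Each `push` does a constant amount of work regardless of $k$. Unlike the clocked circuit, this model returns the updated average immediately instead of one or more clock edges later.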
| 63 | + |
| 64 | +### An optimization for $k = 2^n$. |
| 65 | + |
| 66 | +There's one last optimization. Imagine that in the circuit above $k$ is 10; we're calculating the average over the last 10 samples. Dividing by 10 in hardware is slow and expensive. We either use a floating point unit and deal with fractions of the price, or use an integer division algorithm. Look up hardware division algorithms if you care to learn more, but the point here is that they are slow and we want to avoid them when possible. |
| 67 | + |
| 68 | +When $k = 2^n$ (for some $n \in \mathbb{N}$, not the $n$ from before related to bit depth), we can use the right shift operation to divide the sum by $k$ in less than one clock cycle. Instead of performing some series of operations to implement a general division algorithm, we essentially just ignore the least significant $\log_2 k$ bits of the sum. The result is the average rounded down to an integer. |
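
As a quick sanity check of that claim, here is a minimal C++ sketch comparing the shift against ordinary integer division. The helper names `divide_by_pow2` and `shift_matches_division` are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <cstdint>

// Divide an unsigned sum by k = 2^exponent by dropping its low `exponent` bits.
// Hypothetical helper for illustration; not part of the core.
std::uint32_t divide_by_pow2(std::uint32_t sum, unsigned exponent) {
    assert(exponent < 32);  // shifting a 32-bit value by 32 or more is undefined
    return sum >> exponent;
}

// Check that the shift agrees with integer division for every sum in [0, max_sum].
bool shift_matches_division(std::uint32_t max_sum, unsigned exponent) {
    const std::uint32_t k = std::uint32_t{1} << exponent;
    for (std::uint64_t sum = 0; sum <= max_sum; ++sum) {
        const std::uint32_t s = static_cast<std::uint32_t>(sum);
        if (divide_by_pow2(s, exponent) != s / k) return false;
    }
    return true;
}
```

For unsigned values the two always agree, since both round toward zero. In hardware the shift amount is a constant, so no shifter is needed at all: the circuit simply wires the upper bits of the sum to the output.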
| 69 | + |
| 70 | +## Implementation |
| 71 | + |
| 72 | +You've learned the algorithm. Let's implement it. There are a few things to note before we get into the full code: |
| 73 | + |
| 74 | +### The bit width of the sum of `array` $\neq n$ |
| 75 | + |
| 76 | +Recall that the `array` consists of $k$ elements of size $n$ (`[DataWidth-1:0]`). In the calculation of `d_out`, we can't simply store this sum in an $n$-bit register because it would overflow. The sum of $k$ values that are each at most $2^n - 1$ is at most $k(2^n - 1)$, which needs $n + \log_2 k$ bits. So the accumulator `acc` in our circuit is `AccWidth = DataWidth + Exponent` bits wide: |
| 77 | +```v |
| 78 | +reg unsigned [AccWidth-1:0] acc; |
| 79 | +``` |
| 80 | +Note that this is not a `reg` in the circuit's i/o list; it's a local/intermediate `reg`. |
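
The width requirement can be checked numerically with a short C++ sketch. The functions `max_sum` and `bits_needed` are illustrative names, and the sketch assumes unsigned samples with $k = 2^{\text{exponent}}$.

```cpp
#include <cassert>
#include <cstdint>

// Largest possible sum of k = 2^exponent unsigned samples of data_width bits.
std::uint64_t max_sum(unsigned data_width, unsigned exponent) {
    assert(data_width + exponent < 64);  // keep the result inside 64 bits
    const std::uint64_t max_sample = (std::uint64_t{1} << data_width) - 1;
    return max_sample << exponent;  // k * max_sample
}

// Number of bits needed to represent `value` (0 is reported as 0 bits).
unsigned bits_needed(std::uint64_t value) {
    unsigned bits = 0;
    while (value != 0) {
        ++bits;
        value >>= 1;
    }
    return bits;
}
```

With the defaults used in this post (`DataWidth = 16`, `Exponent = 3`), `max_sum(16, 3)` is $8 \times 65535 = 524280$, and `bits_needed` reports 19 bits for it, which is `DataWidth + Exponent`.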
| 81 | + |
| 82 | +### `localparam` |
| 83 | + |
| 84 | +For convenience, we have: |
| 85 | +```v |
| 86 | +localparam integer K = 1 << Exponent; |
| 87 | +localparam integer AccWidth = DataWidth + Exponent; |
| 88 | +``` |
| 89 | + |
| 90 | +`K` is the aforementioned $k$, uppercase to match formatting standards. `AccWidth` is the bit depth of the `acc` reg, which is the numerator in the average calculation, AKA the sum of the elements of `array`. It's of depth $n + \log_2 k$. |
| 91 | + |
| 92 | +The `localparam` keyword is a bit like C's `#define`, except that it is scoped to the module and can't be overridden from outside. We set up these aliases to avoid pasting those expressions all over the place. |
| 93 | + |
| 94 | +### `for` loops in hardware description languages |
| 95 | + |
| 96 | +In the code below, you'll see a loop that resets the elements of the `array` of sample history, here named `sample_buffer`: |
| 97 | +```v |
| 98 | +for (i = 0; i < K; i++) begin |
| 99 | + sample_buffer[i] <= '0; |
| 100 | +end |
| 101 | +``` |
| 102 | + |
| 103 | +Remember that we're describing hardware, not writing software. `for` here is not the imperative `for` loop that it is in software. Because the bounds are constants, the synthesis tool unrolls the loop at elaboration time into one assignment per element of the sample buffer, so we don't have to copy-paste that assignment for each hardcoded `i` up to `K`. This loop does NOT perform some type of iterative operation at runtime. |
| 104 | + |
| 105 | + |
| 106 | +### Zero extension |
| 107 | + |
| 108 | +Recall the operation where we subtract out the oldest sample and add in the newest. Here's the code for that: |
| 109 | +```verilog |
| 110 | +acc <= acc |
| 111 | + - { |
| 112 | + {(AccWidth - DataWidth){1'b0}}, |
| 113 | + sample_buffer[oldest_index] |
| 114 | + } |
| 115 | + + {{(AccWidth - DataWidth){1'b0}}, d_in}; |
| 116 | +``` |
| 117 | + |
| 118 | +In short, we're setting `acc <= acc - subtrahend + addend`. |
| 119 | + |
| 120 | +In the subtrahend, `{(AccWidth - DataWidth){1'b0}}` builds the padding for a zero-extension. It's a pattern to extend `sample_buffer[oldest_index]` to the size of the larger `acc` so that they can be subtracted: it replicates a `0` bit `AccWidth - DataWidth` times, then concatenates that with the oldest sample itself. Since all these values are unsigned, we pad with zeros rather than replicating the most significant bit (a sign-extension), which would corrupt the sum for any price with its top bit set. |
| 121 | + |
| 122 | +Similarly, `{{(AccWidth - DataWidth){1'b0}}, d_in}` expands `d_in` to the width of `acc`. |
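
To see why the choice of extension matters for unsigned prices, here is a small software model that widens each sample both ways and compares the resulting averages. It assumes `DataWidth = 16` and `Exponent = 3`; the names `zero_extend`, `msb_extend` and `average` are made up for this sketch.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned kDataWidth = 16;
constexpr unsigned kExponent = 3;
constexpr unsigned kAccWidth = kDataWidth + kExponent;
constexpr std::uint32_t kAccMask = (1u << kAccWidth) - 1;  // low AccWidth bits

// Widen a sample to AccWidth bits by padding the upper bits with zeros.
std::uint32_t zero_extend(std::uint16_t x) { return x; }

// Widen a sample to AccWidth bits by replicating its MSB into the upper bits.
std::uint32_t msb_extend(std::uint16_t x) {
    const std::uint32_t upper = (x >> (kDataWidth - 1)) ? (kAccMask & ~0xFFFFu) : 0u;
    return upper | x;
}

// Sum eight samples in AccWidth-bit wrapping arithmetic, then shift by Exponent.
std::uint16_t average(const std::uint16_t (&samples)[8],
                      std::uint32_t (*extend)(std::uint16_t)) {
    std::uint32_t acc = 0;
    for (std::uint16_t s : samples) {
        acc = (acc + extend(s)) & kAccMask;  // wrap like an AccWidth-bit register
    }
    return static_cast<std::uint16_t>(acc >> kExponent);
}
```

For a window alternating between `0x8000` and `0`, the true unsigned average is `0x4000`. Zero-extension produces that value, while MSB replication produces `0xC000`. The two only agree while every sample stays below $2^{n-1}$, which is the case for the small prices the testbench later in this post generates.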
| 123 | + |
| 124 | +## The entire SystemVerilog core |
| 125 | + |
| 126 | +With the intuition and notes out of the way, we've almost accidentally built the whole circuit. It's pasted below with comments inline. |
| 127 | + |
| 128 | +```verilog |
| 129 | +module moving_average_accumulator #( |
| 130 | + // K = 2^Exponent, which forces the window size to be a power of two |
| 131 | + parameter integer Exponent /*verilator public*/ = 3, |
| 132 | + parameter integer DataWidth /*verilator public*/ = 16 // n |
| 133 | +) ( |
| 134 | + input wire clk, |
| 135 | + input wire reset, |
| 136 | + input wire unsigned [DataWidth-1:0] d_in, |
| 137 | + output reg unsigned [DataWidth-1:0] d_out |
| 138 | +); |
| 139 | +
|
| 140 | + localparam integer K = 1 << Exponent; |
| 141 | + localparam integer AccWidth = DataWidth + Exponent; |
| 142 | +
|
| 143 | + // holds sum of last `K` samples |
| 144 | + reg unsigned [AccWidth-1:0] acc; |
| 145 | +
|
| 146 | + // circular buffer storing last k samples |
| 147 | + reg unsigned [DataWidth-1:0] sample_buffer[K]; |
| 148 | +
|
| 149 | + // index of oldest element in `sample_buffer` |
| 150 | + integer oldest_index; |
| 151 | +
|
| 152 | + // loop |
| 153 | + integer i; |
| 154 | +
|
| 155 | + always_ff @(posedge clk or posedge reset) begin |
| 156 | + if (reset) begin |
| 157 | + acc <= '0; |
| 158 | + d_out <= '0; |
| 159 | +
|
| 160 | + // reset sample buffer |
| 161 | + for (i = 0; i < K; i++) begin |
| 162 | + sample_buffer[i] <= '0; |
| 163 | + end |
| 164 | +
|
| 165 | + oldest_index <= 0; |
| 166 | + end else begin |
| 167 | + // subtract oldest, add newest (zero-extended, since the values are unsigned) |
| 168 | + acc <= acc |
| 169 | + - { |
| 170 | + {(AccWidth - DataWidth){1'b0}}, |
| 171 | + sample_buffer[oldest_index] |
| 172 | + } |
| 173 | + + {{(AccWidth - DataWidth){1'b0}}, d_in}; |
| 174 | +
|
| 175 | + // overwrite oldest sample |
| 176 | + sample_buffer[oldest_index] <= d_in; |
| 177 | +
|
| 178 | + // inc oldest_index. wrap around using modulo K |
| 179 | + oldest_index <= (oldest_index + 1) % K; |
| 180 | +
|
| 181 | + // compute moving average by dividing by 2^exponent |
| 182 | + d_out <= acc[AccWidth-1:Exponent]; // like a right shift |
| 183 | + end |
| 184 | + end |
| 185 | +
|
| 186 | +endmodule |
| 187 | +``` |
| 188 | + |
| 189 | +## Testbench |
| 190 | + |
| 191 | +We have the gateware core completed; now let's test it. If we can assume some knowledge of C++, I can just explain the testbench in comments: |
| 192 | + |
| 193 | +```cpp |
| 194 | +// verilator compiles our mov_avg_acc.sv to cpp, that's the file below |
| 195 | +#include "Vmoving_average_accumulator.h" |
| 196 | +// we'll instantiate it in our function, run some inputs through it, and compare |
| 197 | + |
| 198 | +#include "verilated.h" |
| 199 | +#include "verilated_vcd_c.h" |
| 200 | +#include <cstdlib> |
| 201 | +#include <iostream> |
| 202 | +#include <vector> |
| 203 | + |
| 204 | +#define SIM_TIME 20 // Simulation time in clock cycles |
| 205 | + |
| 206 | +int main(int argc, char **argv) { |
| 207 | + // setup |
| 208 | + Verilated::commandArgs(argc, argv); |
| 209 | + // top is a C++ version of the SystemVerilog core we wrote |
| 210 | + Vmoving_average_accumulator *top = new Vmoving_average_accumulator; |
| 211 | + VerilatedVcdC *tfp = nullptr; |
| 212 | + vluint64_t sim_time = 0; |
| 213 | + |
| 214 | + // these are the input wires we had in our circuit. |
| 215 | + // we're setting them up |
| 216 | + // reset->1 sets up the sample_buffer |
| 217 | + top->clk = 0; |
| 218 | + top->reset = 1; |
| 219 | + top->d_in = 0; |
| 220 | + |
| 221 | + // what we're really doing from here down is setting up the "expected" values |
| 222 | + // that we'll compare to the output of our verilog circuit. |
| 223 | + // to get these expected values, we need to rewrite our circuit's logic in C++ |
| 224 | + |
| 225 | + // same constants from sv |
| 226 | + const int Exponent = 3; |
| 227 | + const int N = 1 << Exponent; |
| 228 | + const int DataWidth = 16; |
| 229 | + |
| 230 | + // our C++ version doesn't naturally respond to the clock |
| 231 | + // and hence doesn't have a delay in its output |
| 232 | + // we store the values like this to mock this delay |
| 233 | + uint16_t one_ago = 0; |
| 234 | + uint16_t two_ago = 0; |
| 235 | + uint16_t expected_out = 0; |
| 236 | + |
| 237 | + std::vector<uint16_t> sample_buffer(N, 0); |
| 238 | + uint32_t acc = 0; // using 32 bits to prevent overflow |
| 239 | + |
| 240 | + // current index pointing to the oldest sample |
| 241 | + int oldest_index = 0; |
| 242 | + |
| 243 | + // simulation loop |
| 244 | + while (sim_time < SIM_TIME) { |
| 245 | + // invert clock |
| 246 | + top->clk = !top->clk; |
| 247 | + |
| 248 | + // allow time for reset |
| 249 | + if (sim_time > 4) |
| 250 | + top->reset = 0; |
| 251 | + |
| 252 | + // the real logic for the rising edge of the clock |
| 253 | + if (top->clk) { |
| 254 | + // pseudorandom input price |
| 255 | + uint16_t input_data = (3 * sim_time + 2) % 8031; |
| 256 | + top->d_in = input_data; // assert d_in with that price |
| 257 | + |
| 258 | + // remake the logic in C++ |
| 259 | + if (!top->reset) { |
| 260 | + acc -= sample_buffer[oldest_index]; |
| 261 | + acc += input_data; |
| 262 | + sample_buffer[oldest_index] = input_data; |
| 263 | + |
| 264 | + oldest_index = (oldest_index + 1) % N; |
| 265 | + |
| 266 | + // mock the delay |
| 267 | + two_ago = one_ago; |
| 268 | + one_ago = expected_out; |
| 269 | + expected_out = acc >> Exponent; |
| 270 | + |
| 271 | + // compare and err if bad |
| 272 | + // if the sim ends without error, we passed |
| 273 | + if (top->d_out != two_ago) { |
| 274 | + std::cerr << "Mismatch on simtime " << sim_time << ": expected " |
| 275 | + << two_ago << ", got " << top->d_out << std::endl; |
| 276 | + return EXIT_FAILURE; |
| 277 | + } |
| 278 | + } |
| 279 | + } |
| 280 | + |
| 281 | + top->eval(); |
| 282 | + |
| 283 | + |
| 284 | + sim_time++; |
| 285 | + } |
| 286 | + |
| 287 | + top->final(); |
| 288 | + |
| 289 | + // cleanup |
| 290 | + delete top; |
| 291 | + |
| 292 | + // test passed |
| 293 | + std::cout << "Simulation completed successfully." << std::endl; |
| 294 | + return EXIT_SUCCESS; |
| 295 | +} |
| 296 | +``` |
| 297 | +
|
| 298 | +## Makefile |
| 299 | +
|
| 300 | +In order to run this locally and in CI, we'll use this `Makefile`: |
| 301 | +```make |
| 302 | +# Top-level module name |
| 303 | +TOP_MODULE = moving_average_accumulator |
| 304 | +
|
| 305 | +# Verilog source files |
| 306 | +VERILOG_SOURCES = mov_avg_acc.sv |
| 307 | +

| 308 | +# C++ testbench file |
| 309 | +TESTBENCH = testbench.cpp |
| 310 | +

| 311 | +# Output directory |
| 312 | +OBJ_DIR = obj_dir |
| 313 | +

| 314 | +# Simulation executable |
| 315 | +SIM_EXE = $(OBJ_DIR)/V$(TOP_MODULE) |
| 316 | +

| 317 | +# Default target |
| 318 | +.PHONY: all |
| 319 | +all: run_simulation |
| 320 | +

| 321 | +# Compile Verilog and C++ sources |
| 322 | +$(SIM_EXE): $(VERILOG_SOURCES) $(TESTBENCH) |
| 323 | + verilator --cc $(VERILOG_SOURCES) --top-module $(TOP_MODULE) --exe $(TESTBENCH) --trace |
| 324 | + $(MAKE) -j -C $(OBJ_DIR) -f V$(TOP_MODULE).mk V$(TOP_MODULE) |
| 325 | +
|
| 326 | +# Run the simulation |
| 327 | +.PHONY: run_simulation |
| 328 | +run_simulation: $(SIM_EXE) |
| 329 | + ./$(SIM_EXE) |
| 330 | +
|
| 331 | +# Clean up generated files |
| 332 | +.PHONY: clean |
| 333 | +clean: |
| 334 | + rm -rf $(OBJ_DIR) waveform.vcd |
| 335 | +
|
| 336 | +# Phony target for CI testing (exits with non-zero status on failure) |
| 337 | +.PHONY: test |
| 338 | +test: run_simulation |
| 339 | +``` |
| 340 | + |
| 341 | +## Running it |
| 342 | + |
| 343 | +That's all the code. Let's run it. If you have the dependencies set up, run `make test`. |