Commit 9fc95b8

Refactor readback mux implementation. Improves performance (#155) and eliminates illegal streaming operator usage (#165)
1 parent 4201ce9 commit 9fc95b8

24 files changed, +1118 -636 lines changed

docs/architecture.rst

Lines changed: 3 additions & 6 deletions
@@ -38,18 +38,15 @@ This section also assigns any hardware interface outputs.
 
 Readback
 --------
-The readback layer aggregates and reduces all readable registers into a single
-read response. During a read operation, the same address decode strobes are used
-to select the active register that is being accessed.
-This allows for a simple OR-reduction operation to be used to compute the read
-data response.
+The readback layer aggregates and MUXes all readable registers into a single
+read response.
 
 For designs with a large number of software-readable registers, an optional
 fanin re-timing stage can be enabled. This stage is automatically inserted at a
 balanced point in the read-data reduction so that fanin and logic-levels are
 optimally reduced.
 
-.. figure:: diagrams/readback.png
+.. figure:: diagrams/rt-readback-fanin.png
     :width: 65%
     :align: center

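The wording change above (from "OR-reduction" to "MUXes") changes how the read data is selected, not what is returned. As a rough behavioral sketch in Python (a model only, not the generated RTL), the two strategies can be compared on a hypothetical 64-register block. The function names, the register count, and the data values are all assumptions made for illustration.

```python
from typing import List

# Behavioral model only; not the generated RTL. All names are hypothetical.


def decode_strobes(addr: int, num_regs: int) -> List[bool]:
    """One-hot address decode: strobe i is set when register i is addressed."""
    return [addr == i for i in range(num_regs)]


def readback_or_reduce(addr: int, data: List[int]) -> int:
    """Old strategy: mask every register with its strobe, then OR everything."""
    strobes = decode_strobes(addr, len(data))
    result = 0
    for strobe, value in zip(strobes, data):
        result |= value if strobe else 0
    return result


def readback_mux(addr: int, data: List[int]) -> int:
    """New strategy: default to 0, then select by comparing the address."""
    result = 0
    for i, value in enumerate(data):
        if i == addr:
            result = value
    return result


# Arbitrary 8-bit register contents for the comparison.
data = [(i * 37 + 5) & 0xFF for i in range(64)]

# Sweep addresses past the end of the block as well; both models return 0 there.
mismatches = [a for a in range(128)
              if readback_or_reduce(a, data) != readback_mux(a, data)]
print(mismatches)
```

Both models return the addressed register's value and fall back to 0 on a missed read, so the sweep should report no mismatches. Any difference between the strategies lies in the logic that synthesis infers, which a model like this cannot show.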
docs/dev_notes/Alpha-Beta Versioning

Lines changed: 0 additions & 10 deletions
This file was deleted.
Lines changed: 87 additions & 32 deletions
@@ -1,35 +1,84 @@
 --------------------------------------------------------------------------------
 Readback mux layer
 --------------------------------------------------------------------------------
+Use a large always_comb block + many if statements that select the read data
+based on the cpuif address.
+Loops are handled the same way as address decode.
 
-Implementation:
-    - Big always_comb block
-    - Initialize default rd_data value
-    - Lotsa if statements that operate on reg strb to assign rd_data
-        - Merges all fields together into reg
-        - pulls value from storage element struct, or input struct
-    - Provision for optional flop stage?
-
-Mux Strategy:
-    Flat case statement:
-        -- Cant parameterize
-        + better performance?
-
-    Flat 1-hot array then OR reduce:
-        - Create a bus-wide flat array
-            eg: 32-bits x N readable registers
-        - Assign each element:
-            the readback value of each register
-            ... masked by the register's access strobe
-        - I could also stuff an extra bit into the array that denotes the read is valid
-            A missed read will OR reduce down to a 0
-        - Finally, OR reduce all the elements in the array down to a flat 32-bit bus
-        - Retiming the large OR fanin can be done by chopping up the array into stages
-            for 2 stages, sqrt(N) gives each stage's fanin size. Round to favor
-            more fanin on 2nd stage
-            3 stages uses cube-root. etc...
-        - This has the benefit of re-using the address decode logic.
-            synth can choose to replicate logic if fanout is bad
+Other options that were considered:
+    - Flat case statement
+        con: Difficult to represent arrays. Essentially requires unrolling
+        con: complicates retiming strategies
+        con: Representing a range (required for externals) is cumbersome. Possible with stacked casez wildcards.
+    - AND field data with strobe, then massive OR reduce
+        This was the strategy prior to v1.3, but turned out to infer more overhead
+        than originally anticipated
+    - Assigning data to a flat register array, then directly indexing via address
+        con: Would work fine, but scales poorly for sparse regblocks.
+            Namely, simulators would likely allocate memory for the entire array
+    - Assign to a flat array that is packed sequentially, then directly indexing using a derived packed index
+        Concern that for sparse regfiles, the translation of addr --> packed index
+        becomes a nontrivial logic function
+
+Pros:
+    - Scales well for arrays since loops can be used
+    - Externals work well, as address ranges can be compared
+    - Synthesis results show more efficient logic inference
+
+Example:
+    logic [7:0] out;
+    always_comb begin
+        out = '0;
+        for(int i=0; i<64; i++) begin
+            if(i == addr) out = data[i];
+        end
+    end
+
+
+How to implement retiming:
+    Ideally this would partition the design into several equal sub-regions, but
+    with loop structures, this is pretty difficult..
+    What if instead, it is partitioned into equal address ranges?
+
+    First stage compares the lower-half of the address bits.
+    Values are assigned to the appropriate output "bin"
+
+    logic [7:0] out[8];
+    always_comb begin
+        for(int i=0; i<8; i++) out[i] = '0;
+
+        for(int i=0; i<64; i++) begin
+            automatic bit [5:0] this_addr = i;
+
+            if(this_addr[2:0] == addr[2:0]) out[this_addr[5:3]] = data[i];
+        end
+    end
+
+    (not showing retiming ff for `out` and `addr`)
+    The second stage muxes down the resulting bins using the high address bits.
+    If the user up-sizes the address bits, need to check the upper bits to prevent aliasing
+    Assuming min address bit range is [5:0], but it was padded up to [8:0], do the following:
+
+    logic [7:0] rd_data;
+    always_comb begin
+        if(addr[8:6] != '0) begin
+            // Invalid read range
+            rd_data = '0;
+        end else begin
+            rd_data = out[addr[5:3]];
+        end
+    end
+
+Retiming with external blocks
+    One minor downside is the above scheme does not work well for external blocks
+    that span a range of addresses. Depending on the range, it may span multiple
+    retiming bins which complicates how this would be assigned cleanly.
+    This would be complicated even further with arrays of externals since the
+    span of bins could change depending on the iteration.
+
+    Since externals can already be retimed, and large fanin of external blocks
+    is likely less of a concern, implement these as a separate readback mux on
+    the side that does not get retimed at all.
 
 
 WARNING:
@@ -42,8 +91,14 @@ WARNING:
 
 Forwards response strobe back up to cpu interface layer
 
-TODO:
-    Dont forget about alias registers here
 
-TODO:
-    Does the endinness the user sets matter anywhere?
+Variables:
+    From decode:
+        decoded_addr
+        decoded_req
+        decoded_req_is_wr
+
+    Response:
+        readback_done
+        readback_err
+        readback_data
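
The two-stage retiming scheme in the dev notes above (bin by the low address bits, then select a bin by the high bits, with an upper-bit check against aliasing) can be sanity-checked against the flat mux with a small Python model. This is a sketch under stated assumptions, not the generated RTL. It reuses the 64-entry, [5:0]-padded-to-[8:0] example from the notes, omits the retiming flops entirely, and every function and constant name in it is hypothetical.

```python
from typing import List

NUM_REGS = 64   # matches the 64-entry example in the notes
LOW_BITS = 3    # addr[2:0], compared in the first stage
BIN_BITS = 3    # addr[5:3], selects one of 8 bins in the second stage
ADDR_BITS = 9   # address padded up to [8:0], as in the notes


def flat_mux(addr: int, data: List[int]) -> int:
    """Reference: single-stage mux, 0 on a missed read."""
    out = 0
    for i in range(NUM_REGS):
        if i == addr:
            out = data[i]
    return out


def stage1_bins(addr: int, data: List[int]) -> List[int]:
    """First stage: compare only the low address bits and fill the bins."""
    low_mask = (1 << LOW_BITS) - 1
    out = [0] * (1 << BIN_BITS)
    for i in range(NUM_REGS):
        if (i & low_mask) == (addr & low_mask):
            out[i >> LOW_BITS] = data[i]
    return out


def stage2_select(addr: int, bins: List[int], check_upper: bool = True) -> int:
    """Second stage: pick a bin using the high bits; reject padded-range reads."""
    if check_upper and (addr >> (LOW_BITS + BIN_BITS)) != 0:
        return 0  # invalid read range
    return bins[(addr >> LOW_BITS) & ((1 << BIN_BITS) - 1)]


def two_stage_mux(addr: int, data: List[int], check_upper: bool = True) -> int:
    return stage2_select(addr, stage1_bins(addr, data), check_upper)


# Arbitrary nonzero 8-bit register contents.
data = [(i * 29 + 1) & 0xFF for i in range(NUM_REGS)]

mismatches = [a for a in range(1 << ADDR_BITS)
              if two_stage_mux(a, data) != flat_mux(a, data)]
print(mismatches)

# Without the upper-bit check, address 69 (64 + 5) aliases onto register 5.
print(two_stage_mux(69, data, check_upper=False) == data[5])
```

With the upper-bit check in place, the two-stage model agrees with the flat mux over the whole padded address range. Dropping the check lets out-of-range addresses alias onto valid registers, which is the hazard the notes call out.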

docs/diagrams/diagrams.odg

-1.46 KB
Binary file not shown.

docs/diagrams/readback.png

-88.6 KB
Binary file not shown.
