--------------------------------------------------------------------------------
Readback mux layer
--------------------------------------------------------------------------------
+ Use a large always_comb block + many if statements that select the read data
+ based on the cpuif address.
+ Loops are handled the same way as address decode.

- Implementation:
-     - Big always_comb block
-     - Initialize default rd_data value
-     - Lotsa if statements that operate on reg strb to assign rd_data
-     - Merges all fields together into reg
-     - pulls value from storage element struct, or input struct
-     - Provision for optional flop stage?
-
- Mux Strategy:
-     Flat case statement:
-         -- Cant parameterize
-         + better performance?
-
-     Flat 1-hot array then OR reduce:
-         - Create a bus-wide flat array
-             eg: 32-bits x N readable registers
-         - Assign each element:
-             the readback value of each register
-             ... masked by the register's access strobe
-         - I could also stuff an extra bit into the array that denotes the read is valid
-             A missed read will OR reduce down to a 0
-         - Finally, OR reduce all the elements in the array down to a flat 32-bit bus
-         - Retiming the large OR fanin can be done by chopping up the array into stages
-             for 2 stages, sqrt(N) gives each stage's fanin size. Round to favor
-             more fanin on 2nd stage
-             3 stages uses cube-root. etc...
-         - This has the benefit of re-using the address decode logic.
-             synth can choose to replicate logic if fanout is bad
+ Other options that were considered:
+     - Flat case statement
+         con: Difficult to represent arrays. Essentially requires unrolling
+         con: complicates retiming strategies
+         con: Representing a range (required for externals) is cumbersome. Possible with stacked casez wildcards.
+     - AND field data with strobe, then massive OR reduce
+         This was the strategy prior to v1.3, but turned out to infer more overhead
+         than originally anticipated
+     - Assigning data to a flat register array, then directly indexing via address
+         con: Would work fine, but scales poorly for sparse regblocks.
+             Namely, simulators would likely allocate memory for the entire array
+     - Assign to a flat array that is packed sequentially, then directly indexing using a derived packed index
+         Concern that for sparse regfiles, the translation of addr --> packed index
+         becomes a nontrivial logic function
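For comparison with the loop-based approach, the "AND field data with strobe, then OR reduce" option listed above could look roughly like the sketch below. The module name `readback_or_reduce`, the parameter `N`, the 8-bit data width, and the `rd_strb` and `data` ports are assumptions made for illustration and are not taken from the generated regblock.

```systemverilog
// Hypothetical sketch of the AND-then-OR-reduce readback strategy.
// All names and widths here are illustrative assumptions.
module readback_or_reduce #(
    parameter int N = 4                  // number of readable registers
) (
    input  logic [N-1:0] rd_strb,        // one-hot read strobe per register
    input  logic [7:0]   data [N],       // readback value of each register
    output logic [7:0]   rd_data
);
    always_comb begin
        rd_data = '0;
        for (int i = 0; i < N; i++) begin
            // Mask each value by its strobe, then OR it into the result.
            rd_data |= data[i] & {8{rd_strb[i]}};
        end
    end
endmodule
```

If no strobe is asserted, every term is masked off and `rd_data` reduces to 0, which matches the "missed read" behavior described in the removed notes.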
+
+ Pros:
+     - Scales well for arrays since loops can be used
+     - Externals work well, as address ranges can be compared
+     - Synthesis results show more efficient logic inference
+
+ Example:
+     logic [7:0] out;
+     always_comb begin
+         out = '0;
+         for(int i=0; i<64; i++) begin
+             if(i == addr) out = data[i];
+         end
+     end
+
+
+ How to implement retiming:
+     Ideally this would partition the design into several equal sub-regions, but
+     with loop structures, this is pretty difficult.
+     What if instead, it is partitioned into equal address ranges?
+
+     First stage compares the lower-half of the address bits.
+     Values are assigned to the appropriate output "bin"
+
+         logic [7:0] out[8];
+         always_comb begin
+             for(int i=0; i<8; i++) out[i] = '0;
+
+             for(int i=0; i<64; i++) begin
+                 automatic bit [5:0] this_addr = i;
+
+                 if(this_addr[2:0] == addr[2:0]) out[this_addr[5:3]] = data[i];
+             end
+         end
+
+     (not showing retiming ff for `out` and `addr`)
+     The second stage muxes down the resulting bins using the high address bits.
+     If the user up-sizes the address bits, need to check the upper bits to prevent aliasing
+     Assuming min address bit range is [5:0], but it was padded up to [8:0], do the following:
+
+         logic [7:0] rd_data;
+         always_comb begin
+             if(addr[8:6] != '0) begin
+                 // Invalid read range
+                 rd_data = '0;
+             end else begin
+                 rd_data = out[addr[5:3]];
+             end
+         end
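The two stages above omit the retiming flops. A minimal sketch of how they might fit together is shown below, assuming a single flop stage between the bins and the final mux. The module name `readback_2stage`, the `clk` port, and the `out_q`/`addr_q` register names are hypothetical; the 64-entry `data` array and the [8:0] padded address follow the example in the notes.

```systemverilog
// Hypothetical sketch: two-stage readback mux with one retiming flop stage.
// Module, port, and register names are illustrative assumptions.
module readback_2stage (
    input  logic       clk,
    input  logic [8:0] addr,       // padded address; only [5:0] is populated
    input  logic [7:0] data [64],  // readback value of each register
    output logic [7:0] rd_data
);
    // Stage 1: compare the low address bits and sort values into bins.
    logic [7:0] out [8];
    always_comb begin
        for (int i = 0; i < 8; i++) out[i] = '0;

        for (int i = 0; i < 64; i++) begin
            automatic bit [5:0] this_addr = i;

            if (this_addr[2:0] == addr[2:0]) out[this_addr[5:3]] = data[i];
        end
    end

    // Retiming flops for `out` and `addr`.
    logic [7:0] out_q [8];
    logic [8:0] addr_q;
    always_ff @(posedge clk) begin
        out_q  <= out;
        addr_q <= addr;
    end

    // Stage 2: select a bin using the high address bits.
    always_comb begin
        if (addr_q[8:6] != '0) begin
            // Invalid read range: padded upper bits must be zero
            rd_data = '0;
        end else begin
            rd_data = out_q[addr_q[5:3]];
        end
    end
endmodule
```

The second stage uses the registered address so that the bin select stays aligned with the registered bins.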
+
+ Retiming with external blocks
+     One minor downside is the above scheme does not work well for external blocks
+     that span a range of addresses. Depending on the range, it may span multiple
+     retiming bins which complicates how this would be assigned cleanly.
+     This would be complicated even further with arrays of externals since the
+     span of bins could change depending on the iteration.
+
+     Since externals can already be retimed, and large fanin of external blocks
+     is likely less of a concern, implement these as a separate readback mux on
+     the side that does not get retimed at all.
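One way the separate, non-retimed readback mux for externals might look is sketched below, using an address range compare per array iteration. The module name `ext_readback_mux`, the parameters `N_EXT`, `BASE`, and `STRIDE`, and all port names are hypothetical. How this result is merged with the retimed internal path is not specified in the notes and is not shown.

```systemverilog
// Hypothetical sketch: side readback mux for an array of external blocks.
// Names, parameter values, and widths are illustrative assumptions.
module ext_readback_mux #(
    parameter int N_EXT  = 2,      // number of external blocks in the array
    parameter int BASE   = 'h100,  // address of the first external block
    parameter int STRIDE = 'h40    // address span of each external block
) (
    input  logic [8:0] addr,
    input  logic [7:0] ext_rd_data [N_EXT],  // read data from each external block
    output logic       ext_hit,              // address falls inside an external range
    output logic [7:0] ext_data
);
    always_comb begin
        ext_hit  = 1'b0;
        ext_data = '0;
        for (int i = 0; i < N_EXT; i++) begin
            // Range compare for this iteration of the external array.
            if ((addr >= BASE + i*STRIDE) && (addr < BASE + (i+1)*STRIDE)) begin
                ext_hit  = 1'b1;
                ext_data = ext_rd_data[i];
            end
        end
    end
endmodule
```

Because each iteration compares a full address range rather than a bin index, the span of an external block does not need to line up with the retiming bins.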


WARNING:
@@ -42,8 +91,14 @@ WARNING:

Forwards response strobe back up to cpu interface layer

- TODO:
-     Dont forget about alias registers here

- TODO:
-     Does the endinness the user sets matter anywhere?
+ Variables:
+     From decode:
+         decoded_addr
+         decoded_req
+         decoded_req_is_wr
+
+     Response:
+         readback_done
+         readback_err
+         readback_data