Skip to content

Conversation

@LorenzoR01
Copy link

This PR introduces support for speculative loads, allowing dual issue of a control flow instruction and a load instruction.
Speculative loads improve Coremark performance by up to 7% with no impact on the CVA6 timing.
We have successfully tested the code on Linux and Splash3 benchmarks.

@LorenzoR01 LorenzoR01 force-pushed the lr/spld-rebase branch 2 times, most recently from 340290c to 9bc67a2 Compare January 30, 2026 11:04
Copy link

@numero-744 numero-744 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution!

Unfortunately, I cannot review LSU's FSM as I am not familiar enough with it to understand the internal changes. I don't know enough about the TLB either.
Still, I'm wondering: in the execute stage, the branch unit answers immediately combinatorially, so do we need to go to a new state in LSU's FSM?

Is there a maximum frequency regression due to the ALU->BRANCH->LSU path, or is the logic of the FSM shallow enough?
How does the performance of this PR compare with the performance of the fast but wrong implementation where we only remove the stall condition from the issue stage?

Below are minor comments about the code.

Cc @Gchauvon

output logic commit_lsu_o,
// Commit buffer of LSU is ready - EX_STAGE
input logic commit_lsu_ready_i,
// Transaction id of first commit port - ID_STAGE

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment does not match the declaration anymore, we should update it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is not accurate anymore

Comment on lines 174 to 176
output logic [5:0] orig_instr_aes_bits
output logic [5:0] orig_instr_aes_bits,
// Signals speculative loads to LSU for non-idempotent load handling - EX_STAGE
output logic speculative_load_o

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

speculative_load_o should be inserted just after lsu_valid_o IMO

.rvfi_rs2_o (rvfi_rs2_o),
.orig_instr_aes_bits (orig_instr_aes_bits)
.orig_instr_aes_bits (orig_instr_aes_bits),
.speculative_load_o

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

speculative_load_o should be inserted just after lsu_valid_o IMO

core/cva6.sv Outdated
.rvfi_rs2_o (rvfi_rs2),
.orig_instr_aes_bits (orig_instr_aes)
.orig_instr_aes_bits (orig_instr_aes),
.speculative_load_o (speculative_load)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

speculative_load_o should be inserted just after lsu_valid_o IMO

core/cva6.sv Outdated
.lsu_commit_i (lsu_commit_commit_ex), // from commit
.lsu_commit_ready_o (lsu_commit_ready_ex_commit), // to commit
.commit_tran_id_i (lsu_commit_trans_id), // from commit
.speculative_load_i (speculative_load),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

speculative_load_i should be inserted just after lsu_valid_i IMO

It is connected to issue stage, not commit stage, right?

// Signals speculative loads for non-idempotent load handling - ISSUE_STAGE
input logic speculative_load_i,
// Result from branch unit - EX_STAGE
input bp_resolve_t resolved_branch_i,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this new input add a critical path inside the execute stage?

core/ex_stage.sv Outdated
.commit_i (lsu_commit_i),
.commit_ready_o (lsu_commit_ready_o),
.commit_tran_id_i,
.speculative_load_i,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

speculative_load_i should be inserted just after lsu_valid_i IMO

It is connected to issue stage, not commit stage, right?


input logic [CVA6Cfg.NrCommitPorts-1:0][CVA6Cfg.TRANS_ID_BITS-1:0] commit_tran_id_i,
// Signals speculative loads for non-idempotent load handling - ISSUE_STAGE
input logic speculative_load_i,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

speculative_load_i should be inserted just after lsu_valid_i IMO

It is connected to issue stage, not commit stage, right?

input logic page_offset_matches_i,
// Store buffer is empty - STORE_UNIT
input logic store_buffer_empty_i,
// Transaction ID of the committing instruction - COMMIT_STAGE

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Transaction ID of the committing instruction - COMMIT_STAGE
// Transaction ID of the committing instructions - COMMIT_STAGE

end
assign inflight_stores = (!dcache_wbuffer_not_ni_i || !store_buffer_empty_i);
assign stall_ni = (inflight_stores || not_commit_time) && (paddr_ni && CVA6Cfg.NonIdemPotenceEn);
assign stall_ni = (inflight_stores || &not_commit_time) && (paddr_ni && CVA6Cfg.NonIdemPotenceEn);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, the not_commit_time guaranteed that there was no previous instructions as it could be a branch miss or a exception.
Here &not_commit_time = 1'b0 if the load is the second committing instruction, so there might be an uncommitted instruction before it.

I do not yet understand why is this change needed (e.g.: correctness or performance?) and why it is correct.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No this change is not needed for speculative loads and now I also realized that this is wrong, so I'm removing this part.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'm curious to know whether removing this part of the diff changes performance

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this part, performance are still the same

@LorenzoR01
Copy link
Author

Thanks for the contribution!

Unfortunately, I cannot review LSU's FSM as I am not familiar enough with it to understand the internal changes. I don't know enough about the TLB either. Still, I'm wondering: in the execute stage, the branch unit answers immediately combinatorially, so do we need to go to a new state in LSU's FSM?

Is there a maximum frequency regression due to the ALU->BRANCH->LSU path, or is the logic of the FSM shallow enough? How does the performance of this PR compare with the performance of the fast but wrong implementation where we only remove the stall condition from the issue stage?

Below are minor comments about the code.

Cc @Gchauvon

It should be possible to use the branch result combinationally to control the load unit FSM without a maximum frequency regression, however this combinational path will be longer than the one in this PR, since here the branch result is only used to update a register inside the lsu_bypass.

This PR checks the branch result in the next cycle (and there is a new state in the load unit FSM to do this) so it should have a small performance disadvantage with respect to the incorrect implementation where only the stall condition from the issue stage is removed. However, in my testing I didn’t find any significant performance difference between these two implementations; both achieve around +7% CoreMark/MHz. I also observed the same performance results when comparing this PR to an implementation where the branch result is used to control the load unit FSM combinationally.

core/ex_stage.sv Outdated
lsu_data = fu_data_i[1];
lsu_tinst = tinst_i[1];
if (CVA6Cfg.SpeculativeSb) begin
speculative_load = branch_valid_i[0] ? 1'b1 : 1'b0;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
speculative_load = branch_valid_i[0] ? 1'b1 : 1'b0;
speculative_load = branch_valid_i[0];

Let's avoid the "if true then true else false" tautology

@numero-744
Copy link

LGTM except for the load unit FSM that I cannot review

@Gchauvon
Copy link
Contributor

Gchauvon commented Feb 4, 2026

The LSU looks good to me. Have you done some ASIC or FPGA synthesis ? Have you been able to test it with the linux boot on FPGA ?

@LorenzoR01
Copy link
Author

The LSU looks good to me. Have you done some ASIC or FPGA synthesis ? Have you been able to test it with the linux boot on FPGA ?

Yes, I tested Linux boot on FPGA and ran an ASIC synthesis to confirm that the critical path is unaffected by these changes.

@Gchauvon
Copy link
Contributor

Gchauvon commented Feb 5, 2026

Great ! Well it's ok for me.
Cc @JeanRochCoulon

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants