After running CoreMark in interpreter-only mode on an AMD Ryzen Threadripper 2990WX, the following performance characteristics were observed.
Profiling results (perf)
- `rv_step`: 32.94% (block dispatch, lookup, and control overhead)
- `do_lw`: 6.76% (load word instruction)
- `do_addi`: 6.47% (add immediate)
- `do_beq`: 5.74% (branch if equal)
- `do_fuse5`: 4.69% (fused instruction sequence)
- Other instructions: 3–5% each
Optimizations attempted: mem_base pointer caching
- Added a `uint8_t *mem_base` field to the `riscv_internal` struct
- Updated the `ram_read_*` and `ram_write_*` functions to use the cached pointer
- Result: approximately 0.4% improvement (within noise), confirming that modern compilers already optimize this access pattern effectively
Performance measurements:
- Baseline with mem_base caching: ~1042–1057 iterations/sec on CoreMark
- The mem_base caching remains in place in `src/riscv_private.h` and `src/riscv.c`
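
For reference, a minimal sketch of the caching idea described above. The struct and helper names mirror this issue, but the surrounding layout is illustrative, not the actual rv32emu definitions.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    /* ... register file, CSRs, decoded-block cache, ... */
    uint8_t *mem_base; /* cached pointer to the RAM backing store, set once at init */
} riscv_internal;

/* Load a 32-bit word through the cached base pointer instead of re-deriving
 * it from the memory object on every access. */
static inline uint32_t ram_read_w(const riscv_internal *rv, uint32_t addr)
{
    uint32_t val;
    memcpy(&val, rv->mem_base + addr, sizeof(val));
    return val;
}
```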
Observations
- Modern compilers are already effective
  The mem_base caching optimization showed minimal benefit because Clang is able to optimize the pointer chain on its own.
- Micro-optimizations can backfire
  Adding branch hints and extra cache checks reduced performance, likely due to:
  - Additional conditional branches interfering with branch prediction
  - Changes in code layout affecting instruction cache behavior
  - Different compiler optimization decisions triggered by code structure changes
- The 33% cost in rv_step is largely structural
  This overhead includes block lookup, chaining setup, and loop control. The current implementation already uses block chaining for branches via `branch_taken` and `branch_untaken` (see the sketch after this list).
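
A minimal sketch of the pointers that chaining relies on, following the field names used in this issue; the real `rv_insn_t` carries many more decoded fields.

```c
/* Sketch only: decoded-instruction node with the chaining pointers named above. */
typedef struct rv_insn {
    /* ... decoded opcode, operands, handler pointer, ... */
    struct rv_insn *next;           /* next instruction within the same block */
    struct rv_insn *branch_taken;   /* chained block entered when the branch is taken */
    struct rv_insn *branch_untaken; /* chained block entered on fall-through */
} rv_insn_t;
```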
Analysis: why rv_step accounts for ~33%
- Issue 1: fall-through blocks are not chained
  Block chaining currently only applies to explicit branches. When a block ends with a non-branch instruction, `ir->next` is NULL, forcing a return to rv_step at every block boundary.
  Example:
  ```
  Block A: [add, sub, mul, lw] -> ir->next = NULL at lw
  Block B: [addi, sw, beq, ...]
  ```
  Even though Block A falls through directly to Block B, execution returns to rv_step to locate Block B.
- Issue 2: per-instruction cycle counting
  Each RVOP handler performs `cycle++`. For simple ALU operations such as addi or add, this overhead is significant relative to the actual instruction work.
Possible optimization: cross-block threading for fall-through paths
Extend rv_step to dynamically link fall-through blocks:
```c
if (prev) {
    rv_insn_t *last_ir = prev->ir_tail;
    /* If the last instruction is NOT a branch or jump, link fall-through */
    if (!insn_is_branch_or_jump(last_ir->opcode)) {
        last_ir->next = block->ir_head;
    }
}
```

This creates dynamic superblocks spanning multiple basic blocks, allowing execution to remain within the tail-call chain and reducing returns to rv_step.
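
The snippet above assumes a helper that classifies opcodes. One possible shape of that predicate is sketched below; the opcode constants are illustrative stand-ins for the emulator's real instruction enum.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative opcode constants; the real decoder defines its own enum. */
enum {
    OP_BEQ, OP_BNE, OP_BLT, OP_BGE, OP_BLTU, OP_BGEU,
    OP_JAL, OP_JALR,
    OP_ADDI, OP_ADD, OP_LW, OP_SW /* ... remaining opcodes ... */
};

/* Returns true for control-transfer instructions, which are already chained
 * through branch_taken/branch_untaken, so fall-through linking skips them. */
static bool insn_is_branch_or_jump(uint8_t opcode)
{
    switch (opcode) {
    case OP_BEQ: case OP_BNE: case OP_BLT:
    case OP_BGE: case OP_BLTU: case OP_BGEU:
    case OP_JAL: case OP_JALR:
        return true;
    default:
        return false;
    }
}
```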
Possible optimization: block-level cycle counting
Move cycle accounting from per-instruction to per-block:
- Add `uint32_t cycle_cost` to `block_t`, computed during translation
- Remove `cycle++` from the `RVOP` macro
- Add `rv->csr_cycle += block->cycle_cost` at block entry
Benefits:
- Eliminates arithmetic in hot instruction handlers
- Particularly beneficial for simple ALU instructions such as addi, add, and, or
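
A sketch of this block-level accounting under the assumptions above; the `block_t` fields and the hook names (`block_finalize`, `block_enter`) are illustrative rather than existing rv32emu functions.

```c
#include <stdint.h>

typedef struct block {
    uint32_t n_insn;     /* number of instructions translated into this block */
    uint32_t cycle_cost; /* total cycles charged when the block is entered */
    /* ... ir_head, ir_tail, pc_start, chaining pointers, ... */
} block_t;

typedef struct {
    uint64_t csr_cycle; /* cycle CSR, advanced per block instead of per instruction */
    /* ... remaining interpreter state ... */
} riscv_t;

/* Translation time: compute the block's cost once (1 cycle per instruction,
 * matching the current per-instruction cycle++). */
static void block_finalize(block_t *block)
{
    block->cycle_cost = block->n_insn;
}

/* Block entry in rv_step: a single addition replaces N per-instruction increments. */
static void block_enter(riscv_t *rv, const block_t *block)
{
    rv->csr_cycle += block->cycle_cost;
}
```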
Alternative direction: dense PC-to-block mapping
Replace the open-addressing hash table with one of the following:
- A two-level page table indexed by `PC >> 12`
- A per-segment flat array for RAM-backed address ranges
This approach aims to reduce block_find() probe overhead without introducing additional conditional branches.
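
A sketch of the two-level variant, assuming a 32-bit PC space, 4 KiB pages, and 4-byte-aligned block start addresses (no RVC); all names here are illustrative.

```c
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SHIFT 12u
#define DIR_ENTRIES (1u << (32u - PAGE_SHIFT)) /* 2^20 first-level slots (~8 MB of pointers) */
#define PAGE_SLOTS (1u << (PAGE_SHIFT - 2u))   /* one slot per 4-byte-aligned PC in a page */

typedef struct block block_t; /* opaque here; the real block_t lives in the emulator */

typedef struct {
    block_t **dir[DIR_ENTRIES]; /* second-level pages, allocated lazily */
} block_map_t;

/* Lookup: two dependent loads, no probing and no key comparison. */
static block_t *block_find_dense(const block_map_t *map, uint32_t pc)
{
    block_t **page = map->dir[pc >> PAGE_SHIFT];
    if (!page)
        return NULL;
    return page[(pc & ((1u << PAGE_SHIFT) - 1u)) >> 2];
}

/* Insert: allocate a page on first use (error handling omitted). */
static void block_insert_dense(block_map_t *map, uint32_t pc, block_t *block)
{
    block_t **page = map->dir[pc >> PAGE_SHIFT];
    if (!page) {
        page = calloc(PAGE_SLOTS, sizeof(block_t *));
        map->dir[pc >> PAGE_SHIFT] = page;
    }
    page[(pc & ((1u << PAGE_SHIFT) - 1u)) >> 2] = block;
}
```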