Analyze interpreter-only mode performance and seek future optimization #682

@jserv

Description

After running CoreMark in interpreter-only mode on an AMD Ryzen Threadripper 2990WX, the following performance characteristics were observed.

Profiling results (perf)

  • rv_step: 32.94%
    Block dispatch, lookup, and control overhead
  • do_lw: 6.76%
    Load word instruction
  • do_addi: 6.47%
    Add immediate
  • do_beq: 5.74%
    Branch if equal
  • do_fuse5: 4.69%
    Fused instruction sequence
  • Other instructions: 3–5% each

Optimizations attempted: mem_base pointer caching

  • Added a uint8_t *mem_base field to the riscv_internal struct
  • Updated ram_read_* and ram_write_* functions to use the cached pointer
  • Result: approximately 0.4% improvement (within noise), confirming that modern compilers already optimize this access pattern effectively
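A minimal sketch of what the change looked like (the field placement and accessor signature here are illustrative; the real definitions live in src/riscv_private.h and src/riscv.c):

#include <stdint.h>
#include <string.h>

struct riscv_internal {
    /* ... existing state: registers, CSRs, block map, ... */
    uint8_t *mem_base; /* cached pointer to the start of emulated RAM */
};

/* Before, each access re-derived the base pointer through the memory
 * object; after, it reads the cached field directly. */
static inline uint32_t ram_read_w(struct riscv_internal *rv, uint32_t addr)
{
    uint32_t v;
    memcpy(&v, rv->mem_base + addr, sizeof(v)); /* alignment-safe load */
    return v;
}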

Performance measurements:

  • Baseline with mem_base caching: ~1042–1057 iterations/sec on CoreMark
  • The mem_base caching remains in place in src/riscv_private.h and src/riscv.c

Observations

  1. Modern compilers are already effective
    The mem_base caching optimization showed minimal benefit because Clang is able to optimize the pointer chain on its own.
  2. Micro-optimizations can backfire
    Adding branch hints and extra cache checks reduced performance, likely due to:
    • Additional conditional branches interfering with branch prediction
    • Changes in code layout affecting instruction cache behavior
    • Different compiler optimization decisions triggered by code structure changes
  3. The 33% cost in rv_step is largely structural
    This overhead includes block lookup, chaining setup, and loop control. The current implementation already uses block chaining for branches via branch_taken and branch_untaken.
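For reference, the existing chaining works roughly as follows (a simplified sketch, not the actual rv32emu decoder or RVOP macro output):

#include <stdbool.h>
#include <stdint.h>

typedef struct riscv { /* illustrative subset of riscv_internal */
    uint32_t PC;
    uint32_t X[32];
} riscv_t;

typedef struct rv_insn {
    uint8_t rs1, rs2;
    int32_t imm;
    bool (*impl)(riscv_t *, const struct rv_insn *); /* handler */
    struct rv_insn *next;           /* next instruction in the block */
    struct rv_insn *branch_taken;   /* translated taken target, or NULL */
    struct rv_insn *branch_untaken; /* translated fall-through, or NULL */
} rv_insn_t;

static bool do_beq(riscv_t *rv, const rv_insn_t *ir)
{
    if (rv->X[ir->rs1] == rv->X[ir->rs2]) {
        rv->PC += ir->imm;
        if (ir->branch_taken) /* chained: stay in the tail-call chain */
            return ir->branch_taken->impl(rv, ir->branch_taken);
        return true; /* not yet chained: return to rv_step for a lookup */
    }
    rv->PC += 4;
    if (ir->branch_untaken)
        return ir->branch_untaken->impl(rv, ir->branch_untaken);
    return true;
}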

Analysis: why rv_step accounts for ~33%

  • Issue 1: fall-through blocks are not chained

Block chaining currently only applies to explicit branches. When a block ends with a non-branch instruction, ir->next is NULL, forcing a return to rv_step at every block boundary.

Example:

Block A: [add, sub, mul, lw]   -> ir->next = NULL at lw
Block B: [addi, sw, beq, ...]

Even though Block A falls through directly to Block B, execution returns to rv_step to locate Block B.

  • Issue 2: per-instruction cycle counting

Each RVOP handler performs cycle++. For simple ALU operations such as addi or add, this overhead is significant relative to the actual instruction work.
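Reusing the illustrative types from the chaining sketch above (extended with rd and csr_cycle fields), an ALU handler has roughly this shape:

static bool do_addi(riscv_t *rv, const rv_insn_t *ir)
{
    rv->csr_cycle++; /* per-instruction accounting: one increment per insn */
    rv->X[ir->rd] = rv->X[ir->rs1] + ir->imm;
    rv->PC += 4;
    if (ir->next) /* tail-call into the rest of the block */
        return ir->next->impl(rv, ir->next);
    return true;
}

Of this boilerplate, the cycle increment is the one piece that can be hoisted out of the handler entirely, which motivates the block-level counting proposed below.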

Possible optimization: cross-block threading for fall-through paths

Extend rv_step to dynamically link fall-through blocks:

if (prev) {
    /* prev is the block that just finished executing; block is the one
     * just located for the current PC. */
    rv_insn_t *last_ir = prev->ir_tail;
    /* If the last instruction is NOT a branch or jump, the only possible
     * successor is the block at the fall-through PC, so the link stays
     * valid once made. */
    if (!insn_is_branch_or_jump(last_ir->opcode)) {
        last_ir->next = block->ir_head;
    }
}

This creates dynamic superblocks spanning multiple basic blocks, allowing execution to remain within the tail-call chain and reducing returns to rv_step.

Possible optimization: block-level cycle counting

Move cycle accounting from per-instruction to per-block:

  1. Add uint32_t cycle_cost to block_t, computed during translation
  2. Remove cycle++ from the RVOP macro
  3. Add rv->csr_cycle += block->cycle_cost at block entry

Benefits:

  • Eliminates arithmetic in hot instruction handlers
  • Particularly beneficial for simple ALU instructions such as addi, add, and, or
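A sketch of the resulting shape (cycle_cost follows the steps above; n_insn is an assumed name for the per-block instruction count):

typedef struct block {
    uint32_t n_insn;     /* number of instructions in the block */
    uint32_t cycle_cost; /* computed once at translation time */
    rv_insn_t *ir_head, *ir_tail;
    /* ... */
} block_t;

/* At translation time, with a one-cycle-per-instruction model: */
block->cycle_cost = block->n_insn;

/* At block entry in rv_step, replacing every per-instruction cycle++: */
rv->csr_cycle += block->cycle_cost;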

Alternative direction: dense PC-to-block mapping

Replace the open-addressing hash table with one of the following:

  • A two-level page table indexed by PC >> 12
  • A per-segment flat array for RAM-backed address ranges

This approach aims to reduce block_find() probe overhead without introducing additional conditional branches.
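A minimal sketch of the two-level variant, reusing block_t from the sketch above (all other names are illustrative):

#define PAGE_SHIFT 12
#define PAGE_MASK ((1u << PAGE_SHIFT) - 1)

typedef struct {
    /* one slot per 2-byte-aligned PC in a 4 KiB page (RVC permits
     * 16-bit instruction alignment) */
    block_t *slot[1u << (PAGE_SHIFT - 1)];
} block_page_t;

/* top level: 2^20 page pointers (8 MiB on 64-bit hosts), with pages
 * allocated lazily as code is translated */
static block_page_t *block_dir[1u << (32 - PAGE_SHIFT)];

static inline block_t *block_find_dense(uint32_t pc)
{
    block_page_t *page = block_dir[pc >> PAGE_SHIFT];
    if (!page) /* removable by pointing empty slots at a shared zero page */
        return NULL;
    return page->slot[(pc & PAGE_MASK) >> 1];
}

Unlike open addressing, the lookup cost is two dependent loads regardless of table occupancy, and no probing loop is needed.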
