Analyze interpreter-only mode performance and seek future optimization #682

@jserv

Description

After running CoreMark in interpreter-only mode on an AMD Ryzen Threadripper 2990WX, the following performance characteristics were observed.

Profiling results (perf)

  • rv_step: 32.94%
    Block dispatch, lookup, and control overhead
  • do_lw: 6.76%
    Load word instruction
  • do_addi: 6.47%
    Add immediate
  • do_beq: 5.74%
    Branch if equal
  • do_fuse5: 4.69%
    Fused instruction sequence
  • Other instructions: 3–5% each

Optimizations attempted: mem_base pointer caching

  • Added a uint8_t *mem_base field to the riscv_internal struct
  • Updated ram_read_* and ram_write_* functions to use the cached pointer
  • Result: approximately 0.4% improvement (within noise), confirming that modern compilers already optimize this access pattern effectively
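A minimal sketch of what the change looked like (the field placement and accessor signature here are illustrative; the real definitions live in src/riscv_private.h and src/riscv.c):

#include <stdint.h>
#include <string.h>

struct riscv_internal {
    /* ... existing state: registers, CSRs, block map, ... */
    uint8_t *mem_base; /* cached pointer to the start of emulated RAM */
};

/* Before, each access re-derived the base pointer through the memory
 * object; after, it reads the cached field directly. */
static inline uint32_t ram_read_w(struct riscv_internal *rv, uint32_t addr)
{
    uint32_t v;
    memcpy(&v, rv->mem_base + addr, sizeof(v)); /* alignment-safe load */
    return v;
}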

Performance measurements:

  • Baseline with mem_base caching: ~1042–1057 iterations/sec on CoreMark
  • The mem_base caching remains in place in src/riscv_private.h and src/riscv.c

Observations

  1. Modern compilers are already effective
    The mem_base caching optimization showed minimal benefit because Clang is able to optimize the pointer chain on its own.
  2. Micro-optimizations can backfire
    Adding branch hints and extra cache checks reduced performance, likely due to:
    • Additional conditional branches interfering with branch prediction
    • Changes in code layout affecting instruction cache behavior
    • Different compiler optimization decisions triggered by code structure changes
  3. The 33% cost in rv_step is largely structural
    This overhead includes block lookup, chaining setup, and loop control. The current implementation already uses block chaining for branches via branch_taken and branch_untaken.
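For reference, the existing chaining works roughly as follows (a simplified sketch, not the actual rv32emu decoder or RVOP macro output):

#include <stdbool.h>
#include <stdint.h>

typedef struct riscv { /* illustrative subset of riscv_internal */
    uint32_t PC;
    uint32_t X[32];
} riscv_t;

typedef struct rv_insn {
    uint8_t rs1, rs2;
    int32_t imm;
    bool (*impl)(riscv_t *, const struct rv_insn *); /* handler */
    struct rv_insn *next;           /* next instruction in the block */
    struct rv_insn *branch_taken;   /* translated taken target, or NULL */
    struct rv_insn *branch_untaken; /* translated fall-through, or NULL */
} rv_insn_t;

static bool do_beq(riscv_t *rv, const rv_insn_t *ir)
{
    if (rv->X[ir->rs1] == rv->X[ir->rs2]) {
        rv->PC += ir->imm;
        if (ir->branch_taken) /* chained: stay in the tail-call chain */
            return ir->branch_taken->impl(rv, ir->branch_taken);
        return true; /* not yet chained: return to rv_step for a lookup */
    }
    rv->PC += 4;
    if (ir->branch_untaken)
        return ir->branch_untaken->impl(rv, ir->branch_untaken);
    return true;
}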

Analysis: why rv_step accounts for ~33%

  • Issue 1: fall-through blocks are not chained

Block chaining currently only applies to explicit branches. When a block ends with a non-branch instruction, ir->next is NULL, forcing a return to rv_step at every block boundary.

Example:

Block A: [add, sub, mul, lw]   -> ir->next = NULL at lw
Block B: [addi, sw, beq, ...]

Even though Block A falls through directly to Block B, execution returns to rv_step to locate Block B.

  • Issue 2: per-instruction cycle counting

Each RVOP handler performs cycle++. For simple ALU operations such as addi or add, this overhead is significant relative to the actual instruction work.
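Reusing the illustrative types from the chaining sketch above (extended with rd and csr_cycle fields), an ALU handler has roughly this shape:

static bool do_addi(riscv_t *rv, const rv_insn_t *ir)
{
    rv->csr_cycle++; /* per-instruction accounting: one increment per insn */
    rv->X[ir->rd] = rv->X[ir->rs1] + ir->imm;
    rv->PC += 4;
    if (ir->next) /* tail-call into the rest of the block */
        return ir->next->impl(rv, ir->next);
    return true;
}

Of this boilerplate, the cycle increment is the one piece that can be hoisted out of the handler entirely, which motivates the block-level counting proposed below.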

Possible optimization: cross-block threading for fall-through paths

Extend rv_step to dynamically link fall-through blocks:

if (prev) {
    /* prev is the block that just finished executing; block is the one
     * just located for the current PC. */
    rv_insn_t *last_ir = prev->ir_tail;
    /* If the last instruction is NOT a branch or jump, the only possible
     * successor is the block at the fall-through PC, so the link stays
     * valid once made. */
    if (!insn_is_branch_or_jump(last_ir->opcode)) {
        last_ir->next = block->ir_head;
    }
}

This creates dynamic superblocks spanning multiple basic blocks, allowing execution to remain within the tail-call chain and reducing returns to rv_step.

Possible optimization: block-level cycle counting

Move cycle accounting from per-instruction to per-block:

  1. Add uint32_t cycle_cost to block_t, computed during translation
  2. Remove cycle++ from the RVOP macro
  3. Add rv->csr_cycle += block->cycle_cost at block entry

Benefits:

  • Eliminates arithmetic in hot instruction handlers
  • Particularly beneficial for simple ALU instructions such as addi, add, and, or
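A sketch of the resulting shape (cycle_cost follows the steps above; n_insn is an assumed name for the per-block instruction count):

typedef struct block {
    uint32_t n_insn;     /* number of instructions in the block */
    uint32_t cycle_cost; /* computed once at translation time */
    rv_insn_t *ir_head, *ir_tail;
    /* ... */
} block_t;

/* At translation time, with a one-cycle-per-instruction model: */
block->cycle_cost = block->n_insn;

/* At block entry in rv_step, replacing every per-instruction cycle++: */
rv->csr_cycle += block->cycle_cost;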

Alternative direction: dense PC-to-block mapping

Replace the open-addressing hash table with one of the following:

  • A two-level page table indexed by PC >> 12
  • A per-segment flat array for RAM-backed address ranges

This approach aims to reduce block_find() probe overhead without introducing additional conditional branches.
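A minimal sketch of the two-level variant, reusing block_t from the sketch above (all other names are illustrative):

#define PAGE_SHIFT 12
#define PAGE_MASK ((1u << PAGE_SHIFT) - 1)

typedef struct {
    /* one slot per 2-byte-aligned PC in a 4 KiB page (RVC permits
     * 16-bit instruction alignment) */
    block_t *slot[1u << (PAGE_SHIFT - 1)];
} block_page_t;

/* top level: 2^20 page pointers (8 MiB on 64-bit hosts), with pages
 * allocated lazily as code is translated */
static block_page_t *block_dir[1u << (32 - PAGE_SHIFT)];

static inline block_t *block_find_dense(uint32_t pc)
{
    block_page_t *page = block_dir[pc >> PAGE_SHIFT];
    if (!page) /* removable by pointing empty slots at a shared zero page */
        return NULL;
    return page->slot[(pc & PAGE_MASK) >> 1];
}

Unlike open addressing, the lookup cost is two dependent loads regardless of table occupancy, and no probing loop is needed.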
