@vogma
Contributor

@vogma vogma commented Oct 30, 2025

This draft PR adds initial support for the ratified RISC-V Zc subset Zcmp (PUSH/POP and double-move macros) to NEORV32. The implementation of this extension has been discussed in #633, but was closed as not planned. I think this extension is interesting and so I gave the implementation a shot anyway. This is not ready to be merged but I would like to start a discussion on how the implementation can be refined. The current implementation executes all Zcmp instructions in simulation and on hardware. More specifically, the processor_check passes in simulation and the examples in sw/example run on my Arty A7-100T board.

I have summarized some information below. I would appreciate any feedback :)

Synthesis results (Vivado 2025.1)

Baseline configuration: rv32imc_zicntr
With Zcmp enabled: rv32imc_zicntr_zcmp

Top level

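| Config   | Slice LUTs | Slice Registers | LUT as Logic | LUT as Memory |
|----------|------------|-----------------|--------------|---------------|
| Baseline | 1,939      | 1,468           | 1,915        | 24            |
| + Zcmp   | 2,132      | 1,511           | 2,108        | 24            |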

Delta (top): +193 LUTs (+9.95%), +43 FFs (+2.93%)

Code-size impact (sw/example executables)

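| Program         | Zcmp (B) | No Zcmp (B) | Saved (B) | Saved (%) |
|-----------------|----------|-------------|-----------|-----------|
| processor_check | 25,992   | 26,352      | 360       | 1.37%     |
| game_of_life    | 4,636    | 4,908       | 272       | 5.54%     |
| coremark (-Os)  | 32,560   | 33,252      | 692       | 2.08%     |
| bus_explorer    | 8,532    | 9,016       | 484       | 5.37%     |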
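
As a quick cross-check, the "Saved" columns can be recomputed from the two size columns. The snippet below is only a sketch: the byte counts are copied from the table, and the helper name `savings` is made up for illustration.

```python
# Code sizes in bytes, copied from the table: (with Zcmp, without Zcmp).
sizes = {
    "processor_check": (25_992, 26_352),
    "game_of_life": (4_636, 4_908),
    "coremark (-Os)": (32_560, 33_252),
    "bus_explorer": (8_532, 9_016),
}


def savings(with_zcmp, without_zcmp):
    """Return (bytes saved, percent saved relative to the non-Zcmp build)."""
    saved = without_zcmp - with_zcmp
    return saved, 100.0 * saved / without_zcmp


for name, (w, wo) in sizes.items():
    saved, pct = savings(w, wo)
    print(f"{name:16s} saved {saved:4d} B ({pct:.2f}%)")
```

The percentages are relative to the build without Zcmp, which matches the values reported in the table.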

The code-size results are rather underwhelming for the example software shipped with the NEORV32. I guess this is because the examples are quite small, so the compiler can generate only a few cm.push and cm.pop instructions. I tried to compile the Embench benchmark suite referenced in issue #633, but so far without success.

Short overview (Zcmp)

Zcmp adds six 16-bit macros:

  • cm.push / cm.pop
  • cm.popret / cm.popretz
  • cm.mvsa01 / cm.mva01s
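
To make the "macro" idea concrete, here is a minimal Python sketch of how a single cm.push expands into ordinary stores plus one stack-pointer adjustment on RV32. It follows my reading of the pseudocode in the Zcmp specification; the function names (`reg_list`, `stack_adj`, `expand_push`) are invented for this illustration, and the sketch describes the architectural expansion, not necessarily the micro-op order used in this PR.

```python
# ABI names of the registers Zcmp can save, in rlist order.
SAVE_ORDER = ["ra", "s0", "s1", "s2", "s3", "s4", "s5",
              "s6", "s7", "s8", "s9", "s10", "s11"]


def reg_list(rlist):
    """Registers selected by the 4-bit rlist field (valid values 4..15)."""
    if not 4 <= rlist <= 15:
        raise ValueError("rlist values 0..3 are reserved")
    # rlist=4 -> {ra}, 5 -> {ra, s0}, ..., 14 -> {ra, s0-s9}, 15 -> {ra, s0-s11}
    count = 13 if rlist == 15 else rlist - 3
    return SAVE_ORDER[:count]


def stack_adj(rlist, spimm):
    """RV32: register area rounded up to 16 bytes, plus spimm * 16 (spimm 0..3)."""
    base = (4 * len(reg_list(rlist)) + 15) // 16 * 16
    return base + 16 * spimm


def expand_push(rlist, spimm=0):
    """Return the equivalent RV32 instruction sequence for one cm.push."""
    regs = reg_list(rlist)
    # The highest-numbered register lands just below the old sp; ra ends up lowest.
    uops = [f"sw {r}, {-4 * (i + 1)}(sp)" for i, r in enumerate(reversed(regs))]
    # sp is updated last, after all stores.
    uops.append(f"addi sp, sp, {-stack_adj(rlist, spimm)}")
    return uops


for uop in expand_push(7, 3):  # cm.push {ra, s0-s2}, -64
    print(uop)
```

For `expand_push(7, 3)` this yields four `sw` instructions (s2, s1, s0, ra at offsets -4 to -16) followed by `addi sp, sp, -64`. Because sp changes only at the very end, the stores can be repeated from the start if the sequence is interrupted before that point.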

Implementation notes

Frontend

  • Two existing FSMs remain: fetch (fills two 16-bit FIFOs) and issue (forms valid instruction words).

  • New uop_fsm is synthesized only when RISCV_ISA_Zcmp generic is set.
    When the decompressor detects Zcmp, issue_fsm enters a Zcmp state and uop_fsm emits the micro-op sequence.

  • The frontend outputs are multiplexed:

    • frontend_bus_issue (normal path)
    • frontend_bus_zcmp (uOp path)
      frontend_o selects between them via zcmp_uop_in_progress.

New signals from frontend to control-unit:

  • zcmp_in_uop_seq – Zcmp micro-op sequence active
  • zcmp_start – sequence begins next cycle
  • zcmp_atomic_tail – atomic tail section of the sequence (no traps allowed)

Control unit

  • On zcmp_start, the PC is latched. During the micro-op sequence exe_engine.pc / pc2 are held.
  • Trap logic is inhibited when zcmp_atomic_tail is asserted to honor POP/POPRET atomicity.
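
The two control-unit rules above can be written down as a tiny behavioral model. This is a deliberately simplified Python sketch, not the VHDL from the PR: the class `ZcmpStatus` and the functions `next_pc` and `trap_taken` are hypothetical, and holding the PC on both `start` and `in_uop_seq` is my assumption about how the signals combine.

```python
from dataclasses import dataclass


@dataclass
class ZcmpStatus:
    in_uop_seq: bool = False   # models zcmp_in_uop_seq
    start: bool = False        # models zcmp_start
    atomic_tail: bool = False  # models zcmp_atomic_tail


def next_pc(pc, sequential_pc, status):
    """Hold the PC while a micro-op sequence is starting or running (assumption)."""
    if status.start or status.in_uop_seq:
        return pc
    return sequential_pc


def trap_taken(trap_pending, status):
    """A pending trap is not taken while the atomic tail executes."""
    return trap_pending and not status.atomic_tail


running = ZcmpStatus(in_uop_seq=True)
tail = ZcmpStatus(in_uop_seq=True, atomic_tail=True)
print(hex(next_pc(0x100, 0x102, running)))  # PC stays at the Zcmp instruction
print(trap_taken(True, running))            # trap may abort the sequence here
print(trap_taken(True, tail))               # trap is held off during the tail
```

The model leaves open what happens to a trap that was masked during the tail, which is exactly the part flagged below as needing refinement.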

The trap/exception handling across the atomic tail likely needs refinement. I’d appreciate feedback.

Testing

I created test benches that execute every valid combination of the six Zcmp instructions and compare the affected architectural state against known-good outputs from the RISC-V Spike (riscv-isa-sim) reference simulator. For each sequence, the benches verify destination register values, stack-pointer updates, and control-flow effects (cm.popret/cm.popretz, including a0 zeroing).
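
The comparison step can be pictured as a diff of two architectural-state snapshots. The sketch below is illustrative only: the dictionary layout, the register values, and the helper `diff_state` are assumptions and do not reflect the actual test-bench code or Spike's output format.

```python
def diff_state(dut, ref):
    """Return {name: (dut_value, ref_value)} for every entry that differs."""
    names = sorted(set(dut) | set(ref))
    return {n: (dut.get(n), ref.get(n)) for n in names if dut.get(n) != ref.get(n)}


# Hypothetical post-instruction snapshots for a cm.popretz.
reference = {"pc": 0x8000_0040, "sp": 0x8000_1000, "a0": 0,
             "ra": 0x8000_0040, "s0": 0x1234}
dut_good = dict(reference)
dut_bad = dict(reference, a0=7)  # a0 was not zeroed

print(diff_state(dut_good, reference))  # {}
print(diff_state(dut_bad, reference))   # {'a0': (7, 0)}
```

An empty result means the device under test matches the reference for every tracked register.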

The hardware and software implementation is based on the Verilog Zcmp implementation of the Hazard3 RISC-V CPU.

Thanks for taking a look. I am happy to iterate based on your feedback.

@stnolting
Owner

Hey @vogma, thanks for the PR and the detailed information!
I saw your work in the network view here on GitHub and followed it with great interest.

> The implementation of this extension has been discussed in #633, but was closed as not planned.

That's right. To be honest, I'm not a fan of the Zcmp extension. An instruction that breaks down into many different instructions is somehow the opposite of what RISC originally stood for. But that's just me. 😉

Anyway, your results look pretty good, and I think that makes a discussion more than appropriate!

Synthesis results

+10% hardware costs - that's not bad for the code saving that you have provided!
Can you tell us anything about the maximum clock speed? Does the new extension affect the critical path?

Code-size impact

That's not bad at all. But code size isn't that important to me here. Since many instructions are replaced by a single (16-bit wide) instruction, this should also have a noticeable impact on bus traffic. Have you checked whether, for example, the processor check or Coremark run a little faster as a result?

Testing

That's really great work! I hope the riscv-arch-test TG comes up with some similar tests one day!

Did you check exception handling? What happens if any of the resulting memory instructions cause an exception? What happens when a debug request arrives during a Zcmp instruction?

(This is another point that bothers me about the Zcmp extension: exceptions are no longer precise and can no longer be assigned to a single operation.)

Frontend

I noticed that your code affects many modules in the CPU. Have you checked whether the entire code is optimized away when the ISA extension is disabled?

As I said, I'm not a fan of the Zcmp extension. But I also think that this should be left up to the end user. So I think your PR is good!

I'm just wondering if the RTL changes could be condensed a little more. Basically, you need an FSM between the issue logic and the CPU's execution microsequencer. As soon as a Zcmp instruction is executed, the PC of the execution unit is stalled and the update signal to the issue logic or instruction prefetch buffer is suppressed. The FSM can then send as many instructions as it likes to the execution unit.

So I'm wondering if we could move most of the Zcmp logic into a new module that would be placed within the front-end bus system. 🤔

@vogma
Contributor Author

vogma commented Nov 3, 2025

Thank you for the feedback, @stnolting.

> To be honest, I'm not a fan of the Zcmp extension. An instruction that breaks down into many different instructions is somehow the opposite of what RISC originally stood for. But that's just me. 😉

I agree it's quite different from all the other available RISC-V extensions. I only chose this extension because it hasn't been implemented in this CPU core yet and I wanted to work with the NEORV32 :)

> Can you tell us anything about the maximum clock speed? Does the new extension affect the critical path?

I haven't run extensive Fmax tests yet, but when synthesized with a 100 MHz clock constraint the Zcmp-enabled implementation has a worst negative slack (WNS) of +1.112 ns, which is slightly better than without the Zcmp extension (WNS = +1.03 ns).
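
A rough Fmax figure can be derived from these numbers with the usual first-order estimate Fmax ≈ 1 / (T_clk − WNS). The snippet below is a back-of-the-envelope sketch (the function name `fmax_mhz` is made up); a real Fmax figure would require re-running implementation with a tighter clock constraint.

```python
def fmax_mhz(clock_mhz, wns_ns):
    """First-order Fmax estimate from the constrained clock and the reported WNS."""
    period_ns = 1000.0 / clock_mhz
    return 1000.0 / (period_ns - wns_ns)


print(f"with Zcmp:    {fmax_mhz(100, 1.112):.1f} MHz")
print(f"without Zcmp: {fmax_mhz(100, 1.03):.1f} MHz")
```

With a 10 ns period this gives roughly 112.5 MHz with Zcmp and 111.5 MHz without, so the difference is small enough to be placement and routing noise rather than a real timing change.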

> Since many instructions are replaced by a single (16-bit wide) instruction, this should also have a noticeable impact on bus traffic. Have you checked whether, for example, the processor check or Coremark run a little faster as a result?

That's a good point. I just noticed that in my branch the CoreMark executable raises an illegal instruction exception since the latest merge, so I will have to investigate that before I can answer your question.

> Did you check exception handling? What happens if any of the resulting memory instructions cause an exception? What happens when a debug request arrives during a Zcmp instruction?

Only in theory. The specification states that as long as the Zcmp micro-op sequence has not reached its atomic tail section, the sequence can be aborted and retried. If a trap occurs during these memory operations, the Zcmp sequence is aborted. However, I have not yet tested whether the sequence is retried after that.

The behaviour during debugger requests is a great point. Thanks for bringing that up.

> (This is another point that bothers me about the Zcmp extension: exceptions are no longer precise and can no longer be assigned to a single operation.)

True. Because the PC is held at the address of the Zcmp instruction, only this macro instruction will be flagged.

> I noticed that your code affects many modules in the CPU. Have you checked whether the entire code is optimized away when the ISA extension is disabled?

Good point. I compared my implementation (Zcmp generic => false) with the main branch: 21 LUTs and 8 registers are not optimized away in my implementation.

> So I'm wondering if we could move most of the Zcmp logic into a new module that would be placed within the front-end bus system. 🤔

Yes, I will move as much logic as possible into a separate module. This will probably also fix the unaccounted LUTs and registers mentioned above.

Thank you for your feedback. I gained some valuable information and will work on my implementation in the coming weeks.

vogma added 2 commits November 8, 2025 20:18
During a double move instruction, exception handling is suspended until the instruction finishes but the restart on branch functionality has to be implemented. This implements the restart on branch during double moves
@stnolting
Owner

> I only chose this extension because it hasn't been implemented in this CPU core yet and I wanted to work with the NEORV32 :)

That's cool, and I really appreciate it!

> I haven't run extensive Fmax tests yet, but when synthesized with a 100 MHz clock constraint the Zcmp-enabled implementation has a worst negative slack (WNS) of +1.112 ns, which is slightly better than without the Zcmp extension (WNS = +1.03 ns).

That sounds good!

> Yes, I will move as much logic as possible into a separate module. This will probably also fix the unaccounted LUTs and registers mentioned above.

Thanks a lot! Let me know if I can help in any way.
