WIP/RFC: Add RISC-V `Zcmp` code-size reduction extension to NEORV32 #1417

vogma · 2025-10-30T07:27:54Z

This draft PR adds initial support for Zcmp (PUSH/POP and double-move macros), a subset of the ratified RISC-V Zc extensions, to NEORV32. Implementing this extension was discussed in #633, which was closed as not planned. I think the extension is interesting, so I gave the implementation a shot anyway. This is not ready to be merged, but I would like to start a discussion on how the implementation can be refined. The current implementation executes all Zcmp instructions in simulation and on hardware. More specifically, processor_check passes in simulation and the examples in sw/example run on my Arty A7-100T board.

I have summarized some information below. I would appreciate any feedback :)

Synthesis results (Vivado 2025.1)

Baseline configuration: rv32imc_zicntr
With Zcmp enabled: rv32imc_zicntr_zcmp

Top level

| Config | Slice LUTs | Slice Registers | LUT as Logic | LUT as Memory |
|---|---|---|---|---|
| Baseline | 1,939 | 1,468 | 1,915 | 24 |
| + Zcmp | 2,132 | 1,511 | 2,108 | 24 |

Delta (top): +193 LUTs (+9.95%), +43 FFs (+2.93%)

Code-size impact (`sw/example` executables)

| Program | Zcmp (B) | No Zcmp (B) | Saved (B) | Saved (%) |
|---|---|---|---|---|
| `processor_check` | 25,992 | 26,352 | 360 | 1.37% |
| `game_of_life` | 4,636 | 4,908 | 272 | 5.54% |
| `coremark` (`-Os`) | 32,560 | 33,252 | 692 | 2.08% |
| `bus_explorer` | 8,532 | 9,016 | 484 | 5.37% |
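The "Saved" columns can be reproduced from the two size columns. A minimal sketch, assuming the percentage is taken relative to the build without Zcmp; the names `sizes` and `savings` are made up for illustration:

```python
# Sizes in bytes, copied from the table above: (with Zcmp, without Zcmp).
sizes = {
    "processor_check": (25_992, 26_352),
    "game_of_life": (4_636, 4_908),
    "coremark": (32_560, 33_252),
    "bus_explorer": (8_532, 9_016),
}


def savings(with_zcmp, without_zcmp):
    """Return (bytes saved, percent saved relative to the non-Zcmp build)."""
    saved = without_zcmp - with_zcmp
    return saved, round(100.0 * saved / without_zcmp, 2)


results = {name: savings(*pair) for name, pair in sizes.items()}

for name, (saved, pct) in results.items():
    print(f"{name:16s} {saved:4d} B  {pct:.2f} %")
```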

The code-size results are rather underwhelming for the example software shipped with the NEORV32. My guess is that the examples are quite small, so the compiler can only generate a few cm.push and cm.pop instructions. I tried to compile the Embench benchmark suite that was referenced in issue #633, but so far I have not been successful.

Short overview (Zcmp)

Zcmp adds six 16-bit macros:

cm.push / cm.pop
cm.popret / cm.popretz
cm.mvsa01 / cm.mva01s
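To make the "macro" idea concrete, here is a rough sketch of how a single cm.push could expand into ordinary RV32 instructions. It follows my reading of the Zcmp specification for RV32I (rlist values 4 to 15, stack adjustment rounded up to 16 bytes plus spimm * 16). The function names are hypothetical, and the order in which the PR's uop_fsm actually emits its micro-ops is not stated here, so treat this as an illustration only:

```python
def reg_list(rlist):
    """Registers saved for a given rlist field (assumed RV32I encoding)."""
    if not 4 <= rlist <= 15:
        raise ValueError("rlist values 0-3 are reserved")
    # rlist 4 -> {ra}, 5 -> {ra, s0}, ..., 14 -> {ra, s0-s9}, 15 -> {ra, s0-s11}
    n_sregs = 12 if rlist == 15 else rlist - 4
    return ["ra"] + [f"s{i}" for i in range(n_sregs)]


def stack_adj(rlist, spimm):
    """Stack adjustment in bytes: register area rounded up to 16, plus spimm * 16."""
    base = 16 * ((4 * len(reg_list(rlist)) + 15) // 16)
    return base + 16 * spimm


def push_uops(rlist, spimm):
    """Equivalent instruction sequence for cm.push as assembly strings."""
    uops = []
    offset = -4
    # The highest-numbered register is stored closest to the old stack pointer.
    for reg in reversed(reg_list(rlist)):
        uops.append(f"sw {reg}, {offset}(sp)")
        offset -= 4
    uops.append(f"addi sp, sp, -{stack_adj(rlist, spimm)}")
    return uops


for line in push_uops(rlist=5, spimm=0):
    print(line)
```

Under these assumptions, the largest form (rlist 15) replaces 13 stores plus one stack-pointer update with a single 16-bit instruction.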

Implementation notes

Frontend

The two existing FSMs remain: fetch (fills two 16-bit FIFOs) and issue (forms valid instruction words).
The new uop_fsm is synthesized only when the RISCV_ISA_Zcmp generic is set.
When the decompressor detects a Zcmp instruction, issue_fsm enters a Zcmp state and uop_fsm emits the micro-op sequence.
The frontend outputs are multiplexed; frontend_o selects between them via zcmp_uop_in_progress:
- frontend_bus_issue (normal path)
- frontend_bus_zcmp (uOp path)
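The output selection described above can be modelled behaviorally in a few lines. This is only a sketch of the multiplexing idea, not the VHDL in the PR; the `FrontendBus` fields (`valid`, `instr`) are assumptions made for the example:

```python
from dataclasses import dataclass


@dataclass
class FrontendBus:
    valid: bool  # hypothetical field: instruction word is valid
    instr: int   # hypothetical field: 32-bit instruction word


def frontend_o(frontend_bus_issue, frontend_bus_zcmp, zcmp_uop_in_progress):
    """Forward the uOp path while a Zcmp sequence is running, else the normal path."""
    return frontend_bus_zcmp if zcmp_uop_in_progress else frontend_bus_issue


issue = FrontendBus(valid=True, instr=0x00000013)  # addi x0, x0, 0 (nop)
uop = FrontendBus(valid=True, instr=0xFE812E23)    # sw s0, -4(sp)

print(hex(frontend_o(issue, uop, zcmp_uop_in_progress=False).instr))
print(hex(frontend_o(issue, uop, zcmp_uop_in_progress=True).instr))
```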

New signals from frontend to control-unit:

zcmp_in_uop_seq – Zcmp micro-op sequence active
zcmp_start – sequence begins next cycle
zcmp_atomic_tail – atomic tail section of the sequence (no traps allowed)

Control unit

On zcmp_start, the PC is latched. During the micro-op sequence, exe_engine.pc / pc2 are held.
Trap logic is inhibited when zcmp_atomic_tail is asserted to honor POP/POPRET atomicity.
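As a way to reason about these two rules, the following sketch models them as pure functions. It is a simplification under my own assumptions (one decision per cycle, a single pending-trap flag) and the function names are hypothetical; the real control unit has more state than this:

```python
def trap_allowed(trap_pending, zcmp_atomic_tail):
    """A pending trap is only taken outside the atomic tail."""
    return trap_pending and not zcmp_atomic_tail


def next_pc(pc, instr_bytes, zcmp_in_uop_seq):
    """The PC stays at the Zcmp instruction while its micro-ops execute."""
    return pc if zcmp_in_uop_seq else pc + instr_bytes


# Trace: a 16-bit Zcmp instruction at 0x100 that runs for three micro-op cycles.
pc = 0x100
trace = []
for in_seq in (True, True, True, False):
    pc = next_pc(pc, instr_bytes=2, zcmp_in_uop_seq=in_seq)
    trace.append(pc)

print([hex(p) for p in trace])
```

Keeping the PC at the macro instruction is what lets an aborted sequence be retried from the start, as long as the abort happens before the atomic tail.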

The trap/exception handling across the atomic tail likely needs refinement. I’d appreciate feedback.

Testing

I created test benches that execute every valid combination of the six Zcmp instructions and compare the affected architectural state against known-good outputs from the RISC-V Spike (riscv-isa-sim) reference simulator. For each sequence, the benches verify destination register values, stack-pointer updates, and control-flow effects (cm.popret/cm.popretz, including a0 zeroing).
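For a sense of how large "every valid combination" is, the operand space can be enumerated as below. This is a sketch based on my reading of the Zcmp encoding for RV32I (rlist 4 to 15, a 2-bit spimm, s0 to s7 for the double moves, and cm.mvsa01 requiring two different registers); it is not taken from the PR's test benches, and `zcmp_cases` and `mismatches` are hypothetical helpers:

```python
from itertools import product

RLISTS = range(4, 16)  # valid rlist encodings (0-3 are reserved)
SPIMMS = range(4)      # 2-bit additional stack adjustment
SREGS = range(8)       # s0-s7, the registers addressable by the double moves


def zcmp_cases():
    """Enumerate (mnemonic, operand, operand) tuples for all six instructions."""
    cases = []
    for mnemonic in ("cm.push", "cm.pop", "cm.popret", "cm.popretz"):
        cases += [(mnemonic, rlist, spimm) for rlist, spimm in product(RLISTS, SPIMMS)]
    # cm.mvsa01 writes two s-registers, so they must differ.
    cases += [("cm.mvsa01", r1, r2) for r1, r2 in product(SREGS, SREGS) if r1 != r2]
    # cm.mva01s only reads the s-registers, so both may be the same.
    cases += [("cm.mva01s", r1, r2) for r1, r2 in product(SREGS, SREGS)]
    return cases


def mismatches(dut_state, ref_state):
    """Return {name: (dut, ref)} for every value that differs from the reference."""
    return {k: (dut_state.get(k), v) for k, v in ref_state.items() if dut_state.get(k) != v}


print(len(zcmp_cases()))
print(mismatches({"sp": 0x7FF0, "a0": 1}, {"sp": 0x7FF0, "a0": 0}))
```

The `mismatches` helper shows the comparison step in its simplest form: the reference dictionary would come from a Spike run, and an empty result means the sequence passed.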

The hardware and software implementation is based on the Verilog Zcmp implementation of the Hazard3 RISC-V CPU.

Thanks for taking a look. I am happy to iterate based on your feedback.

stnolting · 2025-11-01T19:49:00Z

Hey @vogma, thanks for the PR and the detailed information!
I saw your work in the network view here on GitHub and followed it with great interest.

> The implementation of this extension has been discussed in #633, but was closed as not planned.

That's right. To be honest, I'm not a fan of the Zcmp extension. An instruction that breaks down into many different instructions is somehow the opposite of what RISC originally stood for. But that's just me. 😉

Anyway, your results look pretty good, and I think that makes a discussion more than appropriate!

Synthesis results

+10% hardware cost - that's not bad for the code savings you have shown!
Can you tell us anything about the maximum clock speed? Does the new extension affect the critical path?

Code-size impact

That's not bad at all. But code size isn't that important to me here. Since many instructions are replaced by a single (16-bit wide) instruction, this should also have a noticeable impact on bus traffic. Have you checked whether, for example, the processor check or Coremark run a little faster as a result?

Testing

That's really great work! I hope the riscv-arch-test TG comes up with some similar tests one day!

Did you check exception handling? What happens if any of the resulting memory instructions cause an exception? What happens when a debug request arrives during a Zcmp instruction?

(This is another point that bothers me about the Zcmp extension: exceptions are no longer precise and can no longer be assigned to a single operation.)

Frontend

I noticed that your code affects many modules in the CPU. Have you checked whether the entire code is optimized away when the ISA extension is disabled?

As I said, I'm not a fan of the Zcmp extension. But I also think that this should be left up to the end user. So I think your PR is good!

I'm just wondering if the RTL changes could be condensed a little more. Basically, you need an FSM between the issue logic and the CPU's execution microsequencer. As soon as a Zcmp instruction is executed, the PC of the execution unit is stalled and the update signal to the issue logic or instruction prefetch buffer is suppressed. The FSM can then send as many instructions as it likes to the execution unit.

So I'm wondering if we could move most of the Zcmp logic into a new module that would be placed within the front-end bus system. 🤔

vogma · 2025-11-03T15:27:22Z

Thank you for the feedback @stnolting

> To be honest, I'm not a fan of the Zcmp extension. An instruction that breaks down into many different instructions is somehow the opposite of what RISC originally stood for. But that's just me. 😉

I agree it's quite different from all the other available RISC-V extensions. I only chose this extension because it hasn't been implemented in this CPU core yet and I wanted to work with the NEORV32 :)

> Can you tell us anything about the maximum clock speed? Does the new extension affect the critical path?

I haven't done extensive Fmax tests yet, but when synthesized with a 100 MHz clock constraint, the Zcmp-enabled implementation has a worst negative slack (WNS) of +1.112 ns, which is slightly better than without the Zcmp extension (WNS = +1.03 ns).
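As a first-order estimate, a positive WNS can be converted into an achievable clock frequency by subtracting it from the constrained period. This is only a sketch: the tools stop optimizing once the constraint is met, so a real Fmax figure would need a run with a tighter constraint. The function name is made up for the example:

```python
def fmax_estimate_mhz(clock_mhz, wns_ns):
    """Estimate Fmax from the constrained clock and the worst negative slack."""
    period_ns = 1000.0 / clock_mhz
    # Positive slack means the critical path is shorter than the clock period.
    return 1000.0 / (period_ns - wns_ns)


with_zcmp = fmax_estimate_mhz(100.0, 1.112)
baseline = fmax_estimate_mhz(100.0, 1.03)

print(f"with Zcmp: {with_zcmp:.2f} MHz")
print(f"baseline:  {baseline:.2f} MHz")
```

With the numbers above this comes out at roughly 112.5 MHz versus 111.5 MHz, a difference small enough to be within run-to-run variation of place and route.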

> Since many instructions are replaced by a single (16-bit wide) instruction, this should also have a noticeable impact on bus traffic. Have you checked whether, for example, the processor check or Coremark run a little faster as a result?

That's a good point. I just noticed that in my branch the coremark executable throws an illegal instruction exception since the latest merge, so I will have to investigate that before I can answer your question.

> Did you check exception handling? What happens if any of the resulting memory instructions cause an exception? What happens when a debug request arrives during a Zcmp instruction?

Only in theory. The specification states that as long as the Zcmp micro-op sequence has not entered the atomic tail section, the sequence can be aborted and retried. If a trap occurs during these memory operations, the Zcmp sequence is aborted. But I have not tested whether the sequence is retried afterwards.

The behaviour during debugger requests is a great point. Thanks for bringing that up.

> (This is another point that bothers me about the Zcmp extension: exceptions are no longer precise and can no longer be assigned to a single operation.)

True. Because the PC is held at the address of the Zcmp instruction, only this macro instruction will be flagged.

> I noticed that your code affects many modules in the CPU. Have you checked whether the entire code is optimized away when the ISA extension is disabled?

Good point. I compared my implementation (Zcmp generic => false) with the main branch: 21 LUTs and 8 registers from my changes are not optimized away.

> So I'm wondering if we could move most of the Zcmp logic into a new module that would be placed within the front-end bus system. 🤔

Yes, I will move as much logic as possible into a separate module. This will probably also fix the unaccounted-for LUTs and registers from above.

Thank you for your feedback. I gained some valuable information and will work on my implementation in the coming weeks.


stnolting · 2025-11-15T15:48:26Z

> I only chose this extension because it hasn't been implemented in this CPU core yet and I wanted to work with the NEORV32 :)

That's cool, and I really appreciate it!

> I haven't done extensive Fmax tests yet, but when synthesized with a 100 MHz clock constraint, the Zcmp-enabled implementation has a worst negative slack (WNS) of +1.112 ns, which is slightly better than without the Zcmp extension (WNS = +1.03 ns).

That sounds good!

> Yes, I will move as much logic as possible into a separate module. This will probably also fix the unaccounted-for LUTs and registers from above.

Thanks a lot! Let me know if I can help in any way.

vogma added 30 commits September 2, 2025 12:10

new start

013606a

zcmp simulation

3d9c27a

added fsm for micro ops

5335103

work

35fad9a

simulation

d85e6f3

added cm.push example program

0f83962

working cm.push simulation

1a05018

working cm.push simulation. bug fixed

efa7bcd

Merge branch 'stnolting:main' into zcmp_extension

b87e99a

modified control unit. Not working

5c19950

cm.push working(?). more tests needed

1666e05

working cm.push on fpga

08bfdb5

zcmp functionality in generate

d0ed3b3

added test code

3b05450

renamed test file

67d2d23

added cm.push support

d24ac33

fixed cm.pop bug

fc2bcd5

added necessary instruction signals for popret and popretz

9f198cf

fixed bug in cm.pop

0bcf277

software tests

b5331bb

ls

6571170

Merge branch 'main' into zcmp_extension

5801d83

frontend old

bef4ee4

zcmp back

04ed917

removed redundancy in control. Added popret support

86779d8

popretz working

b314ee9

double move implemented

88fd990

almost working

97c2a21

commit

ef45182

fixed bug

41da15d

vogma added 22 commits October 7, 2025 10:02

Merge branch 'main' into zcmp_extension

d9ac737

added cm.push tests

25ebfde

added double move test

995ad31

added a0 printout

a8cffef

added mvsa01 test

acaf33c

added self checking test for mvsa01

d3c0c3f

removed old tests

86e0ce9

self checking tb for mvsa01

093cea0

added push self checking tb

22bfcc7

added pop self checking tests

4265e26

zcmp working, all tests passing

a3846ad

refined implementation. VHDL-2008 req removed

185eccb

test bench

944bbb9

Merge branch 'main' into zcmp_extension

771816f

refactoring

3074c88

if_bus_t modified

6dbe4c0

Merge branch 'stnolting:main' into zcmp_extension

64cee61

Merge branch 'main' into zcmp_extension

3b99cd6

refactoring, comments

dc6954e

removed unnecessary modifications

9f86502

removed unused signal

96cd40b

Merge branch 'main' into zcmp_extension

31f999d

Merge branch 'main' into zcmp_extension

578019e

vogma added 2 commits November 8, 2025 20:18

fixed bug in double moves.

c8773b0

During a double-move instruction, exception handling is suspended until the instruction finishes, but the restart-on-branch functionality still had to be implemented. This commit implements restart on branch during double moves.

Merge branch 'main' into zcmp_extension

da485eb

vogma added 2 commits November 30, 2025 13:19

Merge branch 'main' into zcmp_extension

c52a701

fix

c960292
