-
Notifications
You must be signed in to change notification settings - Fork 304
WIP/RFC: Add RISC-V Zcmp code-size reduction extension to NEORV32
#1417
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Hey @vogma, thanks for the PR and the detailed information!
That's right. To be honest, I'm not a fan of the Zcmp extension. An instruction that breaks down into many different instructions is somehow the opposite of what RISC originally stood for. But that's just me. 😉 Anyway, your results look pretty good, and I think that makes a discussion more than appropriate!
+10% hardware costs - that's not bad for the code saving that you have provided!
That's not bad at all. But code size isn't that important to me here. Since many instructions are replaced by a single (16-bit wide) instruction, this should also have a noticeable impact on bus traffic. Have you checked whether, for example, the processor check or Coremark run a little faster as a result?
That's really great work! I hope the riscv-arch-test TG comes up with some similar tests one day! Did you check exception handling? What happens if any of the resulting memory instructions cause an exception? What happens when a debug request arrives during a (This is another point that bothers me about the
I noticed that your code affects many modules in the CPU. Have you checked whether the entire code is optimized away when the ISA extension is disabled? As I said, I'm not a fan of the Zcmp extension. But I also think that this should be left up to the end user. So I think your PR is good! I'm just wondering if the RTL changes could be condensed a little more. Basically, you need an FSM between the issue logic and the CPU's execution microsequencer. As soon as a Zcmp instruction is executed, the PC of the execution unit is stalled and the update signal to the issue logic or instruction prefetch buffer is suppressed. The FSM can then send as many instructions as it likes to the execution unit. So I'm wondering if we could move most of the |
|
Thank you for the feedback @stnolting
I agree it's quite different from all the other available RISC-V extensions. I only chose this extensions because it hasn't been implemented yet into this cpu core and i wanted to work with the NEORV32 :)
I haven't done extensive tests for Fmax yet, but when synthesized with a 100MHz clock signal the zcmp-enabled implementation has a worst-negative-slack (WNS) of +1.112 ns which is a bit better than without the zcmp extension (WNS=+1.03ns).
Thats a good point. I just noticed that in my branch the coremark executable throws a illegal instruction exception since the latest merge so i will have to investigate that before i can answer your question.
Only in theory. The specification states that if the zcmp micro-op instruction sequence is not inside the atomic tail section, the sequence can be aborted and retried. If a trap occurs during these memory operations, the zcmp sequence is aborted. But i have not tested if the sequence is retried after that. The behaviour during debugger requests is a great point. Thanks for bringing that up.
True. Because the PC is held at the address of the Zcmp instruction, only this macro instruction will be flagged.
Good point. I compared my implementation (Zcmp generic => false) with the main branch and 21 LUTs and 8 Registers are not optimized away by my implementation.
Yes i will move as much logic as possible into a separate module. This will probably also fix the unaccounted LUTs and Registers from above. Thank you for your feedback. I gained some valuable information and will work on my implementation in the coming weeks. |
During a double move instruction, exception handling is suspended until the instruction finishes but the restart on branch functionality has to be implemented. This implements the restart on branch during double moves
That's cool, and I really appreciate it!
That sounds good!
Thanks a lot! Let me know if I can help in any way. |
This draft PR adds initial support for the ratified RISC-V Zc subset Zcmp (PUSH/POP and double-move macros) to NEORV32. The implementation of this extension has been discussed in #633, but was closed as not planned. I think this extension is interesting and so I gave the implementation a shot anyway. This is not ready to be merged but I would like to start a discussion on how the implementation can be refined. The current implementation executes all Zcmp instructions in simulation and on hardware. More specifically, the
processor_checkpasses in simulation and the examples insw/examplerun on my Arty A7-100T board.I have summarized some information below. I would appreciate any feedback :)
Synthesis results (Vivado 2025.1)
Baseline configuration:
rv32imc_zicntrWith Zcmp enabled:
rv32imc_zicntr_zcmpTop level
Delta (top): +193 LUTs (+9.95%), +43 FFs (+2.92%)
Code-size impact (
sw/exampleexecutables)processor_checkgame_of_lifecoremark(-Os)bus_explorerThe code size results are rather underwhelming for the example software shipped with the NEORV32. I guess because the examples are quite small, only a few cm.push and pop instructions can be generated by the compiler. I tried to compile the embench benchmark suite which was referenced in the Issue #633, but for now I have not been successful.
Short overview (Zcmp)
Zcmp adds six 16-bit macros:
cm.push/cm.popcm.popret/cm.popretzcm.mvsa01/cm.mva01sImplementation notes
Frontend
Two existing FSMs remain: fetch (fills two 16-bit FIFOs) and issue (forms valid instruction words).
New
uop_fsmis synthesized only whenRISCV_ISA_Zcmpgeneric is set.When the decompressor detects Zcmp,
issue_fsmenters a Zcmp state anduop_fsmemits the micro-op sequence.The frontend outputs are multiplexed:
frontend_bus_issue(normal path)frontend_bus_zcmp(uOp path)frontend_oselects between them viazcmp_uop_in_progress.New signals from frontend to control-unit:
zcmp_in_uop_seq– Zcmp micro-op sequence activezcmp_start– sequence begins next cyclezcmp_atomic_tail– atomic tail section of the sequence (no traps allowed)Control unit
zcmp_start, the PC is latched. During the micro-op sequenceexe_engine.pc/pc2are held.zcmp_atomic_tailis asserted to honor POP/POPRET atomicity.The trap/exception handling across the atomic tail likely needs refinement. I’d appreciate feedback.
Testing
I created test benches that execute every valid combination of the six Zcmp instructions and compare the affected architectural state against known-good outputs from the RISC-V Spike (riscv-isa-sim) reference simulator. For each sequence, the benches verify destination register values, stack-pointer updates, and control-flow effects (cm.popret/cm.popretz, including a0 zeroing).
The Hardware and Software implementation is based on the Verilog Zcmp implementation of the Hazard3 RISC-V CPU
Thanks for taking a look. I am happy to iterate based on your feedback.