initial implementation of the Sail-generated RISCV disassembler module #2498

Draft: wants to merge 5 commits into base: next


@moste00 moste00 commented Oct 4, 2024

Your checklist for this pull request

  • I've documented or updated the documentation of every API function and struct this PR changes.
  • I've added tests that prove my fix is effective or that my feature works (if possible)

Detailed description

This PR aims to replace the LLVM-derived RISCV module with a Sail-derived RISCV module. The generator tool is being developed here, and the Sail model of RISCV is here.

Sail is an architecture description language being developed here. It is an imperative language inspired in syntax and semantics by OCaml, with some syntactic sugar and innovative features designed specifically for describing computer architectures. See here for a detailed tour and explanation of major features.

The RISCV foundation has adopted the Sail model of RISCV as the "official" definition of the architecture. It is therefore desirable to generate a C implementation of any RISCV-related logic from the sail-riscv model, as it will be up to date and compliant by construction.

Test plan

The module does not compile in its current state; this will be updated as work continues. The initial goal is to be able to invoke cstool and obtain useful results (e.g. the instruction in string form, as a start). Hopefully this goal is not too far off.

Closing issues

...

Contributor

@wargio wargio left a comment

Overall this is very good progress. I would suggest, though, splitting arch/RISCV/riscv_ast2str.gen.inc into multiple files since it is too big, maybe by RV32 and RV64.

Contributor

@XVilka XVilka left a comment

@thestr4ng3r take a look too, please, when you have time.

Collaborator

@Rot127 Rot127 left a comment

For the prototype/first implementation the decode is fine like this. But before we can merge it into next we need to optimize two things:

Size

      uint64_t rd = (binary_stream & 0x0000000000000F80)>>7 ;
      uint64_t rs1 = (binary_stream & 0x00000000000F8000)>>15 ;
      uint64_t rs2 = (binary_stream & 0x0000000001F00000)>>20 ;
      tree->ast_node_type = RISCV_RTYPE ;
      tree->ast_node.rtype.rs2 = rs2;
      tree->ast_node.rtype.rs1 = rs1;
      tree->ast_node.rtype.rd = rd;

These specific lines are repeated 10 times in the decoder.
I assume there are other decoding patterns happening just as often. In the final version we should not have any duplicated code in here.
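
One way to remove this kind of repetition, as a minimal sketch: extract the operand fields in a single shared helper and let every R-type pattern call it. The names `rtype_fields`, `extract_bits` and `decode_rtype_fields` are hypothetical and are not part of the generated code; only the bit positions are taken from the masks quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the R-type operand fields quoted above. */
typedef struct {
	uint64_t rd;
	uint64_t rs1;
	uint64_t rs2;
} rtype_fields;

/* Returns bits [hi:lo] (inclusive) of an instruction word. */
static inline uint64_t extract_bits(uint64_t word, unsigned hi, unsigned lo)
{
	assert(hi >= lo && hi < 64);
	unsigned width = hi - lo + 1;
	/* Shifting a 64-bit value by 64 is undefined, so handle that case apart. */
	uint64_t mask = (width == 64) ? ~UINT64_C(0) :
					((UINT64_C(1) << width) - 1);
	return (word >> lo) & mask;
}

/* Decodes rd, rs1 and rs2 once, so each R-type pattern can reuse the result. */
static inline rtype_fields decode_rtype_fields(uint64_t binary_stream)
{
	rtype_fields f;
	f.rd = extract_bits(binary_stream, 11, 7);   /* (x & 0xF80) >> 7 */
	f.rs1 = extract_bits(binary_stream, 19, 15); /* (x & 0xF8000) >> 15 */
	f.rs2 = extract_bits(binary_stream, 24, 20); /* (x & 0x1F00000) >> 20 */
	return f;
}
```

With a helper like this, each of the ten repeated sites would shrink to one call plus the assignment of the node type.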

Runtime complexity

I grepped for ^ if and found 505 if cases in the decode function. This means that for an illegal instruction it does at least 505 comparisons (assuming the compiler doesn't optimize something out), which is more than ~O(n * 10) (n = number of bits).
But we should reach O(n * 1) in the worst case and O(log(n)) on average before we merge it to next.
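
A common way to cut the number of comparisons, shown here only as a sketch and not as the generator's actual output, is to dispatch on fixed fields first. In the base 32-bit encoding the major opcode is bits [6:0], funct3 is bits [14:12] and funct7 is bits [31:25]. The `sketch_insn` enum and `sketch_decode` function are hypothetical names, and only three instructions are covered.

```c
#include <stdint.h>

/* Hypothetical result type; the real decoder fills an AST node instead. */
typedef enum {
	SKETCH_INVALID,
	SKETCH_ADD,
	SKETCH_SUB,
	SKETCH_ADDI
} sketch_insn;

/* Dispatches on the major opcode first, then on funct3/funct7,
 * so an illegal word is rejected after a handful of comparisons. */
static sketch_insn sketch_decode(uint32_t word)
{
	uint32_t opcode = word & 0x7F;         /* bits [6:0] */
	uint32_t funct3 = (word >> 12) & 0x7;  /* bits [14:12] */
	uint32_t funct7 = (word >> 25) & 0x7F; /* bits [31:25] */

	switch (opcode) {
	case 0x33: /* OP: register-register arithmetic */
		if (funct3 == 0 && funct7 == 0x00)
			return SKETCH_ADD;
		if (funct3 == 0 && funct7 == 0x20)
			return SKETCH_SUB;
		return SKETCH_INVALID;
	case 0x13: /* OP-IMM: register-immediate arithmetic */
		if (funct3 == 0)
			return SKETCH_ADDI;
		return SKETCH_INVALID;
	default:
		return SKETCH_INVALID;
	}
}
```

A compiler will typically turn the outer `switch` into a jump table or a small comparison tree, so the cost depends on the depth of the nesting rather than on the total number of patterns.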

The current structure is fine, also because you have the RzIL task as well. So no worries.

What is important, though, is that the decoded details (operand details) are stable, no matter what the architecture of this decoder looks like. Once you have finished RzIL, we would not want to refactor the whole RzIL work just because we optimized the Capstone decoder :)

That said, good job! Looks like a lot of work! Well done!

@@ -0,0 +1,3 @@
#include "capstone.h"
Collaborator

Suggested change
- #include "capstone.h"
+ #include <capstone/capstone.h>

RISCV_AMOMAXU
} op;

uint8_t aq /* bits : 1 */;
Collaborator

These /* bits : 1 */ comments, what do they mean?

  1. aq encodes bit 1 of instruction.
  2. aq is one bit wide.

Please make this more clear. E.g. for the first meaning you could replace it with insn_bits[1:1]. And for the second meaning: bit_width : 1.

Collaborator

I guess it means the latter, though being more descriptive doesn't hurt. This one has a low priority though.

Contributor

XVilka commented Oct 23, 2024

@moste00 please update the PR with your latest state of the generated code


RISCV_INS_ENDING,
} riscv_insn;
#include "riscv_insn.gen.inc"
Contributor

this is not ok, but we can write a script to update this.

Collaborator

@Rot127 Rot127 left a comment

Very much like what I see! Awesome job!
Especially the changes to riscv_decode.gen.inc :)

Please focus on just making it work. You can ignore my comments for now. They are just there so we don't forget about it.

Because I could only take a shallow look, I'll check again in the next few days.

@@ -2,26 +2,38 @@
/* RISC-V Backend By Rodrigo Cortes Porto <[email protected]> &
Collaborator

Set your copyright. Please use SPDX header style:

# Copyright © 2022 Rot127 <[email protected]>
# SPDX-License-Identifier: BSD-3

#include <string.h>

enum riscv_insn {
//--------------------- RISCV_REV8---------------------
Contributor

Maybe think about converting these into Doxygen, if it's ever possible?

*ps = " , "; \
*plen = 3

static inline void hex_bits(uint64_t bitvec, char **s, size_t *len,
Contributor

Add Doxygen comments for these helper functions. Also, maybe move to the capstone utils instead? cc @Rot127

Collaborator

Yes, please move it to utils.c.

Collaborator

Actually, you can just use sprintf(). We depend on libc anyways.

Collaborator

Sorry, ignore it. See my comment in the final review message.

Collaborator

@Rot127 Rot127 left a comment

  • For all the string operations: it is better to use SStream everywhere, and not to operate directly on char *. This is what it was implemented for anyway. It is better tested and convenient to use.

Edit: Sorry, pressed the "review" button by accident. Will add some more comments.

Collaborator

@Rot127 Rot127 left a comment

Well done! This is a really nice basis to develop further.
I just pushed a new experimental branch (based on newest next). Please rebase your PR on top of it.


typedef struct riscv_conf {
Void2Bool sys_enable_fdext;
Void2Bool sys_enable_zfinx;
Collaborator

Doxygen please. Also for the Void2Bool callback.

@moste00 moste00 force-pushed the riscv_disassembly_using_sail branch from eff6b64 to 3e533b8 Compare February 14, 2025 21:06
@moste00 moste00 requested a review from Rot127 February 14, 2025 21:07
}
str_len += 2; // for the '0x' in the beginning

CS_ASSERT(str_len > 0);
Collaborator

Suggested change
- CS_ASSERT(str_len > 0);

Always true.

Comment on lines +46 to +83
#define DEF_HEX_BITS(n) \
static inline void hex_bits_##n(uint64_t bitvec, SStream *ss, \
RVContext *ctx) { \
hex_bits(bitvec, n, ss, ctx); \
}

DEF_HEX_BITS(1)
DEF_HEX_BITS(2)
DEF_HEX_BITS(3)
DEF_HEX_BITS(4)
DEF_HEX_BITS(5)
DEF_HEX_BITS(6)
DEF_HEX_BITS(7)
DEF_HEX_BITS(8)
DEF_HEX_BITS(9)
DEF_HEX_BITS(10)
DEF_HEX_BITS(11)
DEF_HEX_BITS(12)
DEF_HEX_BITS(13)
DEF_HEX_BITS(14)
DEF_HEX_BITS(15)
DEF_HEX_BITS(16)
DEF_HEX_BITS(17)
DEF_HEX_BITS(18)
DEF_HEX_BITS(19)
DEF_HEX_BITS(20)
DEF_HEX_BITS(21)
DEF_HEX_BITS(22)
DEF_HEX_BITS(23)
DEF_HEX_BITS(24)
DEF_HEX_BITS(25)
DEF_HEX_BITS(26)
DEF_HEX_BITS(27)
DEF_HEX_BITS(28)
DEF_HEX_BITS(29)
DEF_HEX_BITS(30)
DEF_HEX_BITS(31)
DEF_HEX_BITS(32)
Collaborator

This generates too many functions :D
If you have hex_bits_X() in the generated code, it is better to pass the number of bits as argument.

Author

Sorry for sleeping on this review. Was finishing up the generator.

Yeah, there are a lot of those hex_bits_* functions, but their implementations are all simple one-liners that delegate to their mother function, which does take a bit width parameter. I guess I'm doing this in direct imitation of the Sail implementation, which does something very similar using mappings.

I can't control how it's called from the generated code because the generator doesn't understand anything about hex_bits_X. It just sees an opaque function that it couldn't derive automatically (only very simple mappings are parsed and understood as basically string->string tables), so it emits the call as-is with all the arguments plus an SS buffer and a context argument. It's up to some human-in-the-loop to implement those opaque functions so that the calls make sense. I implemented them as thin wrappers over a general function which does take the bit width parameter.

Comment on lines +33 to +36
char digit = (bitvec & 0xF) + 48;
if (digit > '9') {
digit += ('a' - ':');
}
Collaborator

Can be done with something like "0123456789abcdef"[bitvec & 0xf].
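
A minimal sketch of that lookup-string idea, combined with passing the bit width as a parameter. `format_hex_bits` is a hypothetical name; it writes into a plain char buffer instead of an SStream so the example is self-contained, and it assumes the output should be "0x" followed by the value zero-padded to ceil(width / 4) digits, which may differ from what the PR's hex_bits prints.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes "0x" plus the low width_bits of bitvec as lowercase hex into out.
 * Returns the string length written, or 0 if the buffer is too small. */
static size_t format_hex_bits(uint64_t bitvec, unsigned width_bits,
			      char *out, size_t out_size)
{
	static const char digits[] = "0123456789abcdef";
	assert(width_bits >= 1 && width_bits <= 64);

	size_t ndigits = (width_bits + 3) / 4; /* one digit per 4 bits, rounded up */
	size_t total = 2 + ndigits;            /* "0x" prefix plus digits */
	if (out_size < total + 1)
		return 0;

	/* Drop any bits above the requested width. */
	if (width_bits < 64)
		bitvec &= (UINT64_C(1) << width_bits) - 1;

	out[0] = '0';
	out[1] = 'x';
	/* Fill digits from the least significant nibble backwards. */
	for (size_t i = 0; i < ndigits; i++) {
		out[total - 1 - i] = digits[bitvec & 0xF];
		bitvec >>= 4;
	}
	out[total] = '\0';
	return total;
}
```

Indexing the string literal replaces the `+ 48` arithmetic and the `'a' - ':'` adjustment with a single table lookup.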

Author

Very nice, thanks

Comment on lines +153 to +156
if (ma) {
SStream_concat(ss, "ma");
return;
}
Collaborator

Suggested change
- if (ma) {
- SStream_concat(ss, "ma");
- return;
- }
+ if (!ma) {
+ return;
+ }
+ SStream_concat(ss, "ma");

General rule, returning early from errors makes the code look cleaner.

#include <stdint.h>

typedef uint8_t (*Void2Bool)(void);
typedef uint8_t RVBool;
Collaborator

It seems to be only used as bool. Better replace RVBool with bool.
Otherwise the compiler can't apply possible optimizations.

Author

Is there a standard bool in the C version we're using though? I thought C before C99 doesn't have bools and even C99 fakes it by a typedef, and only C11 or something later introduces a true bool.

Is it okay if I used C99/C11 though, knowing that Capstone is compiled for so many architectures and by so many compilers? I remember that I had to change the generator code using binary number literals some time ago because some compilers don't support them.
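
For reference, C99 already provides a built-in boolean type, `_Bool`, and `<stdbool.h>` defines `bool`, `true` and `false` as macros on top of it. The sketch below assumes a C99 compiler is available, which is a project decision this thread does not settle; `always_enabled`, `flag_as_bool` and `flag_as_u8` are hypothetical helpers used only to show the difference from a `uint8_t` typedef.

```c
#include <stdbool.h>
#include <stdint.h>

/* Callback type from the PR, rewritten to return bool (assumes C99). */
typedef bool (*Void2Bool)(void);

/* Hypothetical configuration callback that reports a feature as enabled. */
static bool always_enabled(void)
{
	return true;
}

/* Conversion to bool yields 1 for any non-zero value. */
static bool flag_as_bool(unsigned value)
{
	return value;
}

/* Conversion to uint8_t keeps only the low 8 bits, so 0x100 becomes 0. */
static uint8_t flag_as_u8(unsigned value)
{
	return (uint8_t)value;
}
```

The practical difference is that a `uint8_t` flag can silently turn a non-zero value such as 0x100 into 0, while `bool` cannot.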


// VERY HACKY: use op_str as a temporary buffer to serialize the instruction struct
// so that the printer callback can later de-serialize it in order to stringify it
// alternatives: dynamic memory, global/static variables,
Collaborator

global/static variables

I highly advise never getting started with them :D

Author

Yeah :'D, just wanted to throw all the alternatives out there, wasn't thinking seriously about them.

I guess what I'm doing is kinda okay and isn't that brittle because I'm not serializing that binary buffer anywhere or anytime else, I'm just binary-dumping a struct then reviving it from the binary dump in another function of the same running program. It's still very hacky and could make a lot of people go "wtf" when reading it, but I legit thought about it so hard for days and couldn't find anything better than that or malloc, and I really hate malloc.
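
The round trip described here can be sketched roughly as follows. `sketch_ast`, `stash_ast` and `revive_ast` are hypothetical names, the struct is a simplified stand-in for the real AST node, and the 160-byte buffer size is an assumption about op_str rather than something stated in this thread.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed size of the op_str buffer; check the real cs_insn definition. */
#define SKETCH_OP_STR_SIZE 160

/* Simplified stand-in for the decoded instruction struct. */
typedef struct {
	int node_type;
	uint64_t rd;
	uint64_t rs1;
	uint64_t rs2;
} sketch_ast;

/* Decoder side: dump the struct bytes into the string buffer. */
static void stash_ast(char *op_str, size_t op_str_size, const sketch_ast *tree)
{
	assert(sizeof(*tree) <= op_str_size);
	memcpy(op_str, tree, sizeof(*tree));
}

/* Printer side: rebuild the struct from the same bytes. */
static sketch_ast revive_ast(const char *op_str)
{
	sketch_ast tree;
	memcpy(&tree, op_str, sizeof(tree));
	return tree;
}
```

Using `memcpy` in both directions, rather than casting the char buffer to a struct pointer, sidesteps alignment and strict-aliasing problems. Because both sides run in the same process, padding and endianness are identical and do not need special handling.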
