This repository was archived by the owner on Jan 7, 2020. It is now read-only.

The matching system explained

Andrea Fioraldi edited this page Jan 29, 2018 · 6 revisions

If you look inside a JSON file produced by Carbonara-CLI, you can see that every analyzed procedure has two fields: vexhash and flowhash.

They are MinHash objects used for the procedure matching system in Carbonara.
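To give an idea of why MinHash fits this job, here is a minimal sketch that estimates the Jaccard similarity of two sets from fixed-size signatures. It is not Carbonara's implementation: the `minhash` and `similarity` helpers, the salted BLAKE2b hashes and the 128-slot signature size are all assumptions made for the example.

```python
import hashlib

NUM_PERM = 128  # assumed signature size, not taken from Carbonara


def _hash(item: str, seed: int) -> int:
    # Salted 64-bit hash; each seed plays the role of one random permutation.
    digest = hashlib.blake2b(
        item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(items, num_perm: int = NUM_PERM):
    """Return the MinHash signature of a non-empty collection of strings."""
    unique = set(items)
    return [min(_hash(x, seed) for x in unique) for seed in range(num_perm)]


def similarity(sig_a, sig_b) -> float:
    """Estimate the Jaccard similarity as the fraction of matching slots."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


first = minhash(["CALL OUT:1", "JMP 2", "JMP OUT:0"])
second = minhash(["JMP OUT:0", "CALL OUT:1", "JMP 2"])
print(similarity(first, second))  # same set of items, so every slot matches
```

Two procedures can then be compared by comparing their signatures, without keeping the original sets around.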

From the procedure code to the hash

  1. recognition of jump and call instructions
  2. API recognition

ex. if 0x4000 is the address of the import table entry for printf, the translator's internal state associates that address with the string API:printf

  1. address abstraction

ex. each address outside the procedure that is targeted by a jump or a call is associated with a number that preserves the relative order of the addresses

JMP 0x4010           OUT:1
CALL 0x4000   --->   OUT:0
JMP 0x4200           OUT:2
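
The mapping in the example can be sketched as follows, assuming (as the example suggests) that the number is the rank of the target among the external targets sorted by address. The `abstract_targets` function and the `imports` table are hypothetical names used only for this illustration.

```python
def abstract_targets(targets, imports=None):
    """Map each external target address to an 'API:name' or 'OUT:n' label.

    targets: addresses targeted by jumps/calls that leave the procedure.
    imports: hypothetical {address: api_name} table filled by API recognition.
    """
    imports = imports or {}
    # Non-API targets are ranked by ascending address, so the labels depend
    # on the relative order of the targets and not on their absolute values.
    outside = sorted({t for t in targets if t not in imports})
    rank = {addr: i for i, addr in enumerate(outside)}
    return {
        t: f"API:{imports[t]}" if t in imports else f"OUT:{rank[t]}"
        for t in targets
    }


labels = abstract_targets([0x4010, 0x4000, 0x4200])
for addr in (0x4010, 0x4000, 0x4200):
    print(hex(addr), "->", labels[addr])
```

Because only the ranks are kept, the same procedure loaded at a different base address still produces the same labels.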
  1. jumps inside the procedure are used to construct a basic block graph; each relative address is associated with the number of hops in the graph needed to reach the jump target

  2. flowhash is the MinHash of the list generated in the 3 previous steps (API recognition, address abstraction, hop counting), which reports the jumps and calls with their abstracted addresses in order of appearance.

ex. (considering a function at address 0x6000 and the GOT entry of puts at 0x1000)

0x6000: CALL 0x4800          CALL OUT:1
0x6003: ...
 ...
0x6010: JLE 0x6033           JMP 2
 ... 
0x6030: JMP 0x4200   --->    JMP OUT:0
0x6033: ...
 ...
0x6060: CALL 0x1000          CALL API:puts
0x6063: JMP 0x6003           JMP 5
0x6066: RET
  1. instructions not supported by the VEX IR (like the x86 hlt) are substituted with placeholders
  2. VEX IR lifting (IR borrowed from Valgrind)
  3. register juice (indexing based on register order of appearance)

ex. (I use an assembly-like language because VEX is not very readable)

MOV EAX, 3            MOV R0, 3           MOV ECX, 3
POP EBX        --->   POP R1       <---   POP EDI
ADD EAX, EBX          ADD R0, R1          ADD ECX, EDI
RET                   RET                 RET
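
The renaming in the example can be sketched on the same assembly-like text. This is only an illustration: the real step works on VEX IR, and the `REGISTERS` set and `normalize_registers` function are names invented here.

```python
import re

# Hypothetical register set for the toy assembly used in the example.
REGISTERS = {"EAX", "EBX", "ECX", "EDX", "ESI", "EDI", "EBP", "ESP"}


def normalize_registers(lines):
    """Rename registers to R0, R1, ... in order of first appearance."""
    mapping = {}

    def rename(match):
        token = match.group(0)
        if token.upper() not in REGISTERS:
            return token  # mnemonics and other words are left untouched
        if token.upper() not in mapping:
            mapping[token.upper()] = f"R{len(mapping)}"
        return mapping[token.upper()]

    return [re.sub(r"\b[A-Za-z_]\w*\b", rename, line) for line in lines]


left = normalize_registers(["MOV EAX, 3", "POP EBX", "ADD EAX, EBX", "RET"])
right = normalize_registers(["MOV ECX, 3", "POP EDI", "ADD ECX, EDI", "RET"])
print(left == right)  # both sides reduce to the same normalized listing
```

After this step, two procedures that differ only in register allocation produce identical text and therefore identical hashes.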
  1. substitute the addresses with the abstracted addresses used to build flowhash
  2. substitute the addresses that correspond to the program counter increment of the fetch phase with a placeholder

ex. PUT(_PC_) = 0x6040 ---> PUT(_PC_) = _NEXT_
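
A minimal sketch of this substitution, assuming the statements are available in the textual form used in the example and that the address of the next instruction is known. `mask_next_pc` and its `next_addr` parameter are hypothetical; the real code operates on VEX statement objects, not on strings.

```python
import re

PC_PUT = re.compile(r"PUT\(_PC_\) = (0x[0-9a-fA-F]+)")


def mask_next_pc(statement: str, next_addr: int) -> str:
    """Replace a PC update that only points to the next instruction."""
    match = PC_PUT.fullmatch(statement.strip())
    if match and int(match.group(1), 16) == next_addr:
        return "PUT(_PC_) = _NEXT_"
    return statement  # any other statement is left as it is


print(mask_next_pc("PUT(_PC_) = 0x6040", next_addr=0x6040))
```

The placeholder removes an address that would otherwise change every time the procedure is located somewhere else.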

  1. group VEX instructions into bigrams to preserve a partial ordering

  2. vexhash is the MinHash representing the set of bigrams.
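
To see why bigrams are used instead of single instructions, the sketch below compares two listings that contain the same instructions in a different order. It uses the exact Jaccard similarity (the quantity that MinHash approximates) and the assembly-like notation of the earlier example; `bigrams` and `jaccard` are illustrative helpers, not Carbonara functions.

```python
def bigrams(statements):
    """Return the set of pairs of adjacent statements."""
    return set(zip(statements, statements[1:]))


def jaccard(a, b) -> float:
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if a or b else 1.0


original = ["MOV R0, 3", "POP R1", "ADD R0, R1", "RET"]
reordered = ["POP R1", "MOV R0, 3", "ADD R0, R1", "RET"]

# As plain sets the two listings are indistinguishable.
print(jaccard(set(original), set(reordered)))
# As sets of bigrams only the (ADD, RET) pair is shared.
print(jaccard(bigrams(original), bigrams(reordered)))
```

A set alone carries no order, so hashing adjacent pairs is a cheap way to keep some of the instruction ordering in the signature.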
