-
Notifications
You must be signed in to change notification settings - Fork 6.3k
Initial nearly-minimal changes to assembler for Pickled Canary #5801
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This starts to address issue NationalSecurityAgency#5800.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if you're still working on this draft, but I have some comments for you in its current state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a separate place for developer documentation like this. We'll have to decide whether this is suitable for inclusion in the repo and then where to actually put it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed this file (and other documentation) until we know where it should go. It's still available here: https://github.com/mitre/pickled-canary/tree/main/src/main/java/org/mitre/pickledcanary/assembler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not the right place for this kind of documentation. Also, assuming this is built from the corresponding .txt file, this render should probably not be included in the repo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed this file (and other documentation) until we know where it should go. It's still available here: https://github.com/mitre/pickled-canary/tree/main/src/main/java/org/mitre/pickledcanary/assembler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll need to decide if/where to place this in the repo. Also, we'd probably want some task in our gradle scripts to (re-)render it to .svg.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed this file (and other documentation) until we know where it should go. It's still available here: https://github.com/mitre/pickled-canary/tree/main/src/main/java/org/mitre/pickledcanary/assembler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming this file (and perhaps the other documentation files) were copied into this (draft) PR in haste. This file does not likely apply to everything in the package containing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed this file (and other documentation) until we know where it should go. It's still available here: https://github.com/mitre/pickled-canary/tree/main/src/main/java/org/mitre/pickledcanary/assembler
This is a modified version of Ghidra's (v10.2) assembler package for Pickled | ||
Canary. | ||
|
||
Changes Copyright (C) 2023 The MITRE Corporation All Rights Reserved |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naturally, we become concerned about this line, esp., since changes will become muddled in future history. The changes you submit must be releasable under the Apache Software License 2.0, so at least those rights must be granted. If you're just looking for credit for your contributions, there's probably a better way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm working with others on the team here to get clarity on this from our end. Thanks for your patience!
.filter(WildcardNode.class::isInstance) | ||
.map(WildcardNode.class::cast) // to WildcardNodes | ||
.peek(node -> { | ||
String nodeName = node.getName(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be worthwhile to create a method and refer to it rather than using a lambda here, just to avoid the deep indentation. Additionally, consider using a text-block string, instead of the multi-line string concatenation, for the exception's message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should work on the assembler before we talk about inclusion of the "pickled canary" query language. That's definitely a nice use case, and seems like a good fit for inclusion in Ghidra, but two things comes to mind:
- This is the first we've seen someone needing to extend the assembler in a way its API cannot currently support. It'd be nice to design this in a more abstract sense to better accommodate additional extensions in the future. Wildcards are definitely useful, but your changes are introducing dependencies on types found in your query language. I'd rather design some interface that your implement. Generics may also be suitable, but those tend to make things complicated.
- Wildcards are useful just on their own, even without the surrounding query language. Getting one PR done with the extensions to the assembler would be a good milestone and clearly add a valuable feature. The query language you've designed is a much larger feature. Discussing all the design and code around it may take a good bit more time. I wouldn't want the wildcard feature to get held up because it's in the same PR as the query language.
Thus, I've neglected to comment much further on anything in your querylanguage
package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds perfectly reasonable. I think it leads us to focus on the question of what exactly the extended assembler API should look like. As it stands I had to pull in the querylanguage
to have the non-string InstructionNode
as input to resolveLine
. Do you have an idea for what type of input you'd like to have to this function that's more flexible than simply a string? Or, do you want to keep it a string and do the parsing out of special wildcard syntax within the processing of resolveLine
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking about standardizing the idea of a wildcard using your proposed syntax, i.e., the backticked label. So long as no processor language spec needs that character, we should be good. There's some nuance I suppose you've already run into, e.g., we probably shouldn't allow wildcarding of opcode mnemonics. Within the "wildcard," the client API can specify whatever language it wants. The assembler's only job will be to assemble and capture operand info within the wildcarded sub-tree.
Regarding whether to allow wildcarding of non-terminals (sleigh sub-tables), I think it's doable using the AssemblyHiddenConstructStateGenerator
. That said, I wouldn't recommend it. For some background the purpose of that class is to handle cases where a processor spec invokes a sub-table without displaying its print piece in the mnemonics: e.g., something like:
:MYOP opn1, opn2 is check_ctx opn1 opn2 [] {}
My syntax may be off a bit, but this happens commonly when a language needs to encapsulate some context pre-checks and/or transformations into a sub constructor. it'll invoke the constructor but not print it. Thus, the assembler has nothing to go on, so it just generates all the possibilities hoping it can solve one or more of them. Thus, it's capable of generating all sub-trees for wildcarded NTs.
The reason I don't recommend it is it's very slow, depending on what get's wildcarded. We actually had to patch up a couple of our language specs to avoid this situation or at least shrink the size of the sub-tree. A single instruction could easily take tens of seconds :( . If wildcarding only terminals suffices, we might just stick with it for now. Thoughts?
Regarding other options to pass to resolveLine
, you might look into resolveTree
. The idea there is you've already done the (potentially customized) parsing and are ready to process the semantics, i.e., generate the machine code.
Another thought: Your insight that operand info is in essence the same as backfill actually inspires another possible use case. A script could assemble a template, leaving say an address wildcarded. Then, to use that template to generate an instruction, only the operand needs to be filled in each time, instead of re-parsing and resolving the full tree each time. If I'm not mistaken, proving out that use case in our base assembler would provide all the features necessary for Pickled Canary to use it as a client without further extension.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So long as no processor language spec needs that character, we should be good.
I haven't done an exhaustive search of public processor language specs to see if this is the case, but I don't think it's a common character in assembly syntaxes. That being said, we could leave an escape hatch along the lines of python's %%
means a single %
in their format strings, so in our case perhaps two consecutive backticks reduces to a plain-text backtick.
Regarding whether to allow wildcarding of non-terminals (sleigh sub-tables)
I wonder if we can leave ourselves a path towards allowing this to be opt-in in the future though. I can see it being super useful for a number of use cases, even if it's slow (and if it's important enough, perhaps it'll be optimized in the future). The question is how to do this. If you're leaning towards only a "backticked label" then how would we (in the future) differentiate the two options?
you might look into
resolveTree
.
I'll take a look.
Another thought: Your ...
Sounds like that idea could certainly speed up some repeated-assembly use cases. I'd have to think about your last sentence some more to know 100%. The information we'd get back from the assembler in that case would have to include the mask(s) and expression to compute the possible fill-in (or read-out-of-the-bits) values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeping a path open for wildcarded NTs sounds reasonable. As for distinguishing when to enable it, perhaps some prefix in the label itself, e.g., regular wildcard would be `Q1`
and a sub-tree (NT) wildcard could be `tree!Q1`
. I'm open to alternatives, if you have an idea.
Yes, I agree about keeping the masks and expressions, tagged with whatever the label was. All retrievable from the root AssemblyResult
. I think we're pretty close to an actual spec for assembler changes? Once we agree on one, I'll enter the internal tickets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a bit of information on Pickled Canary's current wildcard syntax here: https://github.com/mitre/pickled-canary#wildcard-format The short of it that something like LEA RBP,[`Q1/R..` + 0x60]
will allow Q1 to only be populated with three-letter registers starting with "R". Another example (taken from here: https://github.com/mitre/pickled-canary/blob/main/example_patterns/CVE-2015-3145/mips_le_CVE-2015-3145_patched.ptn) is lb `Q1/v.`,0x0(`Q2/s.`)
which again constrains each of the loads to be only from a specific subset of registers. [There are some other more complicated examples, but I don't have them here with me at the moment, and they may be hard to share.]
In some instructions not enumerating all the other options saves a lot of time (especially if there are multiple wildcards in the instruction which exponentially grow the number of possible completions).
if we wish to avoid re-building the parser for different use cases
Would the callback approach avoid this. The parser just knows to outsource the backtick parsing?
If I'm following what you're saying, the generator side may also have a new interface that'd allow for things like filtering, but the question is how to get that information from the parsing side to the generating side. If the parsing side can return an object conforming to an interface which is later passed to the user-supplied (or perhaps default, depending on the use case) generator "callback/plugin" (for lack of a better term) the generator side could do whatever it'd like with the parsed-out metadata
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Outsourcing the parsing to a callback I think would only work for terminal-only wildcards. I almost certain such an interface would close off the idea of non-terminal wildcards later. The callback would be consuming the token stream without regard to how it affects the parser's internal state. This might even affect the terminal-only case, because I don't have a clear separation between lexer and parser (because of things like J^cc
representing a single mnemonic "token"). I'm not totally against it, I just don't want to hand you a foot gun.
I already see your AssemblyParseMachineWild
thing.... The original AssemblyParseMachine
was a complicated bit of code, and I think a better interface would not require an extension to replicate or understand it. While it still requires some expertise, I think grammar generation is a more suitable place for extensibility. (Maybe rebuilding the parser it is unavoidable for these extended cases.) For example, if the SLEIGH file specifies a production I => "ADD" RegOp MemOp
, and your extension decides MemOp
can be wildcarded, then you could add the production MemOp => WildCard
. This more naturally specifies the idea that any memory operand (no matter how many tokens are involved in it) can be replaced by a single wildcard token, and the parser generator will do all the right things to ensure neither Noam nor Alan are upset.
I might try my own hand at this. 1) To try to better communicate my ideas concretely. 2) To get an appreciation for the troubles you've been having. I'm thinking something like extending and overriding SleighAssemblerBuilder#buildSubGrammar(SubtableSymbol)
. (Perhaps I'll factor a method that might allow me to hook right before the subgrammar.addProduction()
line.) I can then insert any class that extends AssemblySymbol
into the grammar, which should provide me most of the control I need in the subsequent steps. I'll have to see where things go from there....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was able to scrap something together with some success, with not too much damage to the existing code. Most of my changes (aside from completely new files) are allowing overrides to inject custom classes. I'll see about pushing a branch out so that you can see what I've done before it's merged, and we can talk. I haven't worked out how to collect the operand info yet, but I've noticed that the AssemblyResolvedPatterns object has children. They just aren't accessible. I suspect we could use that mechanism to recover the info you need.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool! Looking forward to checking your changes out!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for all the trouble. I've created a private fork and shared the branch with you in it. Let's continue all this over there.
gradle/support/ip.gradle
Outdated
@@ -135,6 +135,7 @@ def Map<String, List<String>> getIpForModule(Project p) { | |||
exclude "**/data/build.xml" // language build file (generated for dev only) | |||
exclude "**/.vs/**" | |||
exclude "**/*.vcxproj.user" | |||
exclude "**/assembler/Docs/**" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to raise more than my eyebrows on our team. Of course, this goes back to whether or not we should include that documentation in the repo. We've actually started some long term discussions about how to consolidate and normalize documentation like this among our team. We have stuff in Docbook, Markdown, HTML, LaTeX, beamer, .... The documentation you propose to include would introduce yet another. It's probably best not to poke that bear for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Undone for now. :-)
@@ -37,7 +38,12 @@ public class TableEntryKey implements Comparable<TableEntryKey> { | |||
*/ | |||
public TableEntryKey(int state, AssemblySymbol sym) { | |||
this.state = state; | |||
this.sym = sym; | |||
if (sym instanceof AssemblyWildcardTerminal) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Java 17, you can do sym instanceof AssemblyWildcardTerminal awt
, and skip the cast on the next line.
@@ -24,6 +24,7 @@ | |||
import ghidra.app.plugin.assembler.sleigh.sem.*; | |||
import ghidra.app.plugin.assembler.sleigh.symbol.AssemblyNumericSymbols; | |||
import ghidra.app.plugin.assembler.sleigh.util.DbgTimer; | |||
import ghidra.app.plugin.querylanguage.lexer.ast.InstructionNode; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather not see a dependency from assembler to your query language. We should try to design an interface instead.
…e rather than depending on querylanguage.
@plucia-mitre Is it worth to rebase on e7458ed? |
@jobermayr yes, absolutely, this work is already in progress! Thanks for the ping! :-) |
This starts to address issue #5800.
More code related to this change can be found here: https://github.com/mitre/pickled-canary including some integration-level tests and more of the surrounding infrastructure which would eventually use these changes.
I'm sure this code isn't yet perfect, but it works well enough for many use cases.
A specific area needing more attention is the handling of "context".