Initial nearly-minimal changes to assembler for Pickled Canary #5801

plucia-mitre · 2023-09-18T21:39:27Z

This starts to address issue #5800.

More code related to this change can be found here: https://github.com/mitre/pickled-canary including some integration-level tests and more of the surrounding infrastructure which would eventually use these changes.

I'm sure this code isn't yet perfect, but it works well enough for many use cases.

A specific area needing more attention is the handling of "context".

This starts to address issue NationalSecurityAgency#5800.

nsadeveloper789

Not sure if you're still working on this draft, but I have some comments for you in its current state.

nsadeveloper789 · 2023-10-10T17:19:31Z

...rc/main/java/ghidra/app/plugin/assembler/Docs/The New Ghidra Assembler and Modifications.pdf

We have a separate place for developer documentation like this. We'll have to decide whether this is suitable for inclusion in the repo and then where to actually put it.

I've removed this file (and other documentation) until we know where it should go. It's still available here: https://github.com/mitre/pickled-canary/tree/main/src/main/java/org/mitre/pickledcanary/assembler

nsadeveloper789 · 2023-10-10T17:20:29Z

...oftwareModeling/src/main/java/ghidra/app/plugin/assembler/Docs/assemblerCallTraceDiagram.svg

This is not the right place for this kind of documentation. Also, assuming this is built from the corresponding .txt file, this render should probably not be included in the repo.

I've removed this file (and other documentation) until we know where it should go. It's still available here: https://github.com/mitre/pickled-canary/tree/main/src/main/java/org/mitre/pickledcanary/assembler

nsadeveloper789 · 2023-10-10T17:21:30Z

...oftwareModeling/src/main/java/ghidra/app/plugin/assembler/Docs/assemblerCallTraceDiagram.txt

We'll need to decide if/where to place this in the repo. Also, we'd probably want some task in our gradle scripts to (re-)render it to .svg.

I've removed this file (and other documentation) until we know where it should go. It's still available here: https://github.com/mitre/pickled-canary/tree/main/src/main/java/org/mitre/pickledcanary/assembler

nsadeveloper789 · 2023-10-10T17:24:08Z

Ghidra/Framework/SoftwareModeling/src/main/java/ghidra/app/plugin/assembler/README.md

I'm assuming this file (and perhaps the other documentation files) were copied into this (draft) PR in haste. This file does not likely apply to everything in the package containing it.

I've removed this file (and other documentation) until we know where it should go. It's still available here: https://github.com/mitre/pickled-canary/tree/main/src/main/java/org/mitre/pickledcanary/assembler

nsadeveloper789 · 2023-10-10T17:32:07Z

Ghidra/Framework/SoftwareModeling/src/main/java/ghidra/app/plugin/assembler/README.md

+This is a modified version of Ghidra's (v10.2) assembler package for Pickled
+Canary.
+
+Changes Copyright (C) 2023 The MITRE Corporation All Rights Reserved


Naturally, we become concerned about this line, esp., since changes will become muddled in future history. The changes you submit must be releasable under the Apache Software License 2.0, so at least those rights must be granted. If you're just looking for credit for your contributions, there's probably a better way.

I'm working with others on the team here to get clarity on this from our end. Thanks for your patience!

nsadeveloper789 · 2023-10-10T18:29:43Z

...ramework/SoftwareModeling/src/main/java/ghidra/app/plugin/querylanguage/lexer/ParseTree.java

+				.filter(WildcardNode.class::isInstance)
+				.map(WildcardNode.class::cast) // to WildcardNodes
+				.peek(node -> {
+					String nodeName = node.getName();


It might be worthwhile to create a method and refer to it rather than using a lambda here, just to avoid the deep indentation. Additionally, consider using a text-block string, instead of the multi-line string concatenation, for the exception's message.

nsadeveloper789 · 2023-10-10T18:42:25Z

Ghidra/Framework/SoftwareModeling/src/main/java/ghidra/app/plugin/querylanguage/Defaults.java

I think we should work on the assembler before we talk about inclusion of the "pickled canary" query language. That's definitely a nice use case, and seems like a good fit for inclusion in Ghidra, but two things comes to mind:

This is the first we've seen someone needing to extend the assembler in a way its API cannot currently support. It'd be nice to design this in a more abstract sense to better accommodate additional extensions in the future. Wildcards are definitely useful, but your changes are introducing dependencies on types found in your query language. I'd rather design some interface that your implement. Generics may also be suitable, but those tend to make things complicated.

Wildcards are useful just on their own, even without the surrounding query language. Getting one PR done with the extensions to the assembler would be a good milestone and clearly add a valuable feature. The query language you've designed is a much larger feature. Discussing all the design and code around it may take a good bit more time. I wouldn't want the wildcard feature to get held up because it's in the same PR as the query language.

Thus, I've neglected to comment much further on anything in your querylanguage package.

This sounds perfectly reasonable. I think it leads us to focus on the question of what exactly the extended assembler API should look like. As it stands I had to pull in the querylanguage to have the non-string InstructionNode as input to resolveLine. Do you have an idea for what type of input you'd like to have to this function that's more flexible than simply a string? Or, do you want to keep it a string and do the parsing out of special wildcard syntax within the processing of resolveLine?

I'm thinking about standardizing the idea of a wildcard using your proposed syntax, i.e., the backticked label. So long as no processor language spec needs that character, we should be good. There's some nuance I suppose you've already run into, e.g., we probably shouldn't allow wildcarding of opcode mnemonics. Within the "wildcard," the client API can specify whatever language it wants. The assembler's only job will be to assemble and capture operand info within the wildcarded sub-tree.

Regarding whether to allow wildcarding of non-terminals (sleigh sub-tables), I think it's doable using the AssemblyHiddenConstructStateGenerator. That said, I wouldn't recommend it. For some background the purpose of that class is to handle cases where a processor spec invokes a sub-table without displaying its print piece in the mnemonics: e.g., something like:

:MYOP opn1, opn2 is check_ctx opn1 opn2 [] {}

My syntax may be off a bit, but this happens commonly when a language needs to encapsulate some context pre-checks and/or transformations into a sub constructor. it'll invoke the constructor but not print it. Thus, the assembler has nothing to go on, so it just generates all the possibilities hoping it can solve one or more of them. Thus, it's capable of generating all sub-trees for wildcarded NTs.

The reason I don't recommend it is it's very slow, depending on what get's wildcarded. We actually had to patch up a couple of our language specs to avoid this situation or at least shrink the size of the sub-tree. A single instruction could easily take tens of seconds :( . If wildcarding only terminals suffices, we might just stick with it for now. Thoughts?

Regarding other options to pass to resolveLine, you might look into resolveTree. The idea there is you've already done the (potentially customized) parsing and are ready to process the semantics, i.e., generate the machine code.

Another thought: Your insight that operand info is in essence the same as backfill actually inspires another possible use case. A script could assemble a template, leaving say an address wildcarded. Then, to use that template to generate an instruction, only the operand needs to be filled in each time, instead of re-parsing and resolving the full tree each time. If I'm not mistaken, proving out that use case in our base assembler would provide all the features necessary for Pickled Canary to use it as a client without further extension.

So long as no processor language spec needs that character, we should be good.

I haven't done an exhaustive search of public processor language specs to see if this is the case, but I don't think it's a common character in assembly syntaxes. That being said, we could leave an escape hatch along the lines of python's %% means a single % in their format strings, so in our case perhaps two consecutive backticks reduces to a plain-text backtick.

Regarding whether to allow wildcarding of non-terminals (sleigh sub-tables)

I wonder if we can leave ourselves a path towards allowing this to be opt-in in the future though. I can see it being super useful for a number of use cases, even if it's slow (and if it's important enough, perhaps it'll be optimized in the future). The question is how to do this. If you're leaning towards only a "backticked label" then how would we (in the future) differentiate the two options?

you might look into resolveTree.

I'll take a look.

Another thought: Your ...

Sounds like that idea could certainly speed up some repeated-assembly use cases. I'd have to think about your last sentence some more to know 100%. The information we'd get back from the assembler in that case would have to include the mask(s) and expression to compute the possible fill-in (or read-out-of-the-bits) values.

Keeping a path open for wildcarded NTs sounds reasonable. As for distinguishing when to enable it, perhaps some prefix in the label itself, e.g., regular wildcard would be `Q1` and a sub-tree (NT) wildcard could be `tree!Q1`. I'm open to alternatives, if you have an idea.

Yes, I agree about keeping the masks and expressions, tagged with whatever the label was. All retrievable from the root AssemblyResult. I think we're pretty close to an actual spec for assembler changes? Once we agree on one, I'll enter the internal tickets.

There's a bit of information on Pickled Canary's current wildcard syntax here: https://github.com/mitre/pickled-canary#wildcard-format The short of it that something like LEA RBP,[`Q1/R..` + 0x60] will allow Q1 to only be populated with three-letter registers starting with "R". Another example (taken from here: https://github.com/mitre/pickled-canary/blob/main/example_patterns/CVE-2015-3145/mips_le_CVE-2015-3145_patched.ptn) is lb `Q1/v.`,0x0(`Q2/s.`) which again constrains each of the loads to be only from a specific subset of registers. [There are some other more complicated examples, but I don't have them here with me at the moment, and they may be hard to share.]

In some instructions not enumerating all the other options saves a lot of time (especially if there are multiple wildcards in the instruction which exponentially grow the number of possible completions).

if we wish to avoid re-building the parser for different use cases

Would the callback approach avoid this. The parser just knows to outsource the backtick parsing?

If I'm following what you're saying, the generator side may also have a new interface that'd allow for things like filtering, but the question is how to get that information from the parsing side to the generating side. If the parsing side can return an object conforming to an interface which is later passed to the user-supplied (or perhaps default, depending on the use case) generator "callback/plugin" (for lack of a better term) the generator side could do whatever it'd like with the parsed-out metadata

Outsourcing the parsing to a callback I think would only work for terminal-only wildcards. I almost certain such an interface would close off the idea of non-terminal wildcards later. The callback would be consuming the token stream without regard to how it affects the parser's internal state. This might even affect the terminal-only case, because I don't have a clear separation between lexer and parser (because of things like J^cc representing a single mnemonic "token"). I'm not totally against it, I just don't want to hand you a foot gun.

I already see your AssemblyParseMachineWild thing.... The original AssemblyParseMachine was a complicated bit of code, and I think a better interface would not require an extension to replicate or understand it. While it still requires some expertise, I think grammar generation is a more suitable place for extensibility. (Maybe rebuilding the parser it is unavoidable for these extended cases.) For example, if the SLEIGH file specifies a production I => "ADD" RegOp MemOp, and your extension decides MemOp can be wildcarded, then you could add the production MemOp => WildCard. This more naturally specifies the idea that any memory operand (no matter how many tokens are involved in it) can be replaced by a single wildcard token, and the parser generator will do all the right things to ensure neither Noam nor Alan are upset.

I might try my own hand at this. 1) To try to better communicate my ideas concretely. 2) To get an appreciation for the troubles you've been having. I'm thinking something like extending and overriding SleighAssemblerBuilder#buildSubGrammar(SubtableSymbol). (Perhaps I'll factor a method that might allow me to hook right before the subgrammar.addProduction() line.) I can then insert any class that extends AssemblySymbol into the grammar, which should provide me most of the control I need in the subsequent steps. I'll have to see where things go from there....

I was able to scrap something together with some success, with not too much damage to the existing code. Most of my changes (aside from completely new files) are allowing overrides to inject custom classes. I'll see about pushing a branch out so that you can see what I've done before it's merged, and we can talk. I haven't worked out how to collect the operand info yet, but I've noticed that the AssemblyResolvedPatterns object has children. They just aren't accessible. I suspect we could use that mechanism to recover the info you need.

Very cool! Looking forward to checking your changes out!

Sorry for all the trouble. I've created a private fork and shared the branch with you in it. Let's continue all this over there.

nsadeveloper789 · 2023-10-10T18:47:04Z

gradle/support/ip.gradle

@@ -135,6 +135,7 @@ def Map<String, List<String>> getIpForModule(Project p) {
 		exclude "**/data/build.xml" // language build file (generated for dev only)
 		exclude "**/.vs/**"
 		exclude "**/*.vcxproj.user"
+		exclude "**/assembler/Docs/**"


This is going to raise more than my eyebrows on our team. Of course, this goes back to whether or not we should include that documentation in the repo. We've actually started some long term discussions about how to consolidate and normalize documentation like this among our team. We have stuff in Docbook, Markdown, HTML, LaTeX, beamer, .... The documentation you propose to include would introduce yet another. It's probably best not to poke that bear for this PR.

Undone for now. :-)

nsadeveloper789 · 2023-10-10T19:00:58Z

...rk/SoftwareModeling/src/main/java/ghidra/app/plugin/assembler/sleigh/util/TableEntryKey.java

@@ -37,7 +38,12 @@ public class TableEntryKey implements Comparable<TableEntryKey> {
 	 */
 	public TableEntryKey(int state, AssemblySymbol sym) {
 		this.state = state;
-		this.sym = sym;
+		if (sym instanceof AssemblyWildcardTerminal) {


In Java 17, you can do sym instanceof AssemblyWildcardTerminal awt, and skip the cast on the next line.

nsadeveloper789 · 2023-10-10T19:03:19Z

...ework/SoftwareModeling/src/main/java/ghidra/app/plugin/assembler/sleigh/SleighAssembler.java

@@ -24,6 +24,7 @@
 import ghidra.app.plugin.assembler.sleigh.sem.*;
 import ghidra.app.plugin.assembler.sleigh.symbol.AssemblyNumericSymbols;
 import ghidra.app.plugin.assembler.sleigh.util.DbgTimer;
+import ghidra.app.plugin.querylanguage.lexer.ast.InstructionNode;


I'd rather not see a dependency from assembler to your query language. We should try to design an interface instead.

…e rather than depending on querylanguage.

jobermayr · 2024-01-10T20:31:59Z

@plucia-mitre Is it worth to rebase on e7458ed?

plucia-mitre · 2024-01-11T00:35:52Z

@jobermayr yes, absolutely, this work is already in progress! Thanks for the ping! :-)

plucia-mitre · 2024-01-15T18:02:16Z

Closing this request in favor of #6118 which builds off of e7458ed towards the same goal.

Initial nearly-minimal changes to assembler for Pickled Canary

abb2a86

This starts to address issue NationalSecurityAgency#5800.

ryanmkurtz assigned nsadeveloper789 Sep 20, 2023

ryanmkurtz added the Status: Triage Information is being gathered label Sep 20, 2023

nsadeveloper789 reviewed Oct 10, 2023

View reviewed changes

plucia-mitre added 2 commits October 18, 2023 15:06

Remove developer docs from this branch

5f8c474

First naive pass at refactoring to a "WildcardedInstruction" interfac…

5a1088f

…e rather than depending on querylanguage.

plucia-mitre mentioned this pull request Jan 15, 2024

Add a WildcardAssembler Plugin. #6118

Closed

plucia-mitre closed this Jan 15, 2024

ryanmkurtz added Reason: OBE This has been "overcome by events." and removed Status: Triage Information is being gathered labels Jan 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Initial nearly-minimal changes to assembler for Pickled Canary #5801

Initial nearly-minimal changes to assembler for Pickled Canary #5801

plucia-mitre commented Sep 18, 2023

nsadeveloper789 left a comment

nsadeveloper789 Oct 10, 2023

plucia-mitre Oct 18, 2023

nsadeveloper789 Oct 10, 2023

plucia-mitre Oct 18, 2023

nsadeveloper789 Oct 10, 2023

plucia-mitre Oct 18, 2023

nsadeveloper789 Oct 10, 2023

plucia-mitre Oct 18, 2023

nsadeveloper789 Oct 10, 2023

plucia-mitre Oct 18, 2023

nsadeveloper789 Oct 10, 2023

nsadeveloper789 Oct 10, 2023

plucia-mitre Oct 19, 2023

nsadeveloper789 Oct 20, 2023

plucia-mitre Oct 20, 2023

nsadeveloper789 Nov 3, 2023

plucia-mitre Nov 3, 2023 •

edited

Loading

nsadeveloper789 Nov 3, 2023 •

edited

Loading

nsadeveloper789 Nov 3, 2023

plucia-mitre Nov 6, 2023

nsadeveloper789 Nov 6, 2023

nsadeveloper789 Oct 10, 2023

plucia-mitre Oct 18, 2023

nsadeveloper789 Oct 10, 2023

nsadeveloper789 Oct 10, 2023

jobermayr commented Jan 10, 2024

plucia-mitre commented Jan 11, 2024

plucia-mitre commented Jan 15, 2024

Initial nearly-minimal changes to assembler for Pickled Canary #5801

Initial nearly-minimal changes to assembler for Pickled Canary #5801

Conversation

plucia-mitre commented Sep 18, 2023

nsadeveloper789 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

plucia-mitre Nov 3, 2023 • edited Loading

Choose a reason for hiding this comment

nsadeveloper789 Nov 3, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jobermayr commented Jan 10, 2024

plucia-mitre commented Jan 11, 2024

plucia-mitre commented Jan 15, 2024

plucia-mitre Nov 3, 2023 •

edited

Loading

nsadeveloper789 Nov 3, 2023 •

edited

Loading