Commit 98fc3c8

Merge pull request #3417 from trailofbits/docs/extending-polyfile
Documentation Updates
2 parents dde34a4 + 5f88d21 commit 98fc3c8

5 files changed

+222
-184
lines changed

README.md

Lines changed: 9 additions & 181 deletions
@@ -8,9 +8,10 @@
88
[![Tests](https://github.com/trailofbits/polyfile/workflows/Tests/badge.svg)](https://github.com/trailofbits/polyfile/actions)
99
[![Slack Status](https://slack.empirehacking.nyc/badge.svg)](https://slack.empirehacking.nyc)
1010

11-
A utility to identify and map the semantic structure of files,
12-
including polyglots, chimeras, and schizophrenic files. It can be used
13-
in conjunction with its sister tool
11+
A utility to identify and map the semantic and syntactic structure of files,
12+
including polyglots, chimeras, and schizophrenic files. It has [a pure-Python implementation of libmagic](#file-support) and can act as a drop-in replacement for the [`file` command](https://github.com/file/file). However, unlike `file`, PolyFile can recursively identify embedded files, like [binwalk](https://github.com/ReFirmLabs/binwalk).
13+
14+
PolyFile can be used in conjunction with its sister tool
1415
[PolyTracker](https://github.com/trailofbits/polytracker) for
1516
_Automated Lexical Annotation and Navigation of Parsers_, a backronym
1617
devised solely for the purpose of collectively referring to the tools
@@ -55,87 +56,7 @@ Found a file of type application/java-archive at byte offset 0
5556
Saved HTML output to output.html
5657
```
5758

58-
Full usage instructions follow:
59-
```
60-
usage: polyfile [-h] [--format {file,mime,html,json,sbud}] [--output OUTPUT]
61-
[--filetype FILETYPE] [--list] [--html HTML] [--explain]
62-
[--only-match-mime] [--only-match] [--require-match]
63-
[--max-matches MAX_MATCHES] [--debugger]
64-
[--eval-command EVAL_COMMAND] [--no-debug-python]
65-
[--quiet | --debug | --trace] [--version] [-dumpversion]
66-
[FILE]
67-
68-
A utility to recursively map the structure of a file.
69-
70-
positional arguments:
71-
FILE the file to analyze; pass '-' or omit to read from STDIN
72-
73-
options:
74-
-h, --help show this help message and exit
75-
--format {file,mime,html,json,sbud}, -r {file,mime,html,json,sbud}
76-
PolyFile's output format
77-
78-
Output formats are:
79-
file ...... the detected formats associated with the file,
80-
like the output of the `file` command
81-
mime ...... the detected MIME types associated with the file,
82-
like the output of the `file --mime-type` command
83-
explain ... like 'mime', but adds a human-readable explanation
84-
for why each MIME type matched
85-
html ...... an interactive HTML-based hex viewer
86-
json ...... a modified version of the SBUD format in JSON syntax
87-
sbud ...... equivalent to 'json'
88-
89-
Multiple formats can be output at once:
90-
91-
polyfile INPUT_FILE -f mime -f json
92-
93-
Their output will be concatenated to STDOUT in the order that
94-
they occur in the arguments.
95-
96-
To save each format to a separate file, see the `--output` argument.
97-
98-
If no format is specified, PolyFile defaults to `--format file`
99-
--output OUTPUT, -o OUTPUT
100-
an optional output path for `--format`
101-
102-
Each instance of `--output` applies to the previous instance
103-
of the `--format` option.
104-
105-
For example:
106-
107-
polyfile INPUT_FILE --format html --output output.html \
108-
--format sbud --output output.json
109-
110-
will save HTML to to `output.html` and SBUD to `output.json`.
111-
No two outputs can be directed at the same file path.
112-
113-
The path can be '-' for STDOUT.
114-
If an `--output` is omitted for a format,
115-
then it will implicitly be printed to STDOUT.
116-
--filetype FILETYPE, -f FILETYPE
117-
explicitly match against the given filetype or filetype wildcard (default is to match against all filetypes)
118-
--list, -l list the supported filetypes for the `--filetype` argument and exit
119-
--html HTML, -t HTML path to write an interactive HTML file for exploring the PDF;
120-
equivalent to `--format html --output HTML`
121-
--explain equivalent to `--format explain
122-
--only-match-mime, -I
123-
"just print out the matching MIME types for the file, one on each line;
124-
equivalent to `--format mime`
125-
--only-match, -m do not attempt to parse known filetypes; only match against file magic
126-
--require-match if no matches are found, exit with code 127
127-
--max-matches MAX_MATCHES
128-
stop scanning after having found this many matches
129-
--debugger, -db drop into an interactive debugger for libmagic file definition matching and PolyFile parsing
130-
--eval-command EVAL_COMMAND, -ex EVAL_COMMAND
131-
execute the given debugger command
132-
--no-debug-python by default, the `--debugger` option will break on custom matchers and prompt to debug using PDB. This option will suppress those prompts.
133-
--quiet, -q suppress all log output
134-
--debug, -d print debug information
135-
--trace, -dd print extra verbose debug information
136-
--version, -v print PolyFile's version information to STDERR
137-
-dumpversion print PolyFile's raw version information to STDOUT and exit
138-
```
59+
Run `polyfile --help` for full usage instructions.
13960

14061
### Interactive Debugger
14162

@@ -145,8 +66,7 @@ You can run PolyFile with the debugger enabled using the `-db` option.
14566

14667
### File Support
14768

148-
PolyFile has a cleanroom, [pure Python implementation of the libmagic file classifier](#libmagic-implementation), and
149-
supports all 263 MIME types that it can identify.
69+
PolyFile has a cleanroom, [pure Python implementation of the libmagic file classifier](#libmagic-implementation), and supports all 263 MIME types that it can identify.
15070

15171
It currently has support for parsing and semantically mapping the following formats:
15272
* PDF, using an instrumented version of [Didier Stevens' public domain, permissive, forensic parser](https://blog.didierstevens.com/programs/pdf-tools/)
@@ -169,7 +89,7 @@ TrID matching code is still shipped with PolyFile and can be invoked programmati
16989

17090
PolyFile has several options for outputting its results, specified by its `--format` option. For computer-readable output, PolyFile has an extension of the [SBuD](https://github.com/corkami/sbud) JSON format described [in the documentation](docs/json_format.md). Prior to version 0.5.0 this was the default output format of PolyFile. However, now the default output format is to mimic the behavior of the `file` command. To maintain the original behavior, use the `--format sbud` option.
17191

172-
### libMagic Implementation
92+
### libmagic Implementation
17393

17494
PolyFile has a cleanroom implementation of [libmagic (used in the `file` command)](https://github.com/file/file).
17595
It can be invoked programmatically by running:
@@ -192,101 +112,9 @@ with open("file_to_test", "rb") as f:
192112
...
193113
```
194114

195-
### Debugging the libmagic DSL
196-
`libmagic` has an esoteric, poorly documented domain-specific language (DSL) for specifying its matching signatures.
197-
You can read the minimal and—as we have discovered in our cleanroom implementation—_incomplete_ documentation by running
198-
`man 5 magic`. PolyFile implements an interactive debugger for stepping through the DSL specifications, modeled after
199-
GDB. You can enter this debugger by passing the `--debugger` or `-db` argument to PolyFile. It is useful for both
200-
implementing new `libmagic` DSLs, as well as figuring out why an existing DSL fails to match against a given file.
201-
```console
202-
$ polyfile -db input_file
203-
PolyFile 0.3.5
204-
Copyright ©2021 Trail of Bits
205-
Apache License Version 2.0 https://www.apache.org/licenses/
206-
207-
For help, type "help".
208-
(polyfile) help
209-
help ....... print this message
210-
continue ... continue execution until the next breakpoint is hit
211-
step ....... step through a single magic test
212-
next ....... continue execution until the next test that matches
213-
where ...... print the context of the current magic test (aliases: info stack and backtrace)
214-
test ....... test the following libmagic DSL test at the current position
215-
print ...... print the computed absolute offset of the following libmagic DSL offset
216-
breakpoint . list the current breakpoints or add a new one
217-
delete ..... delete a breakpoint
218-
quit ....... exit the debugger
219-
```
220-
221-
## Merging Output From PolyTracker
222-
223-
[PolyTracker](https://github.com/trailofbits/polytracker) is PolyFile’s sister utility for automatically instrumenting
224-
a parser to track the input byte offsets operated on by each function. The output of both tools can be merged to
225-
automatically label the semantic purpose of the functions in a parser. For example, given an instrumented black-box
226-
binary, we can quickly determine which functions in the program are responsible for parsing which parts of the input
227-
file format’s grammar. This is an area of active research intended to achieve fully automated grammar extraction from a
228-
parser.
229-
230-
A separate utility called `polymerge` is installed with PolyFile specifically designed to merge the output of both
231-
tools.
232-
233-
```
234-
usage: polymerge [-h] [--cfg CFG] [--cfg-pdf CFG_PDF]
235-
[--dataflow [DATAFLOW ...]] [--no-intermediate-functions]
236-
[--demangle] [--type-hierarchy TYPE_HIERARCHY]
237-
[--type-hierarchy-pdf TYPE_HIERARCHY_PDF] [--diff [DIFF ...]]
238-
[--debug] [--quiet] [--version] [-dumpversion]
239-
FILES [FILES ...]
240-
241-
A utility to merge the JSON output of `polyfile`
242-
with a polytracker.json file from PolyTracker.
243-
244-
https://github.com/trailofbits/polyfile/
245-
https://github.com/trailofbits/polytracker/
246-
247-
positional arguments:
248-
FILES Path to the PolyFile JSON output and/or the PolyTracker JSON output. Merging will only occur if both files are provided. The `--cfg` and `--type-hierarchy` options can be used if only a single file is provided, but no merging will occur.
249-
250-
optional arguments:
251-
-h, --help show this help message and exit
252-
--cfg CFG, -c CFG Optional path to output a Graphviz .dot file representing the control flow graph of the program trace
253-
--cfg-pdf CFG_PDF, -p CFG_PDF
254-
Similar to --cfg, but renders the graph to a PDF instead of outputting the .dot source
255-
--dataflow [DATAFLOW ...]
256-
For the CFG generation options, only render functions that participated in dataflow. `--dataflow 10` means that only functions in the dataflow related to byte 10 should be included. `--dataflow 10:30` means that only functions operating on bytes 10 through 29 should be included. The beginning or end of a range can be omitted and will default to the beginning and end of the file, respectively. Multiple `--dataflow` ranges can be specified. `--dataflow :` will render the CFG only with functions that operated on tainted bytes. Omitting `--dataflow` will produce a CFG containing all functions.
257-
--no-intermediate-functions
258-
To be used in conjunction with `--dataflow`. If enabled, only functions in the dataflow graph if they operated on the tainted bytes. This can result in a disjoint dataflow graph.
259-
--demangle Demangle C++ function names in the CFG (requires that PolyFile was installed with the `demangle` option, or that the `cxxfilt` Python module is installed.)
260-
--type-hierarchy TYPE_HIERARCHY, -t TYPE_HIERARCHY
261-
Optional path to output a Graphviz .dot file representing the type hierarchy extracted from PolyFile
262-
--type-hierarchy-pdf TYPE_HIERARCHY_PDF, -y TYPE_HIERARCHY_PDF
263-
Similar to --type-hierarchy, but renders the graph to a PDF instead of outputting the .dot source
264-
--diff [DIFF ...] Diff an arbitrary number of input polytracker.json files, all treated as the same class, against one or more polytracker.json provided after `--diff` arguments
265-
--debug, -d Print debug information
266-
--quiet, -q Suppress all log output (overrides --debug)
267-
--version, -v Print PolyMerge's version information and exit
268-
-dumpversion Print PolyMerge's raw version information and exit
269-
```
270-
271-
The output of `polymerge` is the same as [PolyFile’s output format](docs/json_format.md), augmented with the following:
272-
1. For each semantic label in the hierarchy, a list of…
273-
1. …functions that operated on bytes tainted with that label; and
274-
2. …functions whose control flow was influenced by bytes tainted with that label.
275-
2. For each type within the semantic hierarchy, a list of functions that are “most specialized” in processing that type.
276-
This process is described in the next section.
277-
278-
`polymerge` can also optionally emit a Graphviz `.dot` file or rendered PDF of the runtime control-flow graph recorded
279-
by PolyTracker.
280-
281-
### Identifying Function Specializations
115+
## Extending PolyFile
282116

283-
As mentioned above, `polymerge` attempts to match each semantic type of the input file to a set of functions that are
284-
“most specialized” in operating on that type. This is an active area of academic research and is likely to change in
285-
the future, but here is the current method employed by `polymerge`:
286-
1. For each semantic type in the input file, collect the functions that operated on bytes from that type;
287-
2. For each function, calculate the Shannon entropy of the different types on which that function operated;
288-
3. Sort the functions by entropy, and select the functions in the smallest standard deviation; and
289-
4. Keep the functions that are shallowest in the dominator tree of the runtime control-flow graph.
117+
Instructions for extending PolyFile to support more file formats with new matchers and parsers are described [in the documentation](docs/extending_polyfile.md).
290118

291119
## License and Acknowledgements
292120
