-A utility to identify and map the semantic structure of files,
-including polyglots, chimeras, and schizophrenic files. It can be used
-in conjunction with its sister tool
+A utility to identify and map the semantic and syntactic structure of files,
+including polyglots, chimeras, and schizophrenic files. It has [a pure-Python implementation of libmagic](#file-support) and can act as a drop-in replacement for the [`file` command](https://github.com/file/file). However, unlike `file`, PolyFile can recursively identify embedded files, like [binwalk](https://github.com/ReFirmLabs/binwalk).
+
+PolyFile can be used in conjunction with its sister tool
 [PolyTracker](https://github.com/trailofbits/polytracker) for
 _Automated Lexical Annotation and Navigation of Parsers_, a backronym
 devised solely for the purpose of collectively referring to the tools
@@ -55,87 +56,7 @@ Found a file of type application/java-archive at byte offset 0
-                        file ...... the detected formats associated with the file,
-                                    like the output of the `file` command
-                        mime ...... the detected MIME types associated with the file,
-                                    like the output of the `file --mime-type` command
-                        explain ... like 'mime', but adds a human-readable explanation
-                                    for why each MIME type matched
-                        html ...... an interactive HTML-based hex viewer
-                        json ...... a modified version of the SBUD format in JSON syntax
-                        sbud ...... equivalent to 'json'
-
-                        Multiple formats can be output at once:
-
-                            polyfile INPUT_FILE -f mime -f json
-
-                        Their output will be concatenated to STDOUT in the order that
-                        they occur in the arguments.
-
-                        To save each format to a separate file, see the `--output` argument.
-
-                        If no format is specified, PolyFile defaults to `--format file`
-  --output OUTPUT, -o OUTPUT
-                        an optional output path for `--format`
-
-                        Each instance of `--output` applies to the previous instance
-                        of the `--format` option.
-
-                        For example:
-
-                            polyfile INPUT_FILE --format html --output output.html \
-                                                --format sbud --output output.json
-
-                        will save HTML to `output.html` and SBUD to `output.json`.
-                        No two outputs can be directed at the same file path.
-
-                        The path can be '-' for STDOUT.
-                        If an `--output` is omitted for a format,
-                        then it will implicitly be printed to STDOUT.
-  --filetype FILETYPE, -f FILETYPE
-                        explicitly match against the given filetype or filetype wildcard (default is to match against all filetypes)
-  --list, -l            list the supported filetypes for the `--filetype` argument and exit
-  --html HTML, -t HTML  path to write an interactive HTML file for exploring the PDF;
-                        equivalent to `--format html --output HTML`
-  --explain             equivalent to `--format explain`
-  --only-match-mime, -I
-                        just print out the matching MIME types for the file, one on each line;
-                        equivalent to `--format mime`
-  --only-match, -m      do not attempt to parse known filetypes; only match against file magic
-  --require-match       if no matches are found, exit with code 127
-  --max-matches MAX_MATCHES
-                        stop scanning after having found this many matches
-  --debugger, -db       drop into an interactive debugger for libmagic file definition matching and PolyFile parsing
-  --eval-command EVAL_COMMAND, -ex EVAL_COMMAND
-                        execute the given debugger command
-  --no-debug-python     by default, the `--debugger` option will break on custom matchers and prompt to debug using PDB. This option will suppress those prompts.
-  --quiet, -q           suppress all log output
-  --debug, -d           print debug information
-  --trace, -dd          print extra verbose debug information
-  --version, -v         print PolyFile's version information to STDERR
-  -dumpversion          print PolyFile's raw version information to STDOUT and exit
-```
+Run `polyfile --help` for full usage instructions.

 ### Interactive Debugger

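
The pairing rule in the help text above (each `--output` applies to the `--format` before it, and a format without an `--output` goes to STDOUT) can be sketched as a small Python helper that assembles an argument list. This is only an illustration: the helper name `build_polyfile_args` is hypothetical and not part of PolyFile, and the only flags it uses are `--format` and `--output` as quoted in the help text.

```python
from typing import List, Optional, Sequence, Tuple


def build_polyfile_args(
    input_file: str,
    outputs: Sequence[Tuple[str, Optional[str]]],
) -> List[str]:
    """Assemble a `polyfile` command line (hypothetical helper).

    `outputs` is a sequence of (format, path) pairs. A path of None omits
    `--output`, so that format is implicitly printed to STDOUT.
    """
    args = ["polyfile", input_file]
    seen_paths = set()
    for fmt, path in outputs:
        args += ["--format", fmt]
        if path is None:
            continue
        # The help text says no two outputs can share a file path.
        # Assumption: '-' (STDOUT) is exempt, since STDOUT output is concatenated.
        if path != "-" and path in seen_paths:
            raise ValueError(f"duplicate output path: {path}")
        seen_paths.add(path)
        # Each --output applies to the --format immediately before it.
        args += ["--output", path]
    return args


if __name__ == "__main__":
    print(build_polyfile_args(
        "INPUT_FILE",
        [("html", "output.html"), ("sbud", "output.json")],
    ))
```

The resulting list can be handed to `subprocess.run` on a machine where `polyfile` is installed; building it as a list rather than a shell string avoids quoting problems with unusual file names.
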
@@ -145,8 +66,7 @@ You can run PolyFile with the debugger enabled using the `-db` option.

 ### File Support

-PolyFile has a cleanroom, [pure Python implementation of the libmagic file classifier](#libmagic-implementation), and
-supports all 263 MIME types that it can identify.
+PolyFile has a cleanroom, [pure Python implementation of the libmagic file classifier](#libmagic-implementation), and supports all 263 MIME types that it can identify.

 It currently has support for parsing and semantically mapping the following formats:
 * PDF, using an instrumented version of [Didier Stevens' public domain, permissive, forensic parser](https://blog.didierstevens.com/programs/pdf-tools/)
@@ -169,7 +89,7 @@ TrID matching code is still shipped with PolyFile and can be invoked programmati

 PolyFile has several options for outputting its results, specified by its `--format` option. For computer-readable output, PolyFile has an extension of the [SBuD](https://github.com/corkami/sbud) JSON format described [in the documentation](docs/json_format.md). Prior to version 0.5.0 this was the default output format of PolyFile. However, now the default output format is to mimic the behavior of the `file` command. To maintain the original behavior, use the `--format sbud` option.

-### libMagic Implementation
+### libmagic Implementation

 PolyFile has a cleanroom implementation of [libmagic (used in the `file` command)](https://github.com/file/file).
 It can be invoked programmatically by running:
@@ -192,101 +112,9 @@ with open("file_to_test", "rb") as f:
     ...
 ```

-### Debugging the libmagic DSL
-`libmagic` has an esoteric, poorly documented domain-specific language (DSL) for specifying its matching signatures.
-You can read the minimal and, as we have discovered in our cleanroom implementation, _incomplete_ documentation by running
-`man 5 magic`. PolyFile implements an interactive debugger for stepping through the DSL specifications, modeled after
-GDB. You can enter this debugger by passing the `--debugger` or `-db` argument to PolyFile. It is useful for both
-implementing new `libmagic` DSLs, as well as figuring out why an existing DSL fails to match against a given file.
-  FILES                 Path to the PolyFile JSON output and/or the PolyTracker JSON output. Merging will only occur if both files are provided. The `--cfg` and `--type-hierarchy` options can be used if only a single file is provided, but no merging will occur.
-
-optional arguments:
-  -h, --help            show this help message and exit
-  --cfg CFG, -c CFG     Optional path to output a Graphviz .dot file representing the control flow graph of the program trace
-  --cfg-pdf CFG_PDF, -p CFG_PDF
-                        Similar to --cfg, but renders the graph to a PDF instead of outputting the .dot source
-  --dataflow [DATAFLOW ...]
-                        For the CFG generation options, only render functions that participated in dataflow. `--dataflow 10` means that only functions in the dataflow related to byte 10 should be included. `--dataflow 10:30` means that only functions operating on bytes 10 through 29 should be included. The beginning or end of a range can be omitted and will default to the beginning and end of the file, respectively. Multiple `--dataflow` ranges can be specified. `--dataflow :` will render the CFG only with functions that operated on tainted bytes. Omitting `--dataflow` will produce a CFG containing all functions.
-  --no-intermediate-functions
-                        To be used in conjunction with `--dataflow`. If enabled, only include functions in the dataflow graph if they operated on the tainted bytes. This can result in a disjoint dataflow graph.
-  --demangle            Demangle C++ function names in the CFG (requires that PolyFile was installed with the `demangle` option, or that the `cxxfilt` Python module is installed.)
-                        Similar to --type-hierarchy, but renders the graph to a PDF instead of outputting the .dot source
-  --diff [DIFF ...]     Diff an arbitrary number of input polytracker.json files, all treated as the same class, against one or more polytracker.json provided after `--diff` arguments
-  --debug, -d           Print debug information
-  --quiet, -q           Suppress all log output (overrides --debug)
-  --version, -v         Print PolyMerge's version information and exit
-  -dumpversion          Print PolyMerge's raw version information and exit
-```
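
The `--dataflow` range syntax described in the help text above (`10`, `10:30`, open-ended ranges, and a bare `:`) can be illustrated with a short parser. This is a sketch of the documented semantics only, not `polymerge`'s actual code; the function name `parse_dataflow_range` and the `file_size` parameter are hypothetical.

```python
from typing import Tuple


def parse_dataflow_range(spec: str, file_size: int) -> Tuple[int, int]:
    """Turn one `--dataflow` value into a half-open (start, end) byte range.

    Hypothetical helper that mirrors the help text:
      "10"    -> byte 10 only
      "10:30" -> bytes 10 through 29
      "10:"   -> byte 10 to the end of the file
      ":30"   -> start of the file through byte 29
      ":"     -> the whole file
    """
    if ":" not in spec:
        # A bare offset selects a single byte.
        offset = int(spec)
        return offset, offset + 1
    start_text, end_text = spec.split(":", 1)
    # Omitted bounds default to the beginning and end of the file.
    start = int(start_text) if start_text else 0
    end = int(end_text) if end_text else file_size
    if start > end:
        raise ValueError(f"range start {start} is after range end {end}")
    return start, end


if __name__ == "__main__":
    for spec in ("10", "10:30", "10:", ":30", ":"):
        print(spec, parse_dataflow_range(spec, file_size=100))
```

Using half-open ranges keeps the `10:30` case consistent with the help text, where the end offset itself (byte 30) is excluded.
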
-
-The output of `polymerge` is the same as [PolyFile’s output format](docs/json_format.md), augmented with the following:
-1. For each semantic label in the hierarchy, a list of…
-    1. …functions that operated on bytes tainted with that label; and
-    2. …functions whose control flow was influenced by bytes tainted with that label.
-2. For each type within the semantic hierarchy, a list of functions that are “most specialized” in processing that type.
-   This process is described in the next section.
-
-`polymerge` can also optionally emit a Graphviz `.dot` file or rendered PDF of the runtime control-flow graph recorded
-by PolyTracker.
-
-### Identifying Function Specializations
+## Extending PolyFile

-As mentioned above, `polymerge` attempts to match each semantic type of the input file to a set of functions that are
-“most specialized” in operating on that type. This is an active area of academic research and is likely to change in
-the future, but here is the current method employed by `polymerge`:
-1. For each semantic type in the input file, collect the functions that operated on bytes from that type;
-2. For each function, calculate the Shannon entropy of the different types on which that function operated;
-3. Sort the functions by entropy, and select the functions in the smallest standard deviation; and
-4. Keep the functions that are shallowest in the dominator tree of the runtime control-flow graph.
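
Steps 2 and 3 of the method above can be sketched in Python as follows. This is an illustrative reading, not `polymerge`'s implementation: the function names and the input shape (function name mapped to per-type byte counts) are hypothetical, and "smallest standard deviation" is interpreted here as "within one population standard deviation of the lowest entropy", which is an assumption. Step 4 (the dominator-tree filter) is not covered.

```python
import math
from statistics import pstdev
from typing import List, Mapping


def shannon_entropy(type_counts: Mapping[str, int]) -> float:
    """Shannon entropy (in bits) of the distribution of types a function touched."""
    total = sum(type_counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in type_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


def most_specialized(touched: Mapping[str, Mapping[str, int]]) -> List[str]:
    """Sort functions by entropy and keep the lowest-entropy group.

    `touched` maps a function name to {semantic type: bytes operated on}.
    Assumption: the cutoff is the minimum entropy plus one population
    standard deviation of all entropies.
    """
    entropies = {func: shannon_entropy(counts) for func, counts in touched.items()}
    if not entropies:
        return []
    cutoff = min(entropies.values()) + pstdev(entropies.values())
    # Sort by entropy, breaking ties by name so the result is deterministic.
    ranked = sorted(entropies, key=lambda func: (entropies[func], func))
    return [func for func in ranked if entropies[func] <= cutoff]


if __name__ == "__main__":
    example = {
        "parse_header": {"Header": 8},
        "parse_any": {"Header": 4, "Body": 4},
        "copy_bytes": {"Header": 2, "Body": 2, "Trailer": 2, "Xref": 2},
    }
    print(most_specialized(example))
```

A function that only ever touches one type has zero entropy, so it ranks as the most specialized; a general-purpose routine that touches every type evenly has the highest entropy and is filtered out.
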
+Instructions on extending PolyFile to support more file formats with new matchers and parsers are described [in the documentation](docs/extending_polyfile.md).