Commit 9c2d20b

Merge pull request #3366 from trailofbits/matching-refactor
Matching Refactor
2 parents 7be7409 + 6342553 commit 9c2d20b

28 files changed, +4156 -3416 lines changed

.github/workflows/tests.yml

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: [3.6, 3.7, 3.8, 3.9]
+        python-version: [3.7, 3.8, 3.9, "3.10"]
 
     steps:
     - uses: actions/checkout@v2
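
A note on the quoting in this hunk: the new entry is written as `"3.10"` rather than `3.10`, presumably because an unquoted YAML scalar of that form is read as a floating-point number, which would collapse it to `3.1`. The sketch below only imitates that behavior with Python's built-in `float`; it does not invoke a YAML parser, and the helper name `as_unquoted_yaml_scalar` is hypothetical.

```python
# Illustrative only: mimic what happens to a version number when it is
# read as a float (as an unquoted YAML scalar would be) and printed back.
def as_unquoted_yaml_scalar(text: str) -> str:
    """Return the string form of `text` after a round trip through float."""
    return str(float(text))


if __name__ == "__main__":
    for version in ["3.7", "3.8", "3.9", "3.10"]:
        # "3.10" loses its trailing zero; the others survive unchanged.
        print(version, "->", as_unquoted_yaml_scalar(version))
```

Quoting the value keeps it a string, so the matrix entry stays `3.10`.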

README.md

Lines changed: 93 additions & 46 deletions
@@ -34,34 +34,45 @@ This will automatically install the `polyfile` and `polymerge` executables in yo
 
 ```
 usage: polyfile [-h] [--filetype FILETYPE] [--list] [--html HTML]
-                [--try-all-offsets] [--only-match] [--debug] [--quiet]
-                [--version] [-dumpversion]
+                [--only-match-mime] [--only-match] [--require-match]
+                [--max-matches MAX_MATCHES] [--debug] [--trace] [--debugger]
+                [--no-debug-python] [--quiet] [--version] [-dumpversion]
                 [FILE]
 
 A utility to recursively map the structure of a file.
 
 positional arguments:
-  FILE                  The file to analyze; pass '-' or omit to read from
+  FILE                  the file to analyze; pass '-' or omit to read from
                         STDIN
 
 optional arguments:
   -h, --help            show this help message and exit
   --filetype FILETYPE, -f FILETYPE
-                        Explicitly match against the given filetype (default
-                        is to match against all filetypes)
+                        explicitly match against the given filetype or
+                        filetype wildcard (default is to match against all
+                        filetypes)
   --list, -l            list the supported filetypes (for the `--filetype`
                         argument) and exit
-  --html HTML, -t HTML  Path to write an interactive HTML file for exploring
+  --html HTML, -t HTML  path to write an interactive HTML file for exploring
                         the PDF
-  --try-all-offsets, -a
-                        Search for a file match at every possible offset; this
-                        can be very slow for larger files
-  --only-match, -m      Do not attempt to parse known filetypes; only match
+  --only-match-mime, -I
+                        just print out the matching MIME types for the file,
+                        one on each line
+  --only-match, -m      do not attempt to parse known filetypes; only match
                         against file magic
-  --debug, -d           Print debug information
-  --quiet, -q           Suppress all log output (overrides --debug)
-  --version, -v         Print PolyFile's version information to STDERR
-  -dumpversion          Print PolyFile's raw version information to STDOUT and
+  --require-match       if no matches are found, exit with code 127
+  --max-matches MAX_MATCHES
+                        stop scanning after having found this many matches
+  --debug, -d           print debug information
+  --trace, -dd          print extra verbose debug information
+  --debugger, -db       drop into an interactive debugger for libmagic file
+                        definition matching and PolyFile parsing
+  --no-debug-python     by default, the `--debugger` option will break on
+                        custom matchers and prompt to debug using PDB. This
+                        option will suppress those prompts.
+  --quiet, -q           suppress all log output (overrides --debug)
+  --version, -v         print PolyFile's version information to STDERR
+  -dumpversion          print PolyFile's raw version information to STDOUT and
                         exit
 ```
 
@@ -76,6 +87,12 @@ You can optionally have PolyFile output an interactive HTML page containing a la
 polyfile INPUT_FILE --html output.html > output.json
 ```
 
+### Interactive Debugger
+
+PolyFile has an interactive debugger both for its file matching and parsing. It can be used to debug a libmagic pattern
+definition, determine why a specific file fails to be classified as the expected MIME type, or step through a parser.
+You can run PolyFile with the debugger enabled using the `-db` option.
+
 ### File Support
 
 PolyFile has a cleanroom, [pure Python implementation of the libmagic file classifier](#libmagic-implementation), and
@@ -102,6 +119,12 @@ TrID matching code is still shipped with PolyFile and can be invoked programmati
 
 PolyFile outputs its mapping in an extension of the [SBuD](https://github.com/corkami/sbud) JSON format described [in the documentation](docs/json_format.md).
 
+PolyFile can also emit a standalone HTML document that contains an interactive hex viewer as well as syntax trees for
+the discovered file formats. Simply pass the `--html` argument to PolyFile with an output path:
+```console
+$ polyfile input_file --html output.html
+```
+
 ### libMagic Implementation
 
 PolyFile has a cleanroom implementation of [libmagic (used in the `file` command)](https://github.com/file/file).
@@ -125,6 +148,32 @@ with open("file_to_test", "rb") as f:
 ...
 ```
 
+### Debugging the libmagic DSL
+`libmagic` has an esoteric, poorly documented domain-specific language (DSL) for specifying its matching signatures.
+You can read the minimal and—as we have discovered in our cleanroom implementation—_incomplete_ documentation by running
+`man 5 magic`. PolyFile implements an interactive debugger for stepping through the DSL specifications, modeled after
+GDB. You can enter this debugger by passing the `--debugger` or `-db` argument to PolyFile. It is useful for both
+implementing new `libmagic` DSLs, as well as figuring out why an existing DSL fails to match against a given file.
+```console
+$ polyfile -db input_file
+PolyFile 0.3.5
+Copyright ©2021 Trail of Bits
+Apache License Version 2.0 https://www.apache.org/licenses/
+
+For help, type "help".
+(polyfile) help
+help ....... print this message
+continue ... continue execution until the next breakpoint is hit
+step ....... step through a single magic test
+next ....... continue execution until the next test that matches
+where ...... print the context of the current magic test (aliases: info stack and backtrace)
+test ....... test the following libmagic DSL test at the current position
+print ...... print the computed absolute offset of the following libmagic DSL offset
+breakpoint . list the current breakpoints or add a new one
+delete ..... delete a breakpoint
+quit ....... exit the debugger
+```
+
 ## Merging Output From PolyTracker
 
 [PolyTracker](https://github.com/trailofbits/polytracker) is PolyFile’s sister utility for automatically instrumenting
@@ -138,42 +187,41 @@ A separate utility called `polymerge` is installed with PolyFile specifically de
 tools.
 
 ```
-usage: polyfile [-h] [--filetype FILETYPE] [--list] [--html HTML]
-                [--only-match-mime] [--only-match] [--require-match]
-                [--max-matches MAX_MATCHES] [--debug] [--trace] [--quiet]
-                [--version] [-dumpversion]
-                [FILE]
+usage: polymerge [-h] [--cfg CFG] [--cfg-pdf CFG_PDF]
+                 [--dataflow [DATAFLOW ...]] [--no-intermediate-functions]
+                 [--demangle] [--type-hierarchy TYPE_HIERARCHY]
+                 [--type-hierarchy-pdf TYPE_HIERARCHY_PDF] [--diff [DIFF ...]]
+                 [--debug] [--quiet] [--version] [-dumpversion]
+                 FILES [FILES ...]
 
-A utility to recursively map the structure of a file.
+A utility to merge the JSON output of `polyfile`
+with a polytracker.json file from PolyTracker.
+
+https://github.com/trailofbits/polyfile/
+https://github.com/trailofbits/polytracker/
 
 positional arguments:
-  FILE                  the file to analyze; pass '-' or omit to read from
-                        STDIN
+  FILES                 Path to the PolyFile JSON output and/or the PolyTracker JSON output. Merging will only occur if both files are provided. The `--cfg` and `--type-hierarchy` options can be used if only a single file is provided, but no merging will occur.
 
 optional arguments:
   -h, --help            show this help message and exit
-  --filetype FILETYPE, -f FILETYPE
-                        explicitly match against the given filetype or
-                        filetype wildcard (default is to match against all
-                        filetypes)
-  --list, -l            list the supported filetypes (for the `--filetype`
-                        argument) and exit
-  --html HTML, -t HTML  path to write an interactive HTML file for exploring
-                        the PDF
-  --only-match-mime, -I
-                        just print out the matching MIME types for the file,
-                        one on each line
-  --only-match, -m      do not attempt to parse known filetypes; only match
-                        against file magic
-  --require-match       if no matches are found, exit with code 127
-  --max-matches MAX_MATCHES
-                        stop scanning after having found this many matches
-  --debug, -d           print debug information
-  --trace, -dd          print extra verbose debug information
-  --quiet, -q           suppress all log output (overrides --debug)
-  --version, -v         print PolyFile's version information to STDERR
-  -dumpversion          print PolyFile's raw version information to STDOUT and
-                        exit
+  --cfg CFG, -c CFG     Optional path to output a Graphviz .dot file representing the control flow graph of the program trace
+  --cfg-pdf CFG_PDF, -p CFG_PDF
+                        Similar to --cfg, but renders the graph to a PDF instead of outputting the .dot source
+  --dataflow [DATAFLOW ...]
+                        For the CFG generation options, only render functions that participated in dataflow. `--dataflow 10` means that only functions in the dataflow related to byte 10 should be included. `--dataflow 10:30` means that only functions operating on bytes 10 through 29 should be included. The beginning or end of a range can be omitted and will default to the beginning and end of the file, respectively. Multiple `--dataflow` ranges can be specified. `--dataflow :` will render the CFG only with functions that operated on tainted bytes. Omitting `--dataflow` will produce a CFG containing all functions.
+  --no-intermediate-functions
+                        To be used in conjunction with `--dataflow`. If enabled, only functions in the dataflow graph if they operated on the tainted bytes. This can result in a disjoint dataflow graph.
+  --demangle            Demangle C++ function names in the CFG (requires that PolyFile was installed with the `demangle` option, or that the `cxxfilt` Python module is installed.)
+  --type-hierarchy TYPE_HIERARCHY, -t TYPE_HIERARCHY
+                        Optional path to output a Graphviz .dot file representing the type hierarchy extracted from PolyFile
+  --type-hierarchy-pdf TYPE_HIERARCHY_PDF, -y TYPE_HIERARCHY_PDF
+                        Similar to --type-hierarchy, but renders the graph to a PDF instead of outputting the .dot source
+  --diff [DIFF ...]     Diff an arbitrary number of input polytracker.json files, all treated as the same class, against one or more polytracker.json provided after `--diff` arguments
+  --debug, -d           Print debug information
+  --quiet, -q           Suppress all log output (overrides --debug)
+  --version, -v         Print PolyMerge's version information and exit
+  -dumpversion          Print PolyMerge's raw version information and exit
 ```
 
 The output of `polymerge` is the same as [PolyFile’s output format](docs/json_format.md), augmented with the following:
@@ -202,5 +250,4 @@ This research was developed by [Trail of
 Bits](https://www.trailofbits.com/) with funding from the Defense
 Advanced Research Projects Agency (DARPA) under the SafeDocs program
 as a subcontractor to [Galois](https://galois.com). It is licensed under the [Apache 2.0 license](LICENSE).
-The [PDF parser](polyfile/pdfparser.py) is modified from the parser developed by Didier Stevens and released into the
-public domain. © 2019, Trail of Bits.
+© 2019, Trail of Bits.
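
The `--dataflow` help text in the `polymerge` hunk above describes a small range syntax: a single offset, a half-open `start:end` range, and open-ended forms. As a reading aid, here is a minimal sketch of those semantics as stated in the help text. The function name `parse_dataflow_range` and the `file_size` parameter are hypothetical and are not taken from the PolyFile source.

```python
def parse_dataflow_range(spec: str, file_size: int) -> range:
    """Translate one `--dataflow` value into a range of byte offsets.

    Hypothetical helper that follows the help text only:
      "10"     -> byte 10
      "10:30"  -> bytes 10 through 29 (the end is exclusive)
      ":30"    -> start of file through byte 29
      "10:"    -> byte 10 through end of file
      ":"      -> every byte in the file
    """
    if ":" not in spec:
        offset = int(spec)
        return range(offset, offset + 1)
    start_text, end_text = spec.split(":", 1)
    # An omitted bound defaults to the start or end of the file.
    start = int(start_text) if start_text else 0
    end = int(end_text) if end_text else file_size
    return range(start, end)


if __name__ == "__main__":
    for example in ["10", "10:30", ":30", "10:", ":"]:
        print(example, "->", parse_dataflow_range(example, file_size=100))
```

Returning a `range` keeps the exclusive upper bound explicit, which matches the "10 through 29" wording for `10:30`.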

hooks/README.md

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,6 @@
1+
# Default Git Hooks for PolyFile Development
2+
3+
To enable these hooks, developers must run this after cloning the repo:
4+
```bash
5+
$ git config core.hooksPath ./hooks
6+
```

hooks/pre-commit

Lines changed: 49 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,49 @@
1+
#!/bin/sh
2+
#
3+
# An example hook script to verify what is about to be committed.
4+
# Called by "git commit" with no arguments. The hook should
5+
# exit with non-zero status after issuing an appropriate message if
6+
# it wants to stop the commit.
7+
#
8+
# To enable this hook, rename this file to "pre-commit".
9+
10+
if git rev-parse --verify HEAD >/dev/null 2>&1
11+
then
12+
against=HEAD
13+
else
14+
# Initial commit: diff against an empty tree object
15+
against=$(git hash-object -t tree /dev/null)
16+
fi
17+
18+
# If you want to allow non-ASCII filenames set this variable to true.
19+
allownonascii=$(git config --bool hooks.allownonascii)
20+
21+
# Redirect output to stderr.
22+
exec 1>&2
23+
24+
# Cross platform projects tend to avoid non-ASCII filenames; prevent
25+
# them from being added to the repository. We exploit the fact that the
26+
# printable range starts at the space character and ends with tilde.
27+
if [ "$allownonascii" != "true" ] &&
28+
# Note that the use of brackets around a tr range is ok here, (it's
29+
# even required, for portability to Solaris 10's /usr/bin/tr), since
30+
# the square bracket bytes happen to fall in the designated range.
31+
test $(git diff --cached --name-only --diff-filter=A -z $against |
32+
LC_ALL=C tr -d '[ -~]\0' | wc -c) != 0
33+
then
34+
cat <<\EOF
35+
Error: Attempt to add a non-ASCII file name.
36+
37+
This can cause problems if you want to work with people on other platforms.
38+
39+
To be portable it is advisable to rename the file.
40+
41+
If you know what you are doing you can disable this check using:
42+
43+
git config hooks.allownonascii true
44+
EOF
45+
exit 1
46+
fi
47+
48+
# If there are whitespace errors, print the offending file names and fail.
49+
exec git diff-index --check --cached $against --
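
The filename check in this hook relies on a compact shell idiom: `tr -d '[ -~]\0'` deletes every printable ASCII byte (space through tilde) and the NUL separators, so any byte left for `wc -c` to count must lie outside that range. The Python sketch below restates the same test for readers less used to `tr`. It is an illustration only, not part of the commit, and the function name is hypothetical.

```python
from typing import Iterable


def has_disallowed_filename_bytes(names: Iterable[bytes]) -> bool:
    """Return True if any name holds a byte outside printable ASCII.

    Mirrors the hook's `tr -d '[ -~]\\0' | wc -c` test: bytes from
    0x20 (space) through 0x7E (tilde) are allowed, anything else counts.
    """
    for name in names:
        if any(byte < 0x20 or byte > 0x7E for byte in name):
            return True
    return False


if __name__ == "__main__":
    # The second name contains the UTF-8 encoding of a non-ASCII letter.
    print(has_disallowed_filename_bytes([b"README.md", b"hooks/pre-commit"]))
    print(has_disallowed_filename_bytes([b"na\xc3\xafve.txt"]))
```

As in the shell version, control characters such as tabs are rejected along with bytes above 0x7E.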

hooks/pre-push

Lines changed: 82 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,82 @@
1+
#!/bin/sh
2+
3+
# An example hook script to verify what is about to be pushed. Called by "git
4+
# push" after it has checked the remote status, but before anything has been
5+
# pushed. If this script exits with a non-zero status nothing will be pushed.
6+
#
7+
# This hook is called with the following parameters:
8+
#
9+
# $1 -- Name of the remote to which the push is being done
10+
# $2 -- URL to which the push is being done
11+
#
12+
# If pushing without using a named remote those arguments will be equal.
13+
#
14+
# Information about the commits which are being pushed is supplied as lines to
15+
# the standard input in the form:
16+
#
17+
# <local ref> <local sha1> <remote ref> <remote sha1>
18+
#
19+
# This sample shows how to prevent push of commits where the log message starts
20+
# with "WIP" (work in progress).
21+
22+
#remote="$1"
23+
#url="$2"
24+
#
25+
#z40=0000000000000000000000000000000000000000
26+
#
27+
#while read local_ref local_sha remote_ref remote_sha
28+
#do
29+
# if [ "$local_sha" = $z40 ]
30+
# then
31+
# # Handle delete
32+
# :
33+
# else
34+
# if [ "$remote_sha" = $z40 ]
35+
# then
36+
# # New branch, examine all commits
37+
# range="$local_sha"
38+
# else
39+
# # Update to existing branch, examine new commits
40+
# range="$remote_sha..$local_sha"
41+
# fi
42+
#
43+
# # Check for WIP commit
44+
# commit=`git rev-list -n 1 --grep '^WIP' "$range"`
45+
# if [ -n "$commit" ]
46+
# then
47+
# echo >&2 "Found WIP commit in $local_ref, not pushing"
48+
# exit 1
49+
# fi
50+
# fi
51+
#done
52+
53+
# We could do the following as a `pre-commit` hook, but it's expensive, so only do it pre-push:
54+
echo Linting Python code...
55+
flake8 polyfile tests --exclude polyfile/kaitai/parsers/ --select=E9,F63,F7,F82 1>/dev/null 2>/dev/null
56+
RESULT=$?
57+
if [ $RESULT -ne 0 ]; then
58+
cat <<\EOF
59+
Failed Python lint:
60+
61+
flake8 polyfile tests --exclude polyfile/kaitai/parsers/ --count --select=E9,F63,F7,F82 --show-source --statistics
62+
EOF
63+
exit 1
64+
fi
65+
66+
#flake8 polyfile tests --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics 1>/dev/null 2>/dev/null
67+
#RESULT=$?
68+
#if [ $RESULT -ne 0 ]; then
69+
# cat <<\EOF
70+
#Failed Python lint:
71+
#
72+
# flake8 polyfile tests --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
73+
#EOF
74+
# exit 1
75+
#fi
76+
77+
echo Running Tests...
78+
pytest tests
79+
80+
# echo Type-checking Python code...
81+
# mypy --ignore-missing-imports polyfile tests
82+
#exit $?

polyfile/__init__.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,3 @@
1-
from . import nes, pdf, zipmatcher, trid, kaitaimatcher, polyfile
1+
from . import nes, pdf, jpeg, zipmatcher, kaitaimatcher, languagematcher, polyfile
22
from .__main__ import main
33
from .polyfile import __version__
