Commit 612e27e

refactor: improve experimental source code pattern analysis of pypi packages (#965)
Include support for using Semgrep for analysis of source code to detect malicious code patterns, specified using Semgrep's YAML files.

Signed-off-by: Carl Flottmann <[email protected]>
1 parent 1c65d5f commit 612e27e

35 files changed: 2245 additions, 720 deletions

.pre-commit-config.yaml

Lines changed: 18 additions & 0 deletions
@@ -30,6 +30,7 @@ repos:
   - id: isort
     name: Sort import statements
     args: [--settings-path, pyproject.toml]
+    exclude: ^tests/malware_analyzer/pypi/resources/sourcecode_samples.*

 # Add Black code formatters.
 - repo: https://github.com/ambv/black
@@ -38,6 +39,7 @@ repos:
   - id: black
     name: Format code
     args: [--config, pyproject.toml]
+    exclude: ^tests/malware_analyzer/pypi/resources/sourcecode_samples.*
 - repo: https://github.com/asottile/blacken-docs
   rev: 1.19.1
   hooks:
@@ -65,6 +67,7 @@ repos:
     files: ^src/macaron/|^tests/
     types: [text, python]
     additional_dependencies: [flake8-bugbear==22.10.27, flake8-builtins==2.0.1, flake8-comprehensions==3.10.1, flake8-docstrings==1.6.0, flake8-mutable==1.2.0, flake8-noqa==1.4.0, flake8-pytest-style==1.6.0, flake8-rst-docstrings==0.3.0, pep8-naming==0.13.2]
+    exclude: ^tests/malware_analyzer/pypi/resources/sourcecode_samples.*
     args: [--config, .flake8]

 # Check GitHub Actions workflow files.
@@ -82,6 +85,7 @@ repos:
     entry: pylint
     language: python
     files: ^src/macaron/|^tests/
+    exclude: ^tests/malware_analyzer/pypi/resources/sourcecode_samples.*
     types: [text, python]
     args: [--rcfile, pyproject.toml]

@@ -94,6 +98,7 @@ repos:
     language: python
     files: ^src/macaron/|^tests/
     types: [text, python]
+    exclude: ^tests/malware_analyzer/pypi/resources/sourcecode_samples.*
     args: [--show-traceback, --config-file, pyproject.toml]

 # Check for potential security issues.
@@ -106,6 +111,7 @@ repos:
     files: ^src/macaron/|^tests/
     types: [text, python]
     additional_dependencies: ['bandit[toml]']
+    exclude: ^tests/malware_analyzer/pypi/resources/sourcecode_samples.*

 # Enable a whole bunch of useful helper hooks, too.
 # See https://pre-commit.com/hooks.html for more hooks.
@@ -197,6 +203,18 @@ repos:
     always_run: true
     pass_filenames: false

+# Checks that tests/malware_analyzer/pypi/resources/sourcecode_samples files do not have executable permissions
+# This is another measure to make sure the files can't be accidentally executed
+- repo: local
+  hooks:
+  - id: sourcecode-sample-permissions
+    name: Sourcecode sample executable permissions checker
+    entry: scripts/dev_scripts/samples_permissions_checker.sh
+    language: system
+    always_run: true
+    pass_filenames: false
+
+
 # A linter for Golang
 - repo: https://github.com/golangci/golangci-lint
   rev: v1.64.6
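
Note on the `exclude` entries above: pre-commit treats `exclude` as a Python regular expression searched against each repo-relative file path. The sketch below reproduces that matching with the standard `re` module to show which paths the new pattern skips. The sample paths are hypothetical and chosen only for illustration.

```python
import re

# The exclude pattern added to each hook in this commit.
EXCLUDE = re.compile(r"^tests/malware_analyzer/pypi/resources/sourcecode_samples.*")


def is_excluded(path: str) -> bool:
    """Return True if a repo-relative path matches the exclude pattern."""
    return EXCLUDE.search(path) is not None


# Hypothetical paths, for illustration only.
paths = [
    "tests/malware_analyzer/pypi/resources/sourcecode_samples/obfuscation/sample.py",
    "tests/malware_analyzer/pypi/test_example.py",
    "src/macaron/__main__.py",
]
results = {path: is_excluded(path) for path in paths}
```

Because the pattern has no trailing slash before `.*`, it would also match a sibling directory whose name merely starts with `sourcecode_samples`.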

.semgrepignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+# Items added to this file will be ignored by Semgrep.

CONTRIBUTING.md

Lines changed: 4 additions & 0 deletions
@@ -72,6 +72,10 @@ See below for instructions to set up the development environment.
 - PRs should be merged using the `Squash and merge` strategy. In most cases a single commit with
   a detailed commit message body is preferred. Make sure to keep the `Signed-off-by` line in the body.

+### PyPI Malware Detection Contribution
+
+Please see the [README for the malware analyzer](./src/macaron/malware_analyzer/README.md) for information on contributing Heuristics and code patterns.
+
 ## Branching model

 * The `main` branch should be used as the base branch for pull requests. The `release` branch is designated for releases and should only be merged into when creating a new release for Macaron.

docker/Dockerfile.final

Lines changed: 1 addition & 1 deletion
@@ -46,7 +46,7 @@ RUN : \
     && . .venv/bin/activate \
     && pip install --no-compile --no-cache-dir --upgrade pip setuptools \
     && find $HOME/dist -depth \( -type f \( -name "macaron-*.whl" \) \) -exec pip install --no-compile --no-cache-dir '{}' \; \
-    && pip uninstall semgrep \
+    && pip uninstall semgrep -y \
     && find $HOME/dist -depth \( -type f \( -name "semgrep-*.whl" \) \) -exec pip install --no-compile --no-cache-dir '{}' \; \
     && rm -rf $HOME/dist \
     && deactivate

docs/source/pages/developers_guide/apidoc/macaron.malware_analyzer.pypi_heuristics.sourcecode.rst

Lines changed: 8 additions & 0 deletions
@@ -9,6 +9,14 @@ macaron.malware\_analyzer.pypi\_heuristics.sourcecode package
 Submodules
 ----------

+macaron.malware\_analyzer.pypi\_heuristics.sourcecode.pypi\_sourcecode\_analyzer module
+---------------------------------------------------------------------------------------
+
+.. automodule:: macaron.malware_analyzer.pypi_heuristics.sourcecode.pypi_sourcecode_analyzer
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
 macaron.malware\_analyzer.pypi\_heuristics.sourcecode.suspicious\_setup module
 ------------------------------------------------------------------------------

pyproject.toml

Lines changed: 6 additions & 5 deletions
@@ -37,6 +37,7 @@ dependencies = [
     "beautifulsoup4 >= 4.12.0,<5.0.0",
     "problog >= 2.2.6,<3.0.0",
     "cryptography >=44.0.0,<45.0.0",
+    "semgrep == 1.113.0",
 ]
 keywords = []
 # https://pypi.org/classifiers/
@@ -119,12 +120,14 @@ Issues = "https://github.com/oracle/macaron/issues"
 [tool.bandit]
 tests = []
 skips = ["B101"]
-
+exclude_dirs = ['tests/malware_analyzer/pypi/resources/sourcecode_samples']

 # https://github.com/psf/black#configuration
 [tool.black]
 line-length = 120
-
+force-exclude = '''
+tests/malware_analyzer/pypi/resources/sourcecode_samples/
+'''

 # https://github.com/commitizen-tools/commitizen
 # https://commitizen-tools.github.io/commitizen/bump/
@@ -170,7 +173,6 @@ exclude = [
     "SECURITY.md",
 ]

-
 # https://pycqa.github.io/isort/
 [tool.isort]
 profile = "black"
@@ -181,7 +183,6 @@ skip_gitignore = true

 # https://mypy.readthedocs.io/en/stable/config_file.html#using-a-pyproject-toml
 [tool.mypy]
-# exclude=
 show_error_codes = true
 show_column_numbers = true
 check_untyped_defs = true
@@ -209,7 +210,6 @@ module = [
 ]
 ignore_missing_imports = true

-
 # https://pylint.pycqa.org/en/latest/user_guide/configuration/index.html
 [tool.pylint.MASTER]
 fail-under = 10.0
@@ -261,6 +261,7 @@ addopts = """-vv -ra --tb native \
     --doctest-modules --doctest-continue-on-failure --doctest-glob '*.rst' \
     --cov macaron \
     --ignore tests/integration \
+    --ignore tests/malware_analyzer/pypi/resources/sourcecode_samples \
 """ # Consider adding --pdb
 # https://docs.python.org/3/library/doctest.html#option-flags
 doctest_optionflags = "IGNORE_EXCEPTION_DETAIL"
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+#!/usr/bin/env bash
+
+# Copyright (c) 2022 - 2025, Oracle and/or its affiliates. All rights reserved.
+# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.
+
+#
+# Checks if the files in tests/malware_analyzer/pypi/resources/sourcecode_samples have executable permissions,
+# failing if any do.
+#
+
+# Strict bash options.
+#
+# -e: exit immediately if a command fails (with non-zero return code),
+#     or if a function returns non-zero.
+#
+# -u: treat unset variables and parameters as error when performing
+#     parameter expansion.
+#     In case a variable ${VAR} is unset but we still need to expand,
+#     use the syntax ${VAR:-} to expand it to an empty string.
+#
+# -o pipefail: set the return value of a pipeline to the value of the last
+#     (rightmost) command to exit with a non-zero status, or zero
+#     if all commands in the pipeline exit successfully.
+#
+# Reference: https://www.gnu.org/software/bash/manual/html_node/The-Set-Builtin.html.
+set -euo pipefail
+
+MACARON_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && cd ../.. && pwd)"
+SAMPLES_PATH="${MACARON_DIR}/tests/malware_analyzer/pypi/resources/sourcecode_samples"
+
+# any files have any of the executable bits set
+executables=$( ( find "$SAMPLES_PATH" -type f -perm -u+x -o -type f -perm -g+x -o -type f -perm -o+x | sed "s|$MACARON_DIR/||"; git ls-files "$SAMPLES_PATH" --full-name) | sort | uniq -d)
+if [ -n "$executables" ]; then
+    echo "The following files should not have any executable permissions:"
+    echo "$executables"
+    exit 1
+fi
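
For readers less familiar with `find -perm`, the permission test in the script above can be sketched in Python. This is an illustrative approximation only, assuming a POSIX filesystem: it flags any regular file with a user, group, or other executable bit set, but unlike the script it does not intersect the result with `git ls-files`. The function name `find_executables` and the temporary files are hypothetical.

```python
import stat
import tempfile
from pathlib import Path

# Any of the user, group, or other executable bits.
EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def find_executables(root: Path) -> list[str]:
    """Return paths (relative to root) of regular files with any executable bit set."""
    found = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.stat().st_mode & EXEC_BITS:
            found.append(path.relative_to(root).as_posix())
    return found


# Demonstration on throwaway files rather than the real sample directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "safe.py").write_text("x = 1\n")
    (root / "safe.py").chmod(0o644)
    (root / "risky.py").write_text("x = 2\n")
    (root / "risky.py").chmod(0o755)
    flagged = find_executables(root)
```

A hook built on this idea would exit non-zero whenever `flagged` is non-empty, mirroring the `exit 1` branch of the script.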

src/macaron/__main__.py

Lines changed: 20 additions & 3 deletions
@@ -96,6 +96,10 @@ def analyze_slsa_levels_single(analyzer_single_args: argparse.Namespace) -> None

         global_config.local_maven_repo = user_provided_local_maven_repo

+    if analyzer_single_args.force_analyze_source and not analyzer_single_args.analyze_source:
+        logger.error("'--force-analyze-source' requires '--analyze-source'.")
+        sys.exit(os.EX_USAGE)
+
     analyzer = Analyzer(global_config.output_path, global_config.build_log_path)

     # Initiate reporters.
@@ -172,8 +176,9 @@ def analyze_slsa_levels_single(analyzer_single_args: argparse.Namespace) -> None
             analyzer_single_args.sbom_path,
             deps_depth,
             provenance_payload=prov_payload,
-            validate_malware=analyzer_single_args.validate_malware,
             verify_provenance=analyzer_single_args.verify_provenance,
+            analyze_source=analyzer_single_args.analyze_source,
+            force_analyze_source=analyzer_single_args.force_analyze_source,
         )
         sys.exit(status_code)

@@ -477,10 +482,22 @@ def main(argv: list[str] | None = None) -> None:
     )

     single_analyze_parser.add_argument(
-        "--validate-malware",
+        "--analyze-source",
         required=False,
         action="store_true",
-        help=("Enable malware validation."),
+        help=(
+            "For improved malware detection, analyze the source code of the"
+            + " (PyPI) package using a textual scan and dataflow analysis."
+        ),
+    )
+
+    single_analyze_parser.add_argument(
+        "--force-analyze-source",
+        required=False,
+        action="store_true",
+        help=(
+            "Forces PyPI sourcecode analysis to run regardless of other heuristic results. Requires '--analyze-source'."
+        ),
     )

     single_analyze_parser.add_argument(
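
The dependency between the two new flags in the hunks above can be tried in isolation. The following is a minimal sketch, not Macaron's actual parser: `build_parser` and `validate` are hypothetical helpers, and the check returns an error message instead of calling `sys.exit(os.EX_USAGE)` so that it is easy to test.

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a stand-in parser that has only the two flags from this commit."""
    parser = argparse.ArgumentParser(prog="analyze-sketch")
    parser.add_argument("--analyze-source", required=False, action="store_true")
    parser.add_argument("--force-analyze-source", required=False, action="store_true")
    return parser


def validate(args: argparse.Namespace) -> Optional[str]:
    """Return an error message if the flag combination is invalid, else None."""
    if args.force_analyze_source and not args.analyze_source:
        return "'--force-analyze-source' requires '--analyze-source'."
    return None


parser = build_parser()
both_flags = validate(parser.parse_args(["--analyze-source", "--force-analyze-source"]))
force_only = validate(parser.parse_args(["--force-analyze-source"]))
no_flags = validate(parser.parse_args([]))
```

Only the `force_only` case yields an error message. The other two combinations pass validation.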

src/macaron/config/defaults.ini

Lines changed: 24 additions & 0 deletions
@@ -611,3 +611,27 @@ scaling = 0.15
 cost = 1.0
 # The path to the file that contains the list of popular packages.
 popular_packages_path =
+
+# ==== The following sections are for source code analysis using Semgrep ====
+# rulesets: a reference to a 'ruleset' in this section refers to a Semgrep .yaml file containing one or more rules.
+# rules: a reference to a 'rule' in this section refers to an individual rule ID, specified by the '- id:' field in
+# the Semgrep .yaml file.
+# default rulesets: these are a collection of rulesets provided with Macaron which are run by default with the sourcecode
+# analyzer. These live in src/macaron/resources/pypi_malware_rules.
+# custom rulesets: this is a collection of user-provided rulesets, living inside the path provided to 'custom_semgrep_rules_path'.
+
+# disable default semgrep rulesets here (i.e. all rule IDs in a Semgrep .yaml file) using ruleset names, the name
+# without the .yaml extension. Currently, we disable the exfiltration rulesets by default due to a high false positive rate.
+# This list may not contain duplicated elements. Macaron's default ruleset names are all unique.
+disabled_default_rulesets = exfiltration
+# disable individual rules here (i.e. individual rule IDs inside a Semgrep .yaml file) using rule IDs. You may also
+# provide the IDs of your custom semgrep rules here too, as all Semgrep rule IDs must be unique. This list may not contain
+# duplicated elements.
+disabled_rules =
+# absolute path to a directory where a custom set of semgrep rules for source code analysis are stored. These will be included
+# with Macaron's default rules. The path will be normalised to the OS path type.
+custom_semgrep_rules_path =
+# disable custom semgrep rulesets here (i.e. all rule IDs in a Semgrep .yaml file) using ruleset names, the name without the
+# .yaml extension. Note, this will be ignored if a path to custom semgrep rules is not provided. This list may not contain
+# duplicated elements, meaning that ruleset names must be unique.
+disabled_custom_rulesets =
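
To make the "no duplicated elements" constraint in the comments above concrete, here is a rough sketch of reading these keys with the standard `configparser` module. It is not Macaron's own loader. The section name `[example]`, the helper `read_list`, the sample rule IDs, and the comma-separated list format are all assumptions made for illustration.

```python
import configparser

# Hypothetical section name and rule IDs; only the four keys come from the diff.
SAMPLE = """
[example]
disabled_default_rulesets = exfiltration
disabled_rules = obfuscation_example-rule, exfiltration_example-rule
custom_semgrep_rules_path =
disabled_custom_rulesets =
"""


def read_list(section: configparser.SectionProxy, key: str) -> list[str]:
    """Split a comma-separated value into items and reject duplicates."""
    raw = section.get(key, "")
    items = [item.strip() for item in raw.split(",") if item.strip()]
    if len(items) != len(set(items)):
        raise ValueError(f"'{key}' contains duplicated elements: {items}")
    return items


parser = configparser.ConfigParser()
parser.read_string(SAMPLE)
section = parser["example"]

disabled_rulesets = read_list(section, "disabled_default_rulesets")
disabled_rules = read_list(section, "disabled_rules")
custom_path = section.get("custom_semgrep_rules_path", "").strip()
```

An empty `custom_path` corresponds to the case described above where no custom rules directory is supplied, so `disabled_custom_rulesets` would be ignored.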

src/macaron/errors.py

Lines changed: 4 additions & 0 deletions
@@ -109,3 +109,7 @@ class HeuristicAnalyzerValueError(MacaronError):

 class LocalArtifactFinderError(MacaronError):
     """Happens when there is an error looking for local artifacts."""
+
+
+class SourceCodeError(MacaronError):
+    """Error for operations on package source code."""

src/macaron/malware_analyzer/README.md

Lines changed: 49 additions & 1 deletion
@@ -1,4 +1,4 @@
-# Implementation of Heuristic Malware Detector
+# Implementation of Malware Detector

 ## PyPI Ecosystem

@@ -56,6 +56,20 @@ When a heuristic fails, with `HeuristicResult.FAIL`, then that is an indicator b
 - **Description**: Checks if the package name is suspiciously similar to any package name in a predefined list of popular packages. The similarity check incorporates the Jaro-Winkler distance and considers keyboard layout proximity to identify potential typosquatting.
 - **Rule**: Return `HeuristicResult.FAIL` if the similarity ratio between the package name and any popular package name meets or exceeds a defined threshold; otherwise, return `HeuristicResult.PASS`.
 - **Dependency**: None.
+### Source Code Analysis with Semgrep
+
+The following analyzer has been included as an optional feature, available by supplying `--analyze-source` in the CLI to `macaron analyze`:
+
+**PyPI Source Code Analyzer**
+- **Description**: Uses Semgrep, with default rules written in `src/macaron/resources/pypi_malware_rules` and custom rules available by supplying a path to `custom_semgrep_rules_path` in `defaults.ini`, to scan the package `.tar` source code.
+- **Rule**: If any Semgrep rule is triggered, the heuristic fails with `HeuristicResult.FAIL` and subsequently fails the package with `CheckResultType.FAILED`. If no rule is triggered, the heuristic passes with `HeuristicResult.PASS` and the `CheckResultType` result from the combination of all other heuristics is maintained.
+- **Dependency**: Will be run if the Source Code Repo fails. This dependency can be bypassed by supplying `--force-analyze-source` in the CLI, along with `--analyze-source`.
+
+This feature is currently a work in progress, and supports detection of code obfuscation techniques and remote exfiltration behaviors. It uses Semgrep OSS for detection. `defaults.ini` may be used to provide custom rules and exclude them:
+- `disabled_default_rulesets`: supply to this a comma-separated list of the names of default Semgrep rule files (excluding the `.yaml` extension) to disable all rule IDs in that file.
+- `disabled_rules`: supply to this a comma-separated list of individual rule IDs to disable (from both the default and custom list).
+- `custom_semgrep_rules_path`: supply to this an absolute path to a directory containing custom Semgrep `.yaml` files to be run alongside the default ones.
+- `disabled_custom_rulesets`: supply to this a comma-separated list of the names of custom Semgrep rule files (excluding the `.yaml` extension) to disable all rule IDs in that file.

 ### Contributing

@@ -64,13 +78,47 @@ When contributing an analyzer, it must meet the following requirements:
 - The analyzer must be implemented in a separate file, placed in the relevant folder based on what it analyzes ([metadata](./pypi_heuristics/metadata/) or [sourcecode](./pypi_heuristics/sourcecode/)).
 - The analyzer must inherit from the `BaseHeuristicAnalyzer` class and implement the `analyze` function, returning relevant information specific to the analysis.
 - The analyzer name must be added to [heuristics.py](./pypi_heuristics/heuristics.py) file so it can be used for rule combinations in [detect_malicious_metadata_check.py](../slsa_analyzer/checks/detect_malicious_metadata_check.py)
+- The analyzer must be added to the list of analyzers in `detect_malicious_metadata_check.py` to be run.
 - Update the `malware_rules_problog_model` in [detect_malicious_metadata_check.py](../slsa_analyzer/checks/detect_malicious_metadata_check.py) with logical statements where the heuristic should be included. When adding new rules, please follow the following guidelines:
 - Provide a [confidence value](../slsa_analyzer/checks/check_result.py) using the `Confidence` enum.
 - Ensure it is assigned to the `problog_result_access` string variable, otherwise it will not be queried and evaluated.
 - Assign a rule ID to the rule. This will be used to backtrack to determine if it was triggered.
 - Make sure to wrap pass/fail statements in `passed()` and `failed()`. Not doing so may result in undesirable behaviour, see the comments in the model for more details.
 - If there are commonly used combinations introduced by adding the heuristic, combine and justify them at the top of the static model (see `quickUndetailed` and `forceSetup` as current examples).

+**Contributing Code Pattern Rules**
+
+When contributing more Semgrep rules for `pypi_sourcecode_analyzer.py` to use, the following requirements must be met:
+
+- Semgrep `.yaml` rules are stored in `src/macaron/resources/pypi_malware_rules` and are named based on the category of code behaviors they detect.
+- If the rule comes under one of the already defined categories, place it within that `.yaml` file, else create a new `.yaml` file using the category name.
+- Each rule ID must be prefixed by the category followed by a single underscore ('_'), so for obfuscation rules in `obfuscation.yaml` each rule ID is prefixed with `obfuscation_`, followed by an ID which uses a hyphen ('-') as a separator.
+- Tests must be written for each rule contributed. These are stored in `tests/malware_analyzer/pypi/test_pypi_sourcescode_analyzer.py`.
+- These tests are written on a per-category basis, running each category individually. Each category must have a folder under `tests/malware_analyzer/pypi/resources/sourcecode_samples`.
+- Within these folders, there must be sample code patterns for testing, and a file `expected_results.json` with the expected JSON output of the analyzer for that category.
+- Each sample code pattern `.py` file must not have executable permissions and must include code that prevents it from being accidentally imported or run. The current files use this method:
+
+```
+"""
+Running this code will not produce any malicious behavior, but code isolation measures are
+in place for safety.
+"""
+
+import sys
+
+# ensure no symbols are exported so this code cannot accidentally be used
+__all__ = []
+sys.exit()
+
+def test_function():
+    """
+    All code to be tested will be defined inside this function, so it is all local to it. This is
+    to isolate the code to be tested, as it exists to replicate the patterns present in malware
+    samples.
+    """
+    sys.exit()
+```
+
 ### Confidence Score Motivation

 The original seven heuristics which started this work were Empty Project Link, Unreachable Project Links, One Release, High Release Frequency, Unchanged Release, Closer Release Join Date, and Suspicious Setup. These heuristics (excluding those with a dependency) were run on 1167 packages from trusted organizations, with the following results:
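
The rule ID naming requirement from the "Contributing Code Pattern Rules" list in the README hunk above (category, a single underscore, then a hyphen-separated ID) can be checked mechanically. The sketch below is one possible reading of that convention, not a check taken from Macaron itself. The helper `is_valid_rule_id`, the example IDs, and the restriction to lowercase letters and digits are assumptions.

```python
import re


def is_valid_rule_id(rule_id: str, category: str) -> bool:
    """Check that rule_id is '<category>_' followed by hyphen-separated words.

    Assumption: each word consists of lowercase letters and digits only.
    """
    pattern = re.compile(rf"{re.escape(category)}_[a-z0-9]+(?:-[a-z0-9]+)*")
    return pattern.fullmatch(rule_id) is not None


# Hypothetical rule IDs, for illustration only.
checks = {
    "obfuscation_decode-and-execute": is_valid_rule_id("obfuscation_decode-and-execute", "obfuscation"),
    "obfuscation_decode_and_execute": is_valid_rule_id("obfuscation_decode_and_execute", "obfuscation"),
    "obfuscation__double-underscore": is_valid_rule_id("obfuscation__double-underscore", "obfuscation"),
    "exfiltration_remote-send": is_valid_rule_id("exfiltration_remote-send", "obfuscation"),
}
```

Only the first ID passes: the second uses underscores as the separator, the third has a double underscore after the category, and the fourth belongs to a different category than the file it is checked against.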

src/macaron/malware_analyzer/pypi_heuristics/heuristics.py

Lines changed: 3 additions & 0 deletions
@@ -40,6 +40,9 @@ class Heuristics(str, Enum):
     #: Indicates that the package name is similar to a popular package.
     TYPOSQUATTING_PRESENCE = "typosquatting_presence"

+    #: Indicates that the package source code contains suspicious code patterns.
+    SUSPICIOUS_PATTERNS = "suspicious_patterns"
+

 class HeuristicResult(str, Enum):
     """Result type indicating the outcome of a heuristic."""
