refactor: improve experimental source code pattern analysis of pypi packages #965

Open
wants to merge 34 commits into
base: main
34 commits
9a191d5
refactor: refactoring existing source code analysis functionality
art1f1c3R Jan 17, 2025
52494f5
build: updated project to include semgrep as an experimental dependency
art1f1c3R Jan 20, 2025
5ecd8aa
refactor: support for semgrep as the code analysis tool
art1f1c3R Jan 23, 2025
d9aff2c
fix: entire source code is no longer stored in memory
art1f1c3R Jan 23, 2025
7a8d633
feat: support for semgrep rules, currently two implemented, with cust…
art1f1c3R Jan 30, 2025
515e502
test: setup test environment for source code analyzer
art1f1c3R Feb 3, 2025
3aaa808
test: finished sample test files for obfuscation rules
art1f1c3R Feb 4, 2025
ee95fb3
fix: obfuscation tests were incorrect
art1f1c3R Feb 4, 2025
21c6748
test: tests for exfiltration and fixes to semgrep rules
art1f1c3R Feb 4, 2025
6c1efd3
test: testing for invalid pathways in defaults configuration
art1f1c3R Feb 5, 2025
d3bf20c
feat: dependency on empty project link, and context manager for sourc…
art1f1c3R Feb 5, 2025
f23e84b
chore: added pre-commit hook for sourcecode sample files execution pe…
art1f1c3R Feb 5, 2025
890a54b
fix: path outputs are now relative to package, making tests work and …
art1f1c3R Feb 6, 2025
d5beddb
fix: semgrep now only runs open-source functionality, and disabled th…
art1f1c3R Feb 6, 2025
ffe11b0
test: added experimental feature to main malware check, tests updated…
art1f1c3R Feb 11, 2025
2499da1
chore: updated pre-commit hook to only consider tracked files
art1f1c3R Feb 12, 2025
dccf08b
chore: added oss only to semgrep validate
art1f1c3R Feb 12, 2025
75b8c11
chore: removed old code
art1f1c3R Feb 24, 2025
9303363
feat: updated semgrep rules to reduce false positives based on ICSE25…
art1f1c3R Feb 27, 2025
e14d202
test: fixed broken tests for semgrep rules
art1f1c3R Feb 27, 2025
064791a
fix: obfuscation rules has updated socket patterns
art1f1c3R Feb 27, 2025
f3d7607
feat: added new, refined inline imports rule back in
art1f1c3R Feb 27, 2025
01d4803
docs: made API docs and updated malware analyzer README
art1f1c3R Feb 27, 2025
92928ee
docs: updated README and CONTRIBUTING for information on how to contr…
art1f1c3R Mar 6, 2025
03198bd
chore: removed old unused suspicious pattern yaml file. preserved in …
art1f1c3R Mar 6, 2025
d4cb8a2
chore: updated sample permissions checker to have better error output
art1f1c3R Mar 10, 2025
22e59e0
chore: included semgrep message for each rule in JSON output for expl…
art1f1c3R Mar 17, 2025
14174de
fix: updated sourcecode analyzer name appropriately
art1f1c3R Mar 26, 2025
7d07693
chore: sourcecode analyzer now depends on source code repo heuristic
art1f1c3R Mar 31, 2025
2ec3955
fix: now depends on source code repo being skipped as well
art1f1c3R Apr 14, 2025
75c99e4
chore: rebasing onto main
art1f1c3R Apr 17, 2025
c158416
chore: rebasing onto main
art1f1c3R Apr 17, 2025
a78390a
fix: build error after rebase fixed
art1f1c3R Apr 17, 2025
55a9dcb
fix: ci problems with formatting on test file
art1f1c3R Apr 17, 2025
18 changes: 18 additions & 0 deletions .pre-commit-config.yaml
Expand Up @@ -30,6 +30,7 @@ repos:
- id: isort
name: Sort import statements
args: [--settings-path, pyproject.toml]
exclude: ^tests/malware_analyzer/pypi/resources/sourcecode_samples.*

# Add Black code formatters.
- repo: https://github.com/ambv/black
Expand All @@ -38,6 +39,7 @@ repos:
- id: black
name: Format code
args: [--config, pyproject.toml]
exclude: ^tests/malware_analyzer/pypi/resources/sourcecode_samples.*
- repo: https://github.com/asottile/blacken-docs
rev: 1.19.1
hooks:
Expand Down Expand Up @@ -65,6 +67,7 @@ repos:
files: ^src/macaron/|^tests/
types: [text, python]
additional_dependencies: [flake8-bugbear==22.10.27, flake8-builtins==2.0.1, flake8-comprehensions==3.10.1, flake8-docstrings==1.6.0, flake8-mutable==1.2.0, flake8-noqa==1.4.0, flake8-pytest-style==1.6.0, flake8-rst-docstrings==0.3.0, pep8-naming==0.13.2]
exclude: ^tests/malware_analyzer/pypi/resources/sourcecode_samples.*
args: [--config, .flake8]

# Check GitHub Actions workflow files.
Expand All @@ -82,6 +85,7 @@ repos:
entry: pylint
language: python
files: ^src/macaron/|^tests/
exclude: ^tests/malware_analyzer/pypi/resources/sourcecode_samples.*
types: [text, python]
args: [--rcfile, pyproject.toml]

Expand All @@ -94,6 +98,7 @@ repos:
language: python
files: ^src/macaron/|^tests/
types: [text, python]
exclude: ^tests/malware_analyzer/pypi/resources/sourcecode_samples.*
args: [--show-traceback, --config-file, pyproject.toml]

# Check for potential security issues.
Expand All @@ -106,6 +111,7 @@ repos:
files: ^src/macaron/|^tests/
types: [text, python]
additional_dependencies: ['bandit[toml]']
exclude: ^tests/malware_analyzer/pypi/resources/sourcecode_samples.*

# Enable a whole bunch of useful helper hooks, too.
# See https://pre-commit.com/hooks.html for more hooks.
Expand Down Expand Up @@ -197,6 +203,18 @@ repos:
always_run: true
pass_filenames: false

# Checks that tests/malware_analyzer/pypi/resources/sourcecode_samples files do not have executable permissions
# This is another measure to make sure the files can't be accidentally executed
- repo: local
hooks:
- id: sourcecode-sample-permissions
name: Sourcecode sample executable permissions checker
entry: scripts/dev_scripts/samples_permissions_checker.sh
language: system
always_run: true
pass_filenames: false


# A linter for Golang
- repo: https://github.com/golangci/golangci-lint
rev: v1.64.6
Expand Down
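The `exclude` entries added to the hooks above are Python regular expressions, which pre-commit matches against repository-relative paths using `re.search`. The following is a minimal sketch of how the pattern behaves; the `is_excluded` helper and the sample paths are hypothetical and only serve as illustration.

```python
import re

# Pattern copied from the hook configuration above.
EXCLUDE = re.compile(r"^tests/malware_analyzer/pypi/resources/sourcecode_samples.*")


def is_excluded(path: str) -> bool:
    """Return True if a repository-relative path matches the exclude pattern."""
    return EXCLUDE.search(path) is not None


# Hypothetical paths, used only to exercise the pattern.
for candidate in (
    "tests/malware_analyzer/pypi/resources/sourcecode_samples/obfuscation/sample.py",
    "src/macaron/__main__.py",
):
    print(candidate, is_excluded(candidate))
```

Because the pattern has no trailing slash, it would also match a sibling directory whose name merely starts with `sourcecode_samples`; that may or may not be intended.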
1 change: 1 addition & 0 deletions .semgrepignore
@@ -0,0 +1 @@
# Items added to this file will be ignored by Semgrep.
4 changes: 4 additions & 0 deletions CONTRIBUTING.md
Expand Up @@ -72,6 +72,10 @@ See below for instructions to set up the development environment.
- PRs should be merged using the `Squash and merge` strategy. In most cases a single commit with
a detailed commit message body is preferred. Make sure to keep the `Signed-off-by` line in the body.

### PyPI Malware Detection Contribution

Please see the [README for the malware analyzer](./src/macaron/malware_analyzer/README.md) for information on contributing heuristics and code patterns.

## Branching model

* The `main` branch should be used as the base branch for pull requests. The `release` branch is designated for releases and should only be merged into when creating a new release for Macaron.
Expand Down
2 changes: 1 addition & 1 deletion Makefile
Expand Up @@ -206,7 +206,7 @@ upgrade: .venv/upgraded-on
.venv/upgraded-on: pyproject.toml
python -m pip install --upgrade pip
python -m pip install --upgrade wheel
python -m pip install --upgrade --upgrade-strategy eager --editable .[actions,dev,docs,hooks,test,test-docker]
python -m pip install --upgrade --upgrade-strategy eager --editable .[actions,dev,docs,hooks,test,test-docker,experimental]
$(MAKE) upgrade-quiet
force-upgrade:
rm -f .venv/upgraded-on
Expand Down
Expand Up @@ -9,6 +9,14 @@ macaron.malware\_analyzer.pypi\_heuristics.sourcecode package
Submodules
----------

macaron.malware\_analyzer.pypi\_heuristics.sourcecode.pypi\_sourcecode\_analyzer module
---------------------------------------------------------------------------------------

.. automodule:: macaron.malware_analyzer.pypi_heuristics.sourcecode.pypi_sourcecode_analyzer
:members:
:undoc-members:
:show-inheritance:

macaron.malware\_analyzer.pypi\_heuristics.sourcecode.suspicious\_setup module
------------------------------------------------------------------------------

Expand Down
10 changes: 5 additions & 5 deletions pyproject.toml
Expand Up @@ -106,6 +106,10 @@ test-docker = [
"ruamel.yaml >=0.18.6,<1.0.0",
]

experimental = [
"semgrep == 1.102.0",
]

[project.urls]
Homepage = "https://github.com/oracle/macaron"
Changelog = "https://github.com/oracle/macaron/blob/main/CHANGELOG.md"
Expand All @@ -119,12 +123,10 @@ Issues = "https://github.com/oracle/macaron/issues"
tests = []
skips = ["B101"]


# https://github.com/psf/black#configuration
[tool.black]
line-length = 120


# https://github.com/commitizen-tools/commitizen
# https://commitizen-tools.github.io/commitizen/bump/
[tool.commitizen]
Expand Down Expand Up @@ -169,7 +171,6 @@ exclude = [
"SECURITY.md",
]


# https://pycqa.github.io/isort/
[tool.isort]
profile = "black"
Expand All @@ -180,7 +181,6 @@ skip_gitignore = true

# https://mypy.readthedocs.io/en/stable/config_file.html#using-a-pyproject-toml
[tool.mypy]
# exclude=
show_error_codes = true
show_column_numbers = true
check_untyped_defs = true
Expand Down Expand Up @@ -208,7 +208,6 @@ module = [
]
ignore_missing_imports = true


# https://pylint.pycqa.org/en/latest/user_guide/configuration/index.html
[tool.pylint.MASTER]
fail-under = 10.0
Expand Down Expand Up @@ -260,6 +259,7 @@ addopts = """-vv -ra --tb native \
--doctest-modules --doctest-continue-on-failure --doctest-glob '*.rst' \
--cov macaron \
--ignore tests/integration \
--ignore tests/malware_analyzer/pypi/resources/sourcecode_samples \
""" # Consider adding --pdb
# https://docs.python.org/3/library/doctest.html#option-flags
doctest_optionflags = "IGNORE_EXCEPTION_DETAIL"
Expand Down
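Since Semgrep is installed only through the new `experimental` extra, code that relies on it may want to check for its presence first. This is a sketch using only the standard library; the `installed_version` helper is hypothetical and is not part of Macaron.

```python
from __future__ import annotations

from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution: str) -> str | None:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


semgrep_version = installed_version("semgrep")
if semgrep_version is None:
    print("semgrep is not installed; the 'experimental' extra provides it")
else:
    print(f"semgrep {semgrep_version} is available")
```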
37 changes: 37 additions & 0 deletions scripts/dev_scripts/samples_permissions_checker.sh
Review comment from the PR author: This has been run through ShellCheck; no issues were detected.

@@ -0,0 +1,37 @@
#!/usr/bin/env bash

# Copyright (c) 2022 - 2025, Oracle and/or its affiliates. All rights reserved.
# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.

#
# Checks if the files in tests/malware_analyzer/pypi/resources/sourcecode_samples have executable permissions,
# failing if any do.
#

# Strict bash options.
#
# -e: exit immediately if a command fails (with non-zero return code),
# or if a function returns non-zero.
#
# -u: treat unset variables and parameters as error when performing
# parameter expansion.
# In case a variable ${VAR} is unset but we still need to expand,
# use the syntax ${VAR:-} to expand it to an empty string.
#
# -o pipefail: set the return value of a pipeline to the value of the last
# (rightmost) command to exit with a non-zero status, or zero
# if all commands in the pipeline exit successfully.
#
# Reference: https://www.gnu.org/software/bash/manual/html_node/The-Set-Builtin.html.
set -euo pipefail

MACARON_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && cd ../.. && pwd)"
SAMPLES_PATH="${MACARON_DIR}/tests/malware_analyzer/pypi/resources/sourcecode_samples"

# Find tracked files that have any of the executable bits set.
executables=$( ( find "$SAMPLES_PATH" -type f -perm -u+x -o -type f -perm -g+x -o -type f -perm -o+x | sed "s|$MACARON_DIR/||"; git ls-files "$SAMPLES_PATH" --full-name) | sort | uniq -d)
if [ -n "$executables" ]; then
echo "The following files should not have any executable permissions:"
echo "$executables"
exit 1
fi
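For readers less familiar with `find -perm`, the permission test in the script above can be sketched in Python. This is an illustration under stated assumptions, not a replacement for the hook: `find_executables` is a hypothetical helper, and unlike the script it does not intersect the result with `git ls-files`, so untracked files are reported too.

```python
import os
import stat
from pathlib import Path

# Any of the user, group, or other execute bits.
EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def find_executables(root: Path) -> list[Path]:
    """Return regular files under root that have any executable bit set."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if path.is_file() and path.stat().st_mode & EXEC_BITS:
                found.append(path)
    return sorted(found)
```

Calling `find_executables(Path("tests/malware_analyzer/pypi/resources/sourcecode_samples"))` from the repository root should return an empty list when every sample file has its executable bits cleared.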
9 changes: 6 additions & 3 deletions src/macaron/__main__.py
Expand Up @@ -172,8 +172,8 @@ def analyze_slsa_levels_single(analyzer_single_args: argparse.Namespace) -> None
analyzer_single_args.sbom_path,
deps_depth,
provenance_payload=prov_payload,
validate_malware=analyzer_single_args.validate_malware,
verify_provenance=analyzer_single_args.verify_provenance,
analyze_source=analyzer_single_args.analyze_source,
)
sys.exit(status_code)

Expand Down Expand Up @@ -477,10 +477,13 @@ def main(argv: list[str] | None = None) -> None:
)

single_analyze_parser.add_argument(
"--validate-malware",
"--analyze-source",
required=False,
action="store_true",
help=("Enable malware validation."),
help=(
"EXPERIMENTAL. For improved malware detection, analyze the source code of the"
+ " (PyPI) package using a textual scan and dataflow analysis."
),
)

single_analyze_parser.add_argument(
Expand Down
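The new `--analyze-source` option is a plain `store_true` flag, so it defaults to `False` and becomes `True` only when supplied. A self-contained sketch of that behavior follows; the parser and subparser objects here are simplified stand-ins, not Macaron's real CLI setup, and the help text is shortened.

```python
import argparse

# Simplified stand-in for the real CLI; only the flag under discussion is defined.
parser = argparse.ArgumentParser(prog="macaron-sketch")
subparsers = parser.add_subparsers(dest="action")
analyze = subparsers.add_parser("analyze")
analyze.add_argument(
    "--analyze-source",
    required=False,
    action="store_true",
    help="EXPERIMENTAL. Analyze the source code of the (PyPI) package.",
)

args = parser.parse_args(["analyze", "--analyze-source"])
print(args.analyze_source)
```

argparse converts the hyphen in the option name to an underscore, which is why the value is read as `analyzer_single_args.analyze_source` in the hunk above.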
4 changes: 4 additions & 0 deletions src/macaron/config/defaults.ini
Expand Up @@ -593,3 +593,7 @@ major_threshold = 20
epoch_threshold = 3
# The number of days +/- the day of publish the calendar versioning day may be.
day_publish_error = 4

# Absolute path to where a custom set of Semgrep rules for source code analysis is stored. These will be included
# with Macaron's default rules. The path will be normalised to the OS path type.
custom_semgrep_rules =
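One way the empty-by-default `custom_semgrep_rules` option could be read and normalised is sketched below. The section name `heuristic.pypi`, the `custom_rules_path` helper, and the choice to reject relative paths are assumptions made for illustration; the hunk above does not show which section the option belongs to or how Macaron validates it.

```python
from __future__ import annotations

import configparser
import os

# Assumed section name; the diff only shows the option itself.
SAMPLE = """
[heuristic.pypi]
custom_semgrep_rules =
"""


def custom_rules_path(config: configparser.ConfigParser) -> str | None:
    """Return the normalised custom rules path, or None when the option is empty."""
    raw = config.get("heuristic.pypi", "custom_semgrep_rules", fallback="").strip()
    if not raw:
        return None
    path = os.path.normpath(raw)
    if not os.path.isabs(path):
        raise ValueError(f"custom_semgrep_rules must be an absolute path: {raw}")
    return path


config = configparser.ConfigParser()
config.read_string(SAMPLE)
print(custom_rules_path(config))
```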
4 changes: 4 additions & 0 deletions src/macaron/errors.py
Expand Up @@ -105,3 +105,7 @@ class HeuristicAnalyzerValueError(MacaronError):

class LocalArtifactFinderError(MacaronError):
"""Happens when there is an error looking for local artifacts."""


class SourceCodeError(MacaronError):
"""Error for operations on package source code."""
48 changes: 47 additions & 1 deletion src/macaron/malware_analyzer/README.md
@@ -1,4 +1,4 @@
# Implementation of Heuristic Malware Detector
# Implementation of Malware Detector

## PyPI Ecosystem

Expand Down Expand Up @@ -52,20 +52,66 @@ When a heuristic fails, with `HeuristicResult.FAIL`, then that is an indicator b
- **Rule**: Return `HeuristicResult.FAIL` if the major or epoch is abnormally high; otherwise, return `HeuristicResult.PASS`.
- **Dependency**: Will be run if the One Release heuristic fails.

### Experimental: Source Code Analysis with Semgrep

The following analyzer has been added as an experimental feature, available by passing `--analyze-source` to `macaron analyze` on the command line:

**PyPI Source Code Analyzer**
- **Description**: Uses Semgrep, with default rules written in `src/macaron/resources/pypi_malware_rules` and custom rules available by supplying a path to `custom_semgrep_rules` in `defaults.ini`, to scan the package `.tar` source code.
- **Rule**: If any Semgrep rule is triggered, the heuristic fails with `HeuristicResult.FAIL` and subsequently fails the package with `CheckResultType.FAILED`. If no rule is triggered, the heuristic passes with `HeuristicResult.PASS` and the `CheckResultType` result from the combination of all other heuristics is maintained.
- **Dependency**: Will be run if the Source Code Repo heuristic fails.

This feature is currently a work in progress, and supports detection of code obfuscation techniques and remote exfiltration behaviors. It uses Semgrep OSS for detection.
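The pass/fail rule described above can be sketched as a small function over Semgrep's JSON output, which reports findings in a top-level `results` list with `check_id`, `path`, `start`, and `extra.message` fields. The `HeuristicResult` enum here is a local stand-in with assumed values, and `evaluate` is a hypothetical helper; the actual analyzer may structure its output differently.

```python
from __future__ import annotations

from enum import Enum


class HeuristicResult(str, Enum):
    """Stand-in for Macaron's enum of the same name (member values are assumptions)."""

    PASS = "PASS"
    FAIL = "FAIL"


def evaluate(semgrep_output: dict) -> tuple[HeuristicResult, dict]:
    """Fail if any rule fired, and group the findings by rule ID for reporting."""
    findings: dict[str, list[dict]] = {}
    for result in semgrep_output.get("results", []):
        findings.setdefault(result["check_id"], []).append(
            {
                "file": result["path"],
                "line": result["start"]["line"],
                "message": result["extra"]["message"],
            }
        )
    outcome = HeuristicResult.FAIL if findings else HeuristicResult.PASS
    return outcome, findings
```

Grouping by rule ID keeps the explanation for each triggered rule together, which fits the commit that adds the Semgrep message for each rule to the JSON output.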

### Contributing

When contributing an analyzer, it must meet the following requirements:

- The analyzer must be implemented in a separate file, placed in the relevant folder based on what it analyzes ([metadata](./pypi_heuristics/metadata/) or [sourcecode](./pypi_heuristics/sourcecode/)).
- The analyzer must inherit from the `BaseHeuristicAnalyzer` class and implement the `analyze` function, returning relevant information specific to the analysis.
- The analyzer name must be added to the [heuristics.py](./pypi_heuristics/heuristics.py) file so it can be used for rule combinations in [detect_malicious_metadata_check.py](../slsa_analyzer/checks/detect_malicious_metadata_check.py).
- The analyzer must be added to the list of analyzers in `detect_malicious_metadata_check.py` to be run.
- Update the `malware_rules_problog_model` in [detect_malicious_metadata_check.py](../slsa_analyzer/checks/detect_malicious_metadata_check.py) with logical statements where the heuristic should be included. When adding new rules, please follow these guidelines:
- Provide a [confidence value](../slsa_analyzer/checks/check_result.py) using the `Confidence` enum.
- Ensure it is assigned to the `problog_result_access` string variable, otherwise it will not be queried and evaluated.
- Assign a rule ID to the rule. This will be used to backtrack to determine if it was triggered.
- Make sure to wrap pass/fail statements in `passed()` and `failed()`. Not doing so may result in undesirable behaviour; see the comments in the model for more details.
- If there are commonly used combinations introduced by adding the heuristic, combine and justify them at the top of the static model (see `quickUndetailed` and `forceSetup` as current examples).
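The first two requirements above (a separate class that inherits from `BaseHeuristicAnalyzer` and implements `analyze`) might look roughly like the sketch below. Everything in it is an assumption made for illustration: the base class and enum are simplified stand-ins defined locally, the real `analyze` signature and constructor arguments may differ, and `MissingSummaryAnalyzer` is a hypothetical heuristic that does not exist in Macaron.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from enum import Enum


class HeuristicResult(str, Enum):
    """Stand-in for Macaron's enum of the same name (member values are assumptions)."""

    PASS = "PASS"
    FAIL = "FAIL"


class BaseHeuristicAnalyzer(ABC):
    """Simplified stand-in for Macaron's base class; the real one may differ."""

    def __init__(self, name: str) -> None:
        self.name = name

    @abstractmethod
    def analyze(self, package_metadata: dict) -> tuple[HeuristicResult, dict]:
        """Return the heuristic outcome plus details specific to the analysis."""


class MissingSummaryAnalyzer(BaseHeuristicAnalyzer):
    """Hypothetical heuristic: fail when the package metadata has no summary."""

    def __init__(self) -> None:
        super().__init__(name="missing_summary_analyzer")

    def analyze(self, package_metadata: dict) -> tuple[HeuristicResult, dict]:
        summary = (package_metadata.get("info", {}).get("summary") or "").strip()
        if not summary:
            return HeuristicResult.FAIL, {"reason": "no summary provided"}
        return HeuristicResult.PASS, {"summary_length": len(summary)}
```

A real contribution would additionally register the heuristic name in `heuristics.py` and reference it in the ProbLog model, as the remaining bullets describe.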

**Contributing Code Pattern Rules**

When contributing more Semgrep rules for `pypi_sourcecode_analyzer.py` to use, the following requirements must be met:

- Semgrep `.yaml` rules are stored in `src/macaron/resources/pypi_malware_rules` and are named based on the category of code behaviors they detect.
- If the rule comes under one of the already defined categories, place it within that `.yaml` file; otherwise, create a new `.yaml` file using the category name.
- Each rule ID must be prefixed by the category followed by a single underscore ('_'), so for obfuscation rules in `obfuscation.yaml` each rule ID is prefixed with `obfuscation_`, followed by an ID which uses a hyphen ('-') as a separator.
- Tests must be written for each rule contributed. These are stored in `tests/malware_analyzer/pypi/test_pypi_sourcescode_analyzer.py`.
- These tests are written on a per-category basis, running each category individually. Each category must have a folder under `tests/malware_analyzer/pypi/resources/sourcecode_samples`.
- Within these folders, there must be sample code patterns for testing, and a file `expected_results.json` with the expected JSON output of the analyzer for that category.
- Each sample code pattern `.py` file must not have executable permissions and must include code that prevents it from being accidentally imported or run. The current files use this method:

```
"""
Running this code will not produce any malicious behavior, but code isolation measures are
in place for safety.
"""

import sys

# ensure no symbols are exported so this code cannot accidentally be used
__all__ = []
sys.exit()

def test_function():
    """
    All code to be tested will be defined inside this function, so it is all local to it. This is
    to isolate the code to be tested, as it exists to replicate the patterns present in malware
    samples.
    """
    sys.exit()
```

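The rule ID naming convention for Semgrep rules (category, a single underscore, then a hyphen-separated ID) lends itself to a mechanical check. The sketch below is one possible way to express it; `is_valid_rule_id` is a hypothetical helper, the example rule IDs are made up, and the restriction to lowercase letters and digits is an assumption that the README does not state.

```python
import re

# Lowercase alphanumeric words separated by single hyphens (assumed character set).
_SUFFIX = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")


def is_valid_rule_id(rule_id: str, category: str) -> bool:
    """Check that rule_id is '<category>_' followed by a hyphen-separated ID."""
    prefix = f"{category}_"
    if not rule_id.startswith(prefix):
        return False
    return _SUFFIX.fullmatch(rule_id[len(prefix):]) is not None


# Made-up IDs, used only to exercise the check.
for rule_id in ("obfuscation_inline-imports", "obfuscation_inline_imports"):
    print(rule_id, is_valid_rule_id(rule_id, "obfuscation"))
```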
### Confidence Score Motivation

The original seven heuristics that started this work were Empty Project Link, Unreachable Project Links, One Release, High Release Frequency, Unchanged Release, Closer Release Join Date, and Suspicious Setup. These heuristics (excluding those with a dependency) were run on 1167 packages from trusted organizations, with the following results:
Expand Down
3 changes: 3 additions & 0 deletions src/macaron/malware_analyzer/pypi_heuristics/heuristics.py
Expand Up @@ -37,6 +37,9 @@ class Heuristics(str, Enum):
#: Indicates that the package has an unusually large version number for a single release.
ANOMALOUS_VERSION = "anomalous_version"

#: Indicates that the package source code contains suspicious code patterns.
SUSPICIOUS_PATTERNS = "suspicious_patterns"


class HeuristicResult(str, Enum):
"""Result type indicating the outcome of a heuristic."""
Expand Down