feat(heuristics): add SimilarProjectAnalyzer to detect structural similarity across packages from same maintainer #1089

Open: wants to merge 1 commit into base: main

AmineRaouane
Member

Summary

This PR adds a new heuristic analyzer called SimilarProjectAnalyzer. It checks whether a PyPI package has the same file/folder structure as other packages maintained by the same user. This helps identify potentially malicious packages that replicate the structure of existing ones.

Description of changes

  • Created a new analyzer: SimilarProjectAnalyzer.
  • The analyzer fetches the list of maintainers of the target package and retrieves other packages published by those maintainers.
  • For each package, it computes a normalized structure hash from its sdist tarball and compares it to the structure hash of the target package.
  • If any match is found, the heuristic fails, flagging potential structural duplication.
  • Added this analyzer to the heuristics.py registry.
  • Modified detect_malicious_metadata_check.py to include and utilize the new heuristic.
  • Added test cases to validate the functionality and edge cases of the analyzer.
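
The structure comparison described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names (`structure_hash`, `has_structural_match`, `build_sdist`) are hypothetical, and the normalization shown (dropping the top-level `<name>-<version>/` directory and sorting the remaining paths) is an assumption about what "normalized" means here.

```python
import hashlib
import io
import tarfile


def structure_hash(sdist_bytes: bytes) -> str:
    """Hash the file/folder layout of an sdist tarball (hypothetical helper)."""
    paths = set()
    with tarfile.open(fileobj=io.BytesIO(sdist_bytes), mode="r:*") as archive:
        for member in archive.getmembers():
            # Assumed normalization: drop the leading "<name>-<version>/"
            # component so differently named packages can compare equal.
            parts = member.name.strip("/").split("/")[1:]
            if parts:
                paths.add("/".join(parts))
    digest = hashlib.sha256()
    # Sort so the hash does not depend on archive member order.
    for path in sorted(paths):
        digest.update(path.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def has_structural_match(target: bytes, others: list[bytes]) -> bool:
    """Return True if any other sdist has the same layout hash as the target."""
    target_hash = structure_hash(target)
    return any(structure_hash(other) == target_hash for other in others)


def build_sdist(top_dir: str, files: list[str]) -> bytes:
    """Build a tiny in-memory .tar.gz with empty files, for demonstration only."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name in files:
            info = tarfile.TarInfo(name=f"{top_dir}/{name}")
            info.size = 0
            archive.addfile(info)
    return buffer.getvalue()


if __name__ == "__main__":
    target = build_sdist("alpha-1.0", ["setup.py", "src/__init__.py"])
    sibling = build_sdist("beta-2.3", ["src/__init__.py", "setup.py"])
    print(has_structural_match(target, [sibling]))
```

In this sketch a match would map to the heuristic failing. Only paths are hashed, not file contents, so two packages with identical layouts but different code still match.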

Related issues

None

  • I have reviewed the contribution guide.
  • My PR title and commits follow the Conventional Commits convention.
  • My commits include the "Signed-off-by" line.
  • I have signed my commits following the instructions provided by GitHub. Note that we run GitHub's commit verification tool to check the commit signatures. A green verified label should appear next to all of your commits on GitHub.
  • I have updated the relevant documentation, if applicable.
  • I have tested my changes and verified they work as expected.

…ilarity across packages from same maintainer

Signed-off-by: Amine <[email protected]>
oracle-contributor-agreement bot added the "OCA Verified" label (All contributors have signed the Oracle Contributor Agreement.) on May 20, 2025
list[str]: A list of maintainers.
"""
url = f"https://pypi.org/project/{package_name}/"
response = requests.get(url, timeout=10)
Member

Could you have a look at src/macaron/util.py and see if any of these functions may be imported and used instead of directly using the requests package?

}
return HeuristicResult.PASS, {}

def get_maintainers(self, package_name: str) -> list[str]:
Member

I believe this functionality is offered by PyPIRegistry.get_maintainers_of_package.

Member Author

Correct. That is what I used first, but @behnazh-w told me that get_maintainers_of_package no longer works because PyPI blocks it, so I rewrote it here.

Member

If this code can obtain the maintainer info, rather than adding a new function, please update the PyPIRegistry.get_maintainers_of_package function so other heuristics can benefit from it.

Member Author

Should I change it here, or create a new PR for it?

Member

Either works, although a separate PR is preferable.

list[str]: A list of package names.
"""
url = f"https://pypi.org/user/{username}/"
response = requests.get(url, timeout=10)
Member

Same here: please consider using the functions in util.py.

similar_projects_set.discard(package_name)
return list(similar_projects_set)

def fetch_sdist_url(self, package_name: str, version: str | None = None) -> str:
Member

I wonder if it would be possible to create a PyPIPackageJsonAsset object so you may use its function get_sourcecode_url for this?
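
For context, the lookup in question could look roughly like the sketch below. It does not show how PyPIPackageJsonAsset.get_sourcecode_url is implemented; it only assumes the response layout of the PyPI JSON API (`https://pypi.org/pypi/<name>/json`), where `urls` lists the files of the latest release, `releases` maps versions to file lists, and each file entry carries `packagetype` and `url`. The function name `find_sdist_url` is hypothetical.

```python
from typing import Any, Optional


def find_sdist_url(package_json: dict[str, Any], version: Optional[str] = None) -> Optional[str]:
    """Pick the sdist download URL out of an already-fetched PyPI JSON payload.

    Hypothetical helper: assumes the "urls" / "releases" layout described above.
    """
    if version is None:
        # No version requested: use the files of the latest release.
        files = package_json.get("urls", [])
    else:
        files = package_json.get("releases", {}).get(version, [])
    for file_info in files:
        if file_info.get("packagetype") == "sdist":
            return file_info.get("url")
    # No source distribution was published for this release.
    return None
```

Working on the parsed payload rather than issuing the request inside the function keeps the network call in one shared place, which is the direction the review comments point toward.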

return None

try:
response = requests.get(sdist_url, stream=True, timeout=10)
Member

Same here: please consider using the functions in util.py.

@@ -381,6 +383,10 @@ def run_check(self, ctx: AnalyzeContext) -> CheckResultData:
failed({Heuristics.CLOSER_RELEASE_JOIN_DATE.value}),
forceSetup.

% Package released that is similar to other packages maintained by the same maintainer.
Member

Could you walk me through the rationale of why we should combine SIMILAR_PROJECTS failing with the forceSetup and quickUndetailed rules and why these rules together are a malicious indicator with HIGH confidence?
