fix: Improve renamed package detection #575

Open
wants to merge 4 commits into
base: master

Conversation

behalshabnam

This PR solves #441 by improving the package detection logic.

Package Detection Improvements

The package verification now uses multiple identifiers to establish package identity:

  1. Repository URL Matching

    • Primary identifier for package identity
    • Handles cases where package names change but repository remains the same
  2. Author Verification

    • Checks for common authors between versions
    • Helps verify package lineage across renames
  3. Version Correlation

    • Matches exact versions to ensure continuity
    • Prevents false positives from similarly named packages
  4. Description Similarity Analysis

    • Uses text similarity matching for package descriptions
    • Threshold-based comparison (30%) to accommodate minor description updates
    • Helps confirm package identity when other metadata changes

Implementation Details

  • Added is_same_package_except_name function for comprehensive package verification
  • Introduced SIMILARITY_SCALE (100) and SIMILARITY_THRESHOLD (30) constants for description matching
  • Implemented helper functions:
    • have_common_author: Checks author overlap
    • high_similarity: Performs description similarity analysis

Testing

  • Added integration test renamed_package_not_flagged to verify behavior
  • Test uses the icu-rename fixture to validate package detection
  • Confirms that renamed packages are not incorrectly flagged as unmaintained

Example

A package like icu_locid that was previously incorrectly flagged as "not in repository" is now properly recognized when:

  • The repository URL matches
  • Authors overlap with the original package
  • Package descriptions are sufficiently similar

@CLAassistant

CLAassistant commented Apr 9, 2025

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Shabnam Behal seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.

@behalshabnam
Author

Hello @smoelius,

Here is the PR for #441. Please review and let me know if anything needs to be changed.

Thank You!

@smoelius
Collaborator

@behalshabnam Thanks very much for working on this. I have a little bit of a PR backlog, but I will try to get to this soon.

@smoelius
Collaborator

@behalshabnam Sorry I haven't had a chance to review this. I am getting ready to travel, and I will get back early next week. I will make it a priority to review this then.

Collaborator

@smoelius smoelius left a comment

@behalshabnam Thanks for your patience, thanks a lot for working on this, and I hope my comments aren't too rambly. 😬

src/lib.rs Outdated
Comment on lines 972 to 1003
/// Checks if two lists of authors have at least one author in common.
fn have_common_author(authors1: &[String], authors2: &[&str]) -> bool {
    for author1 in authors1 {
        if authors2.contains(&author1.as_str()) {
            return true;
        }
    }
    false
}

/// Checks if two strings have high textual similarity.
/// Returns true if they share a significant portion of words.
fn high_similarity(s1: &str, s2: &str) -> bool {
    let s1_words: HashSet<&str> = s1.split_whitespace().collect();
    let s2_words: HashSet<&str> = s2.split_whitespace().collect();

    if s1_words.is_empty() || s2_words.is_empty() {
        return false;
    }

    let common_words = s1_words.intersection(&s2_words).count();
    let min_words = s1_words.len().min(s2_words.len());

    // Multiply before dividing so integer division does not truncate the ratio to zero
    let similarity = if min_words > 0 {
        (common_words * SIMILARITY_SCALE) / min_words
    } else {
        0
    };

    similarity > SIMILARITY_THRESHOLD
}
Collaborator

I know you put a lot of work into this. Thank you for that. But let's please just do exact comparisons for now.
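
To illustrate the difference between the two approaches, here is a minimal, standard-library-only sketch. `high_similarity` follows the snippet under review, with the constant values taken from the PR description; the `usize` type of the constants and the helper `same_description` are assumptions made for illustration, not code from the PR.

```rust
use std::collections::HashSet;

// Values from the PR description; the `usize` type is an assumption.
const SIMILARITY_SCALE: usize = 100;
const SIMILARITY_THRESHOLD: usize = 30;

/// Word-overlap check, following the snippet under review.
fn high_similarity(s1: &str, s2: &str) -> bool {
    let s1_words: HashSet<&str> = s1.split_whitespace().collect();
    let s2_words: HashSet<&str> = s2.split_whitespace().collect();
    if s1_words.is_empty() || s2_words.is_empty() {
        return false;
    }
    let common_words = s1_words.intersection(&s2_words).count();
    let min_words = s1_words.len().min(s2_words.len());
    // Multiply first so the integer division keeps two digits of precision.
    (common_words * SIMILARITY_SCALE) / min_words > SIMILARITY_THRESHOLD
}

/// Exact comparison of two optional descriptions (hypothetical helper).
fn same_description(d1: Option<&str>, d2: Option<&str>) -> bool {
    d1 == d2
}
```

With these definitions, "a library for parsing dates" and "a library for drawing charts" share three of five words (a score of 60), so the word-overlap check accepts the pair, while the exact comparison rejects it.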

src/lib.rs Outdated
};

// Check repository URL (if present in both)
if let (Some(original_repo), Some(candidate_repo)) = (
Collaborator

Would it be possible to check all fields of a package (except name, of course)? https://docs.rs/cargo_metadata/0.19.2/cargo_metadata/struct.Package.html

Ideally, this would be done with a macro, so that resulting code would look something like:

check!(pkg, pkg_table, version);
check!(pkg, pkg_table, authors);
...

But writing that macro could be tricky. I would expect it to look something like:

macro_rules! check {
    ($pkg:expr, $pkg_table:expr, $field:ident) => {{
        // ???
    }};
}

But I'm not sure what goes in the // ???.

Referring to the field in the pkg should be easy, something like:

$pkg.$field

But referring to the field in the pkg_table will be more tricky, maybe something like:

<_ as serde::Deserialize>::deserialize(
    $pkg_table.get(stringify!($field)).unwrap().into_deserializer()
)

Those two values would be compared and if they differ, the macro should return false.


I wrote the above assuming that the types of the values being compared do not have to be named.

If naming the types can't be avoided, then the code will have to be more verbose, something like:

check!(pkg, pkg_table, version, Version);
check!(pkg, pkg_table, authors, Vec<String>);
...

And the macro definition would have to change too, of course.

But I'm hopeful that having to name the types can be avoided.
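
A minimal sketch of how such a macro might avoid naming the types, using only the standard library. `Value`, `FromValue`, `Package`, `field_matches`, and `same_except_name` are hypothetical stand-ins (for `toml::Value`, `serde::Deserialize`, and `cargo_metadata::Package`), not the real APIs. Routing the comparison through a generic helper lets Rust infer the target type from `$pkg.$field`. Unlike the `.unwrap()` in the comment above, a missing or ill-typed entry is treated as a mismatch here, which is an assumption.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for `toml::Value`, reduced to two variants.
#[derive(Clone, Debug, PartialEq)]
enum Value {
    String(String),
    Array(Vec<Value>),
}

// Hypothetical stand-in for `serde::Deserialize`.
trait FromValue: Sized {
    fn from_value(value: &Value) -> Option<Self>;
}

impl FromValue for String {
    fn from_value(value: &Value) -> Option<Self> {
        match value {
            Value::String(s) => Some(s.clone()),
            _ => None,
        }
    }
}

impl<T: FromValue> FromValue for Vec<T> {
    fn from_value(value: &Value) -> Option<Self> {
        match value {
            // Fails as a whole if any element fails to convert.
            Value::Array(items) => items.iter().map(T::from_value).collect(),
            _ => None,
        }
    }
}

// Hypothetical stand-in for `cargo_metadata::Package`, with two fields only.
struct Package {
    version: String,
    authors: Vec<String>,
}

// `T` is inferred from `actual`, so callers never name the field's type.
fn field_matches<T: FromValue + PartialEq>(actual: &T, value: Option<&Value>) -> bool {
    value.and_then(T::from_value).as_ref() == Some(actual)
}

// Returns `false` from the enclosing function if the field differs.
macro_rules! check {
    ($pkg:expr, $pkg_table:expr, $field:ident) => {{
        if !field_matches(&$pkg.$field, $pkg_table.get(stringify!($field))) {
            return false;
        }
    }};
}

fn same_except_name(pkg: &Package, pkg_table: &HashMap<String, Value>) -> bool {
    check!(pkg, pkg_table, version);
    check!(pkg, pkg_table, authors);
    true
}
```

Each additional field would then cost one `check!` line, provided its type implements the conversion trait and `PartialEq`.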


Does the macro approach make sense to you?

Do you have experience writing macros, and would you be willing to try to tackle it?

@behalshabnam
Author

Hi @smoelius,

Thanks so much for the review and feedback! I really appreciate you taking the time to look through this.

Regarding the high_similarity check, my initial plan was to catch cases with minor description tweaks, but I agree that a stricter comparison across fields is a simpler approach.

The idea of checking all relevant Package fields makes sense for confirming package identity more reliably.

The check! macro approach looks like a good way to keep the code DRY. I understand the concept, but I'm not very familiar with writing macros that involve dynamic deserialization like that. Would you be open to me first implementing the exact checks for the relevant fields directly (version, authors, repository, description, license, etc.)? Once we're sure the core logic is correct, we could potentially look at refactoring it into a macro together? The dynamic deserialization part (<_ as serde::Deserialize>::deserialize(...)) seems like the main challenge, but I'll give it my best shot once I get there.

I'll start working on these updates based on your feedback.

Thanks again!

@smoelius
Collaborator

Would you be open to me first implementing the exact checks for the relevant fields directly (version, authors, repository, description, license, etc.)?

You could do that, but I am afraid it would be tedious.

I would suggest going the other way around and trying to get the macro working first. If it seems easier, you could try to get it working for one particular field so that you know the field's type, and then abstract away from there.

The deserialization stuff looks scarier than it is. It's just to turn a toml::Value into the type that you actually want.

For example, if you call pkg_table.get("authors"), you will get back Option<&Value>: https://docs.rs/toml/latest/toml/map/struct.Map.html#method.get

But what you really want is Option<Vec<String>>, because Vec<String> is the type of the authors field in cargo_metadata::Package.

So, to do the conversion, you can convert the toml::Value into a serde::Deserializer using into_deserializer: https://docs.rs/toml/latest/toml/enum.Value.html#method.into_deserializer

And then you can deserialize that Deserializer using <Vec<String> as serde::Deserialize>::deserialize (but I am hoping the Vec<String> can be omitted and replaced with _ and that Rust can infer the type).


BTW, there was a bug in what I wrote in #575 (comment). I should have written:

$pkg_table.get(stringify!($field)).map(|value| {
    <_ as serde::Deserialize>::deserialize(value.into_deserializer())
}).unwrap_or_default()

The reason is: if $field is not present in pkg_table, we don't want to panic or reject; we want to use the default value for that field.

I.e., we want to compare $pkg.$field to the default value for that field.
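
One detail to watch when applying this: `serde::Deserialize::deserialize` returns a `Result`, so the `.map(...)` above produces an `Option<Result<_, _>>`, and `Result` does not implement `Default`. A possible way to handle the nesting, sketched below with only the standard library, is `Option::transpose`. Here `FromStr` and a string map are hypothetical stand-ins for `serde::Deserialize` and the TOML table, and `get_or_default` is an illustrative name.

```rust
use std::collections::HashMap;
use std::str::FromStr;

/// Looks up `key` and parses its value. A missing key yields `T::default()`,
/// while a present but malformed value is reported as an error.
fn get_or_default<T>(table: &HashMap<String, String>, key: &str) -> Result<T, T::Err>
where
    T: FromStr + Default,
{
    // `map` leaves the fallible conversion nested inside the `Option`.
    let parsed: Option<Result<T, T::Err>> = table.get(key).map(|raw| raw.parse::<T>());
    // `transpose` turns it into `Result<Option<T>, _>`, so `?` can surface
    // the error before the default is substituted for a missing key.
    Ok(parsed.transpose()?.unwrap_or_default())
}
```

Whether a malformed value should be an error, a mismatch, or silently replaced by the default is a design decision the thread leaves open; this sketch reports it as an error.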


Once we're sure the core logic is correct, we could potentially look at refactoring it into a macro together?

You are welcome to ask an unlimited number of questions. I am genuinely curious to see how this turns out!

@behalshabnam-alt

Hi @smoelius, @behalshabnam here, using an alt account.

I wanted to reach out because I’m currently unable to post any comments on any PR threads in the Trail of Bits repositories. I initially tried commenting on a separate PR in the cargo-unmaintained repo, but it didn’t go through. I then tested other PRs across different Trail of Bits repos and faced the same issue.

Interestingly, I noticed that the CLA bot appears to have signed me out as well. To rule out a problem with my GitHub account, I tested commenting in a different organization’s repository—and that worked fine.

If you're seeing this message, could you please check if there’s an issue on your end or let me know what might be going on? I’ve been running into this for about two weeks now.

Thanks

@smoelius
Collaborator

smoelius commented May 9, 2025

@behalshabnam @behalshabnam-alt trailofbits/vast#787 was deemed suspicious, and you've been banned from trailofbits repositories as a result.

I do not have the authority or the permissions to unban you. I spoke with those who do, and they felt the decision should stand.

I will take over this PR. I do want to thank you for your work on it, though.

@behalshabnam-alt

Thank you, @smoelius, for the clarification and your support. I did not expect that an unintentional mistake in vast would have such an impact across the entire Trail of Bits organization. That was my first contribution to vast, and I was relatively new to that field.

I had assumed that any issues in the commit would be identified during the review process, and I would have the opportunity to address them accordingly—just as we did here.

Nonetheless, I fully respect the team's decision and recognize the importance of upholding the organization's standards and practices.

I will take over this PR

Thank you for taking over this work. I hope this PR will eventually be merged and prove useful.

4 participants