-
Notifications
You must be signed in to change notification settings - Fork 13
fix: Improve renamed package detection #575
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
fix: Improve renamed package detection #575
Conversation
Shabnam Behal seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
@behalshabnam Thanks very much for working on this. I have a little bit of a PR backlog, but I will try to get to this soon. |
@behalshabnam Sorry I haven't had a chance to review this. I am getting ready to travel, and I will get back early next week. I will make it a priority to review this then. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@behalshabnam Thanks for your patience, thanks a lot for working on this, and I hope my comments aren't too rambly. 😬
src/lib.rs
Outdated
/// Checks if two lists of authors have at least one author in common. | ||
fn have_common_author(authors1: &[String], authors2: &[&str]) -> bool { | ||
for author1 in authors1 { | ||
if authors2.contains(&author1.as_str()) { | ||
return true; | ||
} | ||
} | ||
false | ||
} | ||
|
||
/// Checks if two strings have high textual similarity. | ||
/// Returns true if they share a significant portion of words. | ||
fn high_similarity(s1: &str, s2: &str) -> bool { | ||
let s1_words: HashSet<&str> = s1.split_whitespace().collect(); | ||
let s2_words: HashSet<&str> = s2.split_whitespace().collect(); | ||
|
||
if s1_words.is_empty() || s2_words.is_empty() { | ||
return false; | ||
} | ||
|
||
let common_words = s1_words.intersection(&s2_words).count(); | ||
let min_words = s1_words.len().min(s2_words.len()); | ||
|
||
// Avoid precision loss by doing integer division first | ||
let similarity = if min_words > 0 { | ||
(common_words * SIMILARITY_SCALE) / min_words | ||
} else { | ||
0 | ||
}; | ||
|
||
similarity > SIMILARITY_THRESHOLD | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know you put a lot of work into this. Thank you for that. But let's please just do exact comparisons for now.
src/lib.rs
Outdated
}; | ||
|
||
// Check repository URL (if present in both) | ||
if let (Some(original_repo), Some(candidate_repo)) = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to check all fields of a package (except name
, of course)? https://docs.rs/cargo_metadata/0.19.2/cargo_metadata/struct.Package.html
Ideally, this would be done with a macro, so that resulting code would look something like:
check!(pkg, pkg_table, version);
check!(pkg, pkg_table, authors);
...
But writing that macro could be tricky. I would expect it to look something like:
macro_rules! check {
($pkg:expr, $pkg_table:expr, $field:ident) => {{
// ???
}};
}
But I'm not sure what goes in the // ???
.
Referring to the field in the pkg
should be easy, something like:
$pkg.$field
But referring to the field in the pkg_table
will be more tricky, maybe something like:
<_ as serde::Deserialize>::deserialize(
$pkg_table.get(stringify!($field)).unwrap().into_deserializer()
)
Those two values would be compared and if they differ, the macro should return false
.
I wrote the above assuming that the types of the values being compared does not have to be named.
If naming the types can't be avoided, then the code will have to be more verbose, something like:
check!(pkg, pkg_table, version, Version);
check!(pkg, pkg_table, authors, Vec<String>);
...
And the macro definition would have to change too, of course.
But I'm hopeful that having to name the types can be avoided.
Does the macro approach make sense to you?
Do you have experience writing macros, and would you be willing to try to tackle it?
Hi @smoelius, Thanks so much for the review and feedback! I really appreciate you taking the time to look through this. Regarding the The idea of checking all relevant The I'll start working on these updates based on your feedback. Thanks again! |
You could do that, but I am afraid it would be tedious. I would suggest to go the other way around and try to get the macro working first. If it seems easier, you could try to get it working for one particular field so that you know the field's type, and then abstract away from there. The deserialization stuff looks scarier than it is. It's just to turn a For example, if you call But what you really want is So, to do the conversion, you can convert the And then you can deserialize that BTW, there was a bug in what I wrote in #575 (comment). I should have written: $pkg_table.get(stringify!($field)).map(|value| {
<_ as serde::Deserialize>::deserialize(value.into_deserializer())
}).unwrap_or_default() The reason is: if I.e., we want to compare
You are welcome to ask an unlimited number of questions. I am genuinely curious to see how this turns out! |
Hi @smoelius, @behalshabnam here, using an alt account. I wanted to reach out because I’m currently unable to post any comments on any PR threads in the Trail of Bits repositories. I initially tried commenting on a separate PR in the Interestingly, I noticed that the CLA bot appears to have signed me out as well. To rule out a problem with my GitHub account, I tested commenting in a different organization’s repository—and that worked fine. If you're seeing this message, could you please check if there’s an issue on your end or let me know what might be going on? I’ve been running into this for about two weeks now. Thanks |
@behalshabnam @behalshabnam-alt trailofbits/vast#787 was deemed suspicious, and you've been banned from I do not have the authority or the permissions to unban you. I spoke with those who do, and they felt the decision should stand. I will take over this PR. I do want to thank you for your work on it, though. |
Thank you, @smoelius, for the clarification and your support. I did not expect that an unintentional mistake in I had assumed that any issues in the commit would be identified during the review process, and I would have the opportunity to address them accordingly—just as we did here. Nonetheless, I fully respect the team's decision and recognize the importance of upholding the organization's standards and practices.
Thank you for taking over this work. I hope this PR will eventually be merged and prove useful. |
This PR solves #441 by improving the package detection logic.
Package Detection Improvements
The package verification now uses multiple identifiers to establish package identity:
Repository URL Matching
Author Verification
Version Correlation
Description Similarity Analysis
Implementation Details
is_same_package_except_name
function for comprehensive package verificationSIMILARITY_SCALE
(100) andSIMILARITY_THRESHOLD
(30) constants for description matchinghave_common_author
: Checks author overlaphigh_similarity
: Performs description similarity analysisTesting
renamed_package_not_flagged
to verify behavioricu-rename
fixture to validate package detectionExample
A package like
icu_locid
that was previously incorrectly flagged as "not in repository" is now properly recognized when: