Skip to content

Fix definition and implementation of divergence #137

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Dec 11, 2024

Conversation

Pennycook
Copy link
Contributor

In the original paper, Harrell and Kitson define code divergence imprecisely in terms of pairs (i, j) in a set. In our follow-up "Navigating P3" paper, we attempted to make this more precise and in so doing introduced an error. Specifically, we defined code divergence as drawing all pairs {i, j} from the cartesian product H x H, which would have included {i, j} and {j, i}.

The computation of code divergence in Code Base Investigator was mostly correct, but the incorrect notation was copied into the documentation. The implementation also included a small bug, in that it evaluated the divergence for the empty set and single-item sets as 0 instead of NaN.

These issues were discovered while attempting to rewrite the divergence() routine, and so this commit adds some regression tests to ensure that divergence() produces the correct results.

Related issues

N/A

Proposed changes

  • Fixes divergence() by returning NaN for the empty set or sets containing a single platform.
  • Fix the notation in the documentation, by replacing the cartesian product with k-subset notation.
  • Add some simple regression tests for divergence calculations.

As an aside, to provide a bit more context: I was trying to reimplement divergence() in such a way that it would naturally return a 0 when computing the divergence between a platform and itself, and convinced myself that we could in fact iterate over all pairs in the cartesian product because the distance metric is pairwise-symmetric. Unfortunately that doesn't work, because every time a platform is paired with itself, it adds 0 to the numerator and 1 to the denominator.

I decided that the correct thing to return from this function for now is NaN, because that's consistent with the math. If we wanted to return something else, we'd need to update the math, and I'm not sure how to justify that at this point. Returning NaN will encourage us to be more precise in our future work, because we will be forced to confront that we have defined code divergence as an average pairwise distance.

In the original paper, Harrell and Kitson define code divergence imprecisely in
terms of pairs (i, j) in a set. In our follow-up "Navigating P3" paper, we
attempted to make this more precise and in so doing introduced an error.
Specifically, we defined code divergence as drawing all pairs {i, j} from the
cartesian product H x H, which would have included {i, j} and {j, i}.

The computation of code divergence in Code Base Investigator was mostly
correct, but the incorrect notation was copied into the documentation.
The implementation also included a small bug, in that it evaluated the
divergence for the empty set and single-item sets as 0 instead of nan.

These issues were discovered while attempting to rewrite the divergence()
routine, and so this commit adds some regression tests to ensure that
divergence() produces the correct results.

Signed-off-by: John Pennycook <[email protected]>
@Pennycook Pennycook added bug Something isn't working documentation Improvements or additions to documentation labels Dec 9, 2024
@Pennycook Pennycook added this to the 2.0.0 milestone Dec 9, 2024
@Pennycook Pennycook requested a review from laserkelvin December 9, 2024 14:35
Copy link
Contributor

@laserkelvin laserkelvin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just left a suggestion

@Pennycook Pennycook merged commit 73dd406 into intel:main Dec 11, 2024
3 checks passed
@Pennycook Pennycook deleted the code-divergence branch December 11, 2024 16:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants