Fix definition and implementation of divergence #137

Pennycook · 2024-12-09T14:35:09Z

In the original paper, Harrell and Kitson define code divergence imprecisely in terms of pairs (i, j) in a set. In our follow-up "Navigating P3" paper, we attempted to make this more precise and in so doing introduced an error. Specifically, we defined code divergence as drawing all pairs {i, j} from the cartesian product H x H, which would have included {i, j} and {j, i}.

The computation of code divergence in Code Base Investigator was mostly correct, but the incorrect notation was copied into the documentation. The implementation also included a small bug, in that it evaluated the divergence for the empty set and single-item sets as 0 instead of NaN.

These issues were discovered while attempting to rewrite the divergence() routine, and so this commit adds some regression tests to ensure that divergence() produces the correct results.

Related issues

N/A

Proposed changes

Fixes divergence() by returning NaN for the empty set or sets containing a single platform.
Fix the notation in the documentation, by replacing the cartesian product with k-subset notation.
Add some simple regression tests for divergence calculations.

As an aside, to provide a bit more context: I was trying to reimplement divergence() in such a way that it would naturally return a 0 when computing the divergence between a platform and itself, and convinced myself that we could in fact iterate over all pairs in the cartesian product because the distance metric is pairwise-symmetric. Unfortunately that doesn't work, because every time a platform is paired with itself, it adds 0 to the numerator and 1 to the denominator.

I decided that the correct thing to return from this function for now is NaN, because that's consistent with the math. If we wanted to return something else, we'd need to update the math, and I'm not sure how to justify that at this point. Returning NaN will encourage us to be more precise in our future work, because we will be forced to confront that we have defined code divergence as an average pairwise distance.

In the original paper, Harrell and Kitson define code divergence imprecisely in terms of pairs (i, j) in a set. In our follow-up "Navigating P3" paper, we attempted to make this more precise and in so doing introduced an error. Specifically, we defined code divergence as drawing all pairs {i, j} from the cartesian product H x H, which would have included {i, j} and {j, i}. The computation of code divergence in Code Base Investigator was mostly correct, but the incorrect notation was copied into the documentation. The implementation also included a small bug, in that it evaluated the divergence for the empty set and single-item sets as 0 instead of nan. These issues were discovered while attempting to rewrite the divergence() routine, and so this commit adds some regression tests to ensure that divergence() produces the correct results. Signed-off-by: John Pennycook <[email protected]>

laserkelvin

Just left a suggestion

codebasin/report.py

Pennycook added bug Something isn't working documentation Improvements or additions to documentation labels Dec 9, 2024

Pennycook added this to the 2.0.0 milestone Dec 9, 2024

Pennycook requested a review from laserkelvin December 9, 2024 14:35

laserkelvin approved these changes Dec 11, 2024

View reviewed changes

codebasin/report.py Show resolved Hide resolved

Pennycook merged commit 73dd406 into intel:main Dec 11, 2024
3 checks passed

Pennycook deleted the code-divergence branch December 11, 2024 16:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix definition and implementation of divergence #137

Fix definition and implementation of divergence #137

Pennycook commented Dec 9, 2024

laserkelvin left a comment

Fix definition and implementation of divergence #137

Fix definition and implementation of divergence #137

Conversation

Pennycook commented Dec 9, 2024

Related issues

Proposed changes

laserkelvin left a comment

Choose a reason for hiding this comment