Fix definition and implementation of divergence #137
Merged
In the original paper, Harrell and Kitson define code divergence imprecisely in terms of pairs (i, j) in a set. In our follow-up "Navigating P3" paper, we attempted to make this more precise and in so doing introduced an error. Specifically, we defined code divergence as drawing all pairs {i, j} from the Cartesian product H × H, which would have included both {i, j} and {j, i}.
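To make the difference concrete, here is one way to write the two readings side by side. This is a sketch under assumptions rather than a quotation from either paper: d(i, j) stands for whatever symmetric pairwise distance is used, with d(i, i) = 0, and the Cartesian-product form is assumed to be normalized by the number of pairs drawn.

```latex
% Intended definition: average over unordered pairs of distinct platforms.
CD(H) = \binom{|H|}{2}^{-1} \sum_{\{i, j\} \subseteq H,\; i \neq j} d(i, j)

% Erroneous reading: average over ordered pairs from the Cartesian product.
% The second equality holds for |H| >= 2, using d(i, j) = d(j, i) and d(i, i) = 0.
CD_{\times}(H) = \frac{1}{|H|^{2}} \sum_{(i, j) \in H \times H} d(i, j)
               = \frac{|H| - 1}{|H|} \, CD(H)
```

Under these assumptions the Cartesian-product form underestimates the intended value by a factor of (|H| - 1) / |H|. When |H| < 2 the intended definition has no pairs to average over, so it is 0/0 and therefore undefined.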
The computation of code divergence in Code Base Investigator was mostly correct, but the incorrect notation was copied into the documentation. The implementation also included a small bug, in that it evaluated the divergence for the empty set and single-item sets as 0 instead of NaN.
These issues were discovered while attempting to rewrite the divergence() routine, and so this commit adds some regression tests to ensure that divergence() produces the correct results.
Related issues
N/A
Proposed changes
Fix divergence() by returning NaN for the empty set or sets containing a single platform.

As an aside, to provide a bit more context: I was trying to reimplement divergence() in such a way that it would naturally return a 0 when computing the divergence between a platform and itself, and convinced myself that we could in fact iterate over all pairs in the Cartesian product because the distance metric is pairwise-symmetric. Unfortunately that doesn't work, because every time a platform is paired with itself, it adds 0 to the numerator and 1 to the denominator.

I decided that the correct thing to return from this function for now is NaN, because that's consistent with the math. If we wanted to return something else, we'd need to update the math, and I'm not sure how to justify that at this point. Returning NaN will encourage us to be more precise in our future work, because we will be forced to confront that we have defined code divergence as an average pairwise distance.
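The behaviour described above can be illustrated with a small standalone sketch. It is not the Code Base Investigator implementation: the function names, the dictionary-of-sets input, and the use of Jaccard distance as the pairwise metric are all assumptions made for the example.

```python
import math
from itertools import combinations, product


def jaccard_distance(a, b):
    """Jaccard distance between two sets (assumed metric for this sketch)."""
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def divergence_sketch(platforms):
    """Average distance over unordered pairs of distinct platforms.

    `platforms` is a hypothetical mapping from platform name to the set
    of lines that platform uses. With fewer than two platforms there are
    no pairs to average, so the result is NaN.
    """
    pairs = list(combinations(platforms.values(), 2))
    if not pairs:
        return math.nan
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


def cartesian_average(platforms):
    """Average over every ordered pair, including self-pairs.

    Shown only to illustrate the problem: each self-pair adds 0 to the
    numerator and 1 to the denominator.
    """
    pairs = list(product(platforms.values(), repeat=2))
    if not pairs:
        return math.nan
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


two = {"cpu": {1, 2, 3}, "gpu": {3, 4, 5}}
one = {"cpu": {1, 2, 3}}

print(divergence_sketch(two))   # one unordered pair at distance 0.8
print(cartesian_average(two))   # four ordered pairs, two of them self-pairs
print(divergence_sketch(one))   # no pairs: NaN
print(cartesian_average(one))   # a single self-pair: 0.0
```

With two platforms sharing one line out of five, the unordered-pair average is 0.8, while the Cartesian-product average is halved to 0.4 by the two self-pairs. For a single platform the Cartesian-product version does return 0, but only because it counts the self-pair, which is the same mechanism that distorts the multi-platform result.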