Use variance-share rule for large-target warning #338
neuralsorcerer wants to merge 1 commit into facebookresearch:main
Conversation
Pull request overview
This PR updates Sample.adjust()’s “large target” warning to use a variance-share criterion (target’s contribution to the two-sample 1/n term) instead of fixed row-count and ratio cutoffs, making the warning depend on estimated inferential impact rather than absolute dataset size.
Changes:
- Added NEGLIGIBLE_TARGET_VARIANCE_FRACTION = 0.05 and _target_variance_component_fraction(sample_n, target_n) in balance/sample_class.py.
- Updated the warning trigger in Sample.adjust() to fire when the target's estimated 1/n variance share is ≤ 5%, and updated the warning message to include the estimated percentage.
- Expanded/updated tests to cover the helper function, the new boundary behavior, and the warning message content; updated the changelog entry accordingly.
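The variance-share criterion described above can be sketched as follows. This is a hypothetical reimplementation for illustration; the names mirror those mentioned in the PR, but the actual code in balance/sample_class.py may differ in detail:

```python
# Sketch of the variance-share rule (illustrative, not the PR's exact code).
NEGLIGIBLE_TARGET_VARIANCE_FRACTION = 0.05


def target_variance_component_fraction(sample_n: int, target_n: int) -> float:
    """Target's share of the two-sample 1/n variance term.

    The variance of a difference in means scales with 1/sample_n + 1/target_n;
    the target's contribution to that sum is
    (1/target_n) / (1/sample_n + 1/target_n),
    which simplifies to sample_n / (sample_n + target_n).
    """
    return (1.0 / target_n) / (1.0 / sample_n + 1.0 / target_n)


def should_warn_large_target(sample_n: int, target_n: int) -> bool:
    # Warn when the target's share of the 1/n variance term is at or below 5%,
    # i.e. when growing the target further barely reduces combined variance.
    frac = target_variance_component_fraction(sample_n, target_n)
    return frac <= NEGLIGIBLE_TARGET_VARIANCE_FRACTION
```

Because the share simplifies to sample_n / (sample_n + target_n), the 5% threshold corresponds to the target being at least 19 times the sample size.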
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| balance/sample_class.py | Introduces variance-share helper/constant and updates Sample.adjust() warning logic and message. |
| tests/test_sample.py | Adds unit tests for the helper and updates/adds warning-behavior tests for the new rule and message. |
| CHANGELOG.md | Documents the user-visible change to the large-target warning heuristic. |
The calculations are correct, but in practice I wouldn't swap one heuristic for another without some reference.
In practice, we are feeding n_sample + n_target observations into some classification model (logistic regression, random forest, XGBoost, etc.).
The question is which rules of thumb exist in the literature for measuring (and discussing) how much adding more 'negatives' (i.e., target rows) changes the accuracy of the model.
If you want this (or a similar) PR to land, could you first please do a quick deep-research literature review (could be Gemini-based, etc.) to understand what the best practice is for describing (and deciding) this?
The core question to ask is:
if we fit a classification model with positives and negatives, at what point does adding more negative examples stop mattering much?
Replaced the arbitrary large-target warning heuristic in Sample.adjust() with a principled variance-share rule:
- Added NEGLIGIBLE_TARGET_VARIANCE_FRACTION = 0.05.
- Added _target_variance_component_fraction(sample_n, target_n) to compute the target's share of the two-sample 1/n variance term.
- Replaced the fixed >100k and >=10x cutoffs.
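As a worked check of the boundary (my own arithmetic, not taken from the PR): since the target's share reduces to sample_n / (sample_n + target_n), the old >=10x ratio cutoff and the new 5% rule disagree in a band of sizes:

```python
# Boundary arithmetic for the 5% variance-share rule (illustrative only).
# The target's share of the 1/n variance term simplifies to:
def share(sample_n: int, target_n: int) -> float:
    return sample_n / (sample_n + target_n)

# Old >=10x ratio cutoff: share is ~9.1%, above 5%, so the new rule
# would NOT warn at a 10x target.
print(share(1_000, 10_000))  # 0.0909...

# The new rule's boundary: share hits exactly 5% at a 19x target.
print(share(1_000, 19_000))  # 0.05
```

So under the new rule the warning starts firing only once the target is roughly 19x the sample, regardless of absolute row counts.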