
Use variance-share rule for large-target warning #338

Closed
neuralsorcerer wants to merge 1 commit into facebookresearch:main from neuralsorcerer:var

Conversation

@neuralsorcerer
Collaborator

Replaced the arbitrary large-target warning heuristic in Sample.adjust() with a principled variance-share rule:

  • Added NEGLIGIBLE_TARGET_VARIANCE_FRACTION = 0.05.
  • Added _target_variance_component_fraction(sample_n, target_n) to compute the target’s share of the two-sample 1/n variance term.
  • Updated the warning trigger logic to fire when the target's variance contribution is negligible (<=5%) instead of the fixed >100k row-count and >=10x ratio cutoffs.
  • Updated the warning text to report the estimated target variance share percentage.

Copilot AI review requested due to automatic review settings February 16, 2026 11:02
@meta-cla meta-cla bot added the cla signed label Feb 16, 2026
@neuralsorcerer neuralsorcerer added this to the balance 0.17.0 milestone Feb 16, 2026

Copilot AI left a comment


Pull request overview

This PR updates Sample.adjust()’s “large target” warning to use a variance-share criterion (target’s contribution to the two-sample 1/n term) instead of fixed row-count and ratio cutoffs, making the warning depend on estimated inferential impact rather than absolute dataset size.

Changes:

  • Added NEGLIGIBLE_TARGET_VARIANCE_FRACTION = 0.05 and _target_variance_component_fraction(sample_n, target_n) in balance/sample_class.py.
  • Updated the warning trigger in Sample.adjust() to fire when the target’s estimated 1/n variance share is ≤ 5%, and updated the warning message to include the estimated percentage.
  • Expanded/updated tests to cover the helper function, the new boundary behavior, and the warning message content; updated the changelog entry accordingly.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

  • balance/sample_class.py: Introduces the variance-share helper/constant and updates the Sample.adjust() warning logic and message.
  • tests/test_sample.py: Adds unit tests for the helper and updates/adds warning-behavior tests for the new rule and message.
  • CHANGELOG.md: Documents the user-visible change to the large-target warning heuristic.

@talgalili (Contributor) left a comment


The calculations are correct, but in practice I wouldn't change one heuristic for another without some reference.
In practice, we are using n_sample + n_target observations for some classification model (logistic regression, random forest, XGBoost, etc.).
The question is which rules of thumb exist in the literature for measuring (and discussing) how much adding more of the 'negatives' (i.e., target) changes the accuracy of the model.
If you want this (or a similar) PR to land, could you first please do a quick deep-research literature review (it could be Gemini-based, etc.) to understand what the best practice is for describing (or deciding on) this?

The core question to ask is:
if we fit a classification model with positives and negatives, at what point does adding more negative examples stop mattering much?
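One way to frame this question numerically: under the common approximation (used, e.g., in rare-events/case-control analyses) that the estimator's variance scales like 1/n_pos + 1/n_neg, adding negatives has rapidly diminishing returns once n_neg far exceeds n_pos. A small illustrative script, with arbitrary example sizes:

```python
# Illustrative only: assumes the two-term approximation that variance
# scales like 1/n_pos + 1/n_neg; real model behavior may differ.
n_pos = 1_000  # positives (e.g., the survey sample) held fixed

for n_neg in (1_000, 10_000, 100_000, 1_000_000):
    v = 1 / n_pos + 1 / n_neg
    # As n_neg grows, v plateaus at 1/n_pos = 0.001: more negatives
    # stop helping once their 1/n term is negligible.
    print(f"n_neg={n_neg:>9,}: variance term = {v:.6f}")
```

Under this approximation, going from 1,000 to 10,000 negatives halves the variance term, while going from 100,000 to 1,000,000 changes it by under 1%, which is the kind of diminishing-returns threshold the literature review would need to pin down.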
