Use variance-share rule for large-target warning #338
neuralsorcerer wants to merge 1 commit into facebookresearch:main
Conversation
Pull request overview
This PR updates Sample.adjust()’s “large target” warning to use a variance-share criterion (target’s contribution to the two-sample 1/n term) instead of fixed row-count and ratio cutoffs, making the warning depend on estimated inferential impact rather than absolute dataset size.
Changes:
- Added NEGLIGIBLE_TARGET_VARIANCE_FRACTION = 0.05 and _target_variance_component_fraction(sample_n, target_n) in balance/sample_class.py.
- Updated the warning trigger in Sample.adjust() to fire when the target's estimated 1/n variance share is ≤ 5%, and updated the warning message to include the estimated percentage.
- Expanded/updated tests to cover the helper function, the new boundary behavior, and the warning message content; updated the changelog entry accordingly.
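The variance-share criterion described above can be sketched as follows. This is a hypothetical reimplementation for illustration; the names mirror those mentioned in the PR, but the actual code in balance/sample_class.py may differ in detail:

```python
# Sketch of the variance-share rule (illustrative, not the PR's exact code).
NEGLIGIBLE_TARGET_VARIANCE_FRACTION = 0.05


def target_variance_component_fraction(sample_n: int, target_n: int) -> float:
    """Target's share of the two-sample 1/n variance term.

    The variance of a difference in means scales with 1/sample_n + 1/target_n;
    the target's contribution to that sum is
    (1/target_n) / (1/sample_n + 1/target_n),
    which simplifies to sample_n / (sample_n + target_n).
    """
    return (1.0 / target_n) / (1.0 / sample_n + 1.0 / target_n)


def should_warn_large_target(sample_n: int, target_n: int) -> bool:
    # Warn when the target's share of the 1/n variance term is at or below 5%,
    # i.e. when growing the target further barely reduces combined variance.
    frac = target_variance_component_fraction(sample_n, target_n)
    return frac <= NEGLIGIBLE_TARGET_VARIANCE_FRACTION
```

Because the share simplifies to sample_n / (sample_n + target_n), the 5% threshold corresponds to the target being at least 19 times the sample size.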
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| balance/sample_class.py | Introduces variance-share helper/constant and updates Sample.adjust() warning logic and message. |
| tests/test_sample.py | Adds unit tests for the helper and updates/adds warning-behavior tests for the new rule and message. |
| CHANGELOG.md | Documents the user-visible change to the large-target warning heuristic. |
The calculations are correct, but in practice I wouldn't swap one heuristic for another without some reference.
In practice, we are feeding n_sample + n_target observations into some classification model (logistic regression, random forest, XGBoost, etc.).
The question is which rules of thumb exist in the literature for measuring (and discussing) how much adding more 'negatives' (i.e., target rows) changes the accuracy of the model.
If you want this (or a similar) PR to land, could you first please do a quick deep-research literature review (could be Gemini-based, etc.) to understand what the best practice is for describing (and deciding) this?
The core question to ask is:
if we fit a classification model with positives and negatives, at what point does adding more negative examples stop mattering much?
Replaced the arbitrary large-target warning heuristic in Sample.adjust() with a principled variance-share rule:
- Added NEGLIGIBLE_TARGET_VARIANCE_FRACTION = 0.05.
- Added _target_variance_component_fraction(sample_n, target_n) to compute the target's share of the two-sample 1/n variance term.
- Replaced the fixed >100k and >=10x cutoffs.
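As a worked check of the boundary (my own arithmetic, not taken from the PR): since the target's share reduces to sample_n / (sample_n + target_n), the old >=10x ratio cutoff and the new 5% rule disagree in a band of sizes:

```python
# Boundary arithmetic for the 5% variance-share rule (illustrative only).
# The target's share of the 1/n variance term simplifies to:
def share(sample_n: int, target_n: int) -> float:
    return sample_n / (sample_n + target_n)

# Old >=10x ratio cutoff: share is ~9.1%, above 5%, so the new rule
# would NOT warn at a 10x target.
print(share(1_000, 10_000))  # 0.0909...

# The new rule's boundary: share hits exactly 5% at a 19x target.
print(share(1_000, 19_000))  # 0.05
```

So under the new rule the warning starts firing only once the target is roughly 19x the sample, regardless of absolute row counts.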