feature_selection: validate binary outcome in filter_LR / filter_D (#349)#898
Open
jbbqqf wants to merge 1 commit intouber:masterfrom
Open
feature_selection: validate binary outcome in filter_LR / filter_D (#349)#898jbbqqf wants to merge 1 commit intouber:masterfrom
jbbqqf wants to merge 1 commit intouber:masterfrom
Conversation
…ber#349) The LR and bin-based divergence (KL/ED/Chi) filters silently mis-handle non-binary outcomes: ``filter_LR`` propagates a ``statsmodels`` Logit error from deep in the call stack, and ``filter_D`` produces nonsense scores because ``_GetNodeSummary`` only counts ``y == 0`` and ``y == 1`` rows. Add an explicit ``_check_binary_outcome`` validator called from both public entry points, and document the limitation in the module docstring. ``filter_F`` (OLS) is unaffected — it tolerates continuous outcomes and a regression test guards against accidental over-validation. Co-Authored-By: Claude Code <noreply@anthropic.com>
|
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Proposed changes
The likelihood-ratio filter (
filter_LR) and bin-based divergence filters(
filter_Dwith methodKL/ED/Chi) only support binary outcomes,but the public API does not validate this. Passing a multi-class label set
(e.g.
{0, 1, 2, 3}) silently produces meaningless feature-importance scoresor surfaces an inscrutable
statsmodelserror from deep in the call stack.This PR adds a
_check_binary_outcomevalidator at bothfilter_LRandfilter_Dentry points so the user gets a clearValueErrornaming theoffending column, the limitation, and the supported alternative
(
filter_Ffor continuous outcomes). The module docstring is also updatedto document the constraint.
Fixes #349 — Clarify the limitations (specifically the label) of the current
implementation of filter methods
Types of changes
Context
The OP cited
causalml/feature_selection/filters.py:210(thesm.Logitline,now around L199 on master) and noted:
The bin-based path is even more silent:
_GetNodeSummary(filters.py:302) doesresults_series = data.groupby([experiment_group_column, y_name]).size()andthen
results[ti].get(1, 0)/results[ti].get(0, 0)— every value otherthan 0 or 1 is silently dropped from the per-bin probability calculation.
filter_Fuses OLS and is documented as compatible with continuous outcomes,so it is not validated. A new test guards against accidentally tightening
that.
Changes
causalml/feature_selection/filters.pynote::block documentingthe binary-outcome constraint and pointing to
filter_Ffor continuousoutcomes.
_check_binary_outcome(y, y_name)that dropsNaNs, computes the unique label set, and raises
ValueErrorif anyvalue other than
{0, 1}is present.filter_LRandfilter_Dcall the validator before any work.tests/test_feature_selection.pytest_filter_rejects_non_binary_outcomecovering
LR,KL,ED,Chi— each fails onmaster(noValueErroris raised) and passes on this branch.
test_filter_f_accepts_continuous_outcometo lock in thatfilter_Fis unaffected.
Reproduce BEFORE/AFTER yourself (copy-paste)
What I ran locally
pytest tests/test_feature_selection.py -v→8/8 passed on this branch (was 3/3 before; added 5 parametrized + lock-in tests).
pytest tests/test_feature_selection.py::test_filter_rejects_non_binary_outcome -von
origin/master→ 4/4 FAIL (noValueErrorraised; silent or wrong-error path).black --fast causalml/feature_selection/filters.py tests/test_feature_selection.py→ clean.Edge cases tested
ValueError("...binary...")test_filter_rejects_non_binary_outcome[LR]ValueError("...binary...")test_filter_rejects_non_binary_outcome[KL/ED/Chi]test_filter_f_accepts_continuous_outcometest_filter_lr(existing, unchanged)test_filter_kl(existing, unchanged)Risk / blast radius
outcomes are unaffected.
filter_Ffor the continuous-outcome use case, so the user-experience improvement
is two-way: the limitation is documented (issue Clarify the limitations (specifically the label) of the current implementation of
filter methods#349's primary ask)and the failure mode becomes loud rather than silent.
Release note
PR drafted with assistance from Claude Code. The change was reviewed manually
against
causalml/feature_selection/filters.py(thesm.Logitand_GetNodeSummarypaths cited in the issue) and the existingtests/test_feature_selection.pypatterns. The reproducer block above wasused during development; it is the same one a reviewer can paste verbatim.