Fix over-rejecting p-values for penalized terms (Wood 2013 Tr statistic, fixes #163)#583
Open
RogerPR wants to merge 1 commit into
Open
Fix over-rejecting p-values for penalized terms (Wood 2013 Tr statistic, fixes #163)#583RogerPR wants to merge 1 commit into
RogerPR wants to merge 1 commit into
Conversation
Replaces _compute_p_value with the Tr test from Wood (2013, Biometrika 100(1)), as used by mgcv::summary.gam. The rank of the term's covariance pseudoinverse is taken as round(edof_term) instead of the matrix rank; the statistic is referenced against chi-square with that many df. Resolves the over-rejection described in dswah#163: previously, terms with estimated smoothing parameters could report p ≈ 0 even when the underlying effect was pure noise. With the corrected df, FPR on realistic multi-term fits is back near the nominal 5% level, and a 50-trial comparison against mgcv shows Pearson correlations of 0.89-0.97 with mean |p_pygam - p_R| of 0.05-0.09 across lam in {0.6, 1.0, 3.0}. Adds pygam/tests/test_pvalue.py with FPR/power calibration tests.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes the long-standing p-value miscalibration tracked in #163 — the warning
KNOWN BUG: p-values computed in this summary are likely much smaller than they should bethat currently appears on everygam.summary()call.GAM._compute_p_valuepreviously referenced the test statistic against achi-square (or F) distribution with
df = rank(cov_term), i.e. the numberof basis functions for the term. For penalized fits with estimated
smoothing parameters this over-counts the effective degrees of freedom and
makes the null distribution far too tight, so noise features routinely
report p ≈ 0.
This PR replaces the implementation with the Tr statistic from Wood
(2013), "On p-values for smooth components of an extended generalized
additive model" (Biometrika 100(1), 221–228), which is what
mgcv::summary.gamuses:the top-r eigencomponents, with
r = round(edof_term).T_r = β_term^T V^{-r} β_term.T_ragainstχ²(r).edof_termis read from the existingstatistics_["edof_per_coef"],which is already computed during
fit. Whenedof_per_coefis shorterthan
coef_(intercept term, or more splines than samples) we fall backto the nominal coefficient count, preserving existing behavior in those
edge cases.
Empirical validation
Pre-existing regression test (
test_pvalue_rejects_useless_feature):the "useless"
np.arangefeature on thewagedataset now reportsp ≈ 0.84 (was p ≈ 5 × 10⁻²⁹⁷ on the buggy implementation that originally
prompted #163).
Real signals stay highly significant: on
wage,s(year) + s(age) + f(education)reports p-values < 10⁻¹² for every real term.Calibration under H₀ — new
pygam/tests/test_pvalue.pyruns 100seeded simulations per scenario and checks that the false-positive rate
sits in
[1%, 10%]around the nominal 5%:lam=0lam=0.6n_splines=25Cross-check against mgcv (50 trials × 3 penalties on
y = 0.3·sin(x) + N(0,1)with n=200, x ∈ [0,10]; for each trial R's
spwas optimized sosum(model$edf)matched pyGAM'sstatistics_["edof"], thensummary.gam()$s.table[,"p-value"]was read off):(harness not included in the PR — happy to share on request)
Residual gap to mgcv comes from differences in basis/penalty
parameterization and from mgcv's frequentist-covariance refinement; both
are out of scope for this PR.
Test plan
pytest pygam/— 170 passed, 1 skipped (was 169/1 fail/1 skip before)pytest pygam/tests/test_pvalue.py -v— 8 passedruff check pygam/pygam.py pygam/tests/test_pvalue.py— cleanruff format --check pygam/pygam.py pygam/tests/test_pvalue.py— cleangam.summary()on thewagedataset — output is sensible