Fix over-rejecting p-values for penalized terms (Wood 2013 Tr statistic, fixes #163) #583

Open
RogerPR wants to merge 1 commit into dswah:main from RogerPR:pval_update
@RogerPR RogerPR commented May 18, 2026

Summary

Fixes the long-standing p-value miscalibration tracked in #163: the warning
"KNOWN BUG: p-values computed in this summary are likely much smaller than they should be" that currently appears on every gam.summary() call.

GAM._compute_p_value previously referenced the test statistic against a
chi-square (or F) distribution with df = rank(cov_term), i.e. the number
of basis functions for the term. For penalized fits with estimated
smoothing parameters this over-counts the effective degrees of freedom and
makes the null distribution far too tight, so noise features routinely
report p ≈ 0.

This PR replaces the implementation with the Tr statistic from Wood
(2013), "On p-values for smooth components of an extended generalized
additive model", Biometrika 100(1), 221–228, which is what
mgcv::summary.gam uses:

  • Build the rank-r pseudoinverse of the term's posterior covariance from
    the top-r eigencomponents, with r = round(edof_term).
  • Compute T_r = β_term^T V^{-r} β_term.
  • Reference T_r against χ²(r).
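
The three steps above can be sketched in a few lines of NumPy/SciPy. This is an illustrative sketch, not the code in this PR: the function name `tr_p_value`, its arguments, and the clipping of r to [1, number of coefficients] are assumptions made for the example, and it assumes the top-r eigenvalues of the covariance are strictly positive.

```python
import numpy as np
from scipy import stats


def tr_p_value(beta, cov, edof):
    """Illustrative T_r test for one term (hypothetical helper, not pyGAM's API)."""
    beta = np.asarray(beta, dtype=float)
    cov = np.asarray(cov, dtype=float)

    # Truncation rank: rounded effective df, clipped to a usable range (assumption).
    r = int(np.clip(round(float(edof)), 1, beta.size))

    # Eigendecomposition of the symmetric covariance; keep the r largest components.
    eigvals, eigvecs = np.linalg.eigh(cov)
    top = np.argsort(eigvals)[::-1][:r]
    lam, U = eigvals[top], eigvecs[:, top]

    # T_r = beta^T V^{-r} beta, where V^{-r} = U diag(1 / lam) U^T is the
    # rank-r pseudoinverse. Projecting first avoids forming the matrix.
    proj = U.T @ beta
    t_r = float(np.sum(proj**2 / lam))

    # Reference the statistic against chi-square with r degrees of freedom.
    p = float(stats.chi2.sf(t_r, df=r))
    return t_r, r, p


# Toy example: three coefficients, effective df of about 2.
t_r, r, p = tr_p_value([2.0, 1.0, 1.0], np.diag([4.0, 1.0, 0.25]), edof=2.2)
print(t_r, r, p)
```

In the toy example the smallest-variance direction is dropped, so only the two leading eigencomponents contribute to the statistic.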

edof_term is read from the existing statistics_["edof_per_coef"],
which is already computed during fit. When edof_per_coef is shorter
than coef_ (intercept term, or more splines than samples) we fall back
to the nominal coefficient count, preserving existing behavior in those
edge cases.
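
A rough sketch of that lookup and fallback follows. The helper name `term_edof` and its arguments are hypothetical, and summing the per-coefficient values over the term's indices is an assumption about how a per-term edof could be obtained, not a description of the actual code.

```python
import numpy as np


def term_edof(edof_per_coef, term_idxs, n_coefs):
    """Per-term effective df with a fallback (hypothetical helper)."""
    edof_per_coef = np.asarray(edof_per_coef, dtype=float)
    term_idxs = np.asarray(term_idxs, dtype=int)

    # Fallback: if edof_per_coef does not cover every coefficient,
    # use the nominal number of coefficients in the term.
    if edof_per_coef.size < n_coefs:
        return float(term_idxs.size)

    # Otherwise sum the per-coefficient effective df over the term (assumption).
    return float(edof_per_coef[term_idxs].sum())


full = term_edof([0.9, 0.8, 0.3, 0.1], term_idxs=[1, 2, 3], n_coefs=4)
short = term_edof([0.9, 0.8], term_idxs=[1, 2, 3], n_coefs=4)
print(full, short)
```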

Empirical validation

Pre-existing regression test (test_pvalue_rejects_useless_feature):
the "useless" np.arange feature on the wage dataset now reports
p ≈ 0.84 (was p ≈ 5 × 10⁻²⁹⁷ on the buggy implementation that originally
prompted #163).

Real signals stay highly significant: on wage, s(year) + s(age) + f(education) reports p-values < 10⁻¹² for every real term.

Calibration under H₀ — new pygam/tests/test_pvalue.py runs 100
seeded simulations per scenario and checks that the false-positive rate
sits in [1%, 10%] around the nominal 5%:

| scenario | FPR / Power |
| --- | --- |
| univariate noise, lam=0 | ~5% |
| univariate noise, lam=0.6 | ~5% |
| univariate noise, n_splines=25 | ~5% |
| two-term noise, both terms | ~5% |
| spline + factor on noise | ~5% |
| near-collinear predictors | ~5% |
| strong sine signal (power) | ≥95% |
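
For readers unfamiliar with this style of test, here is a minimal sketch of a seeded false-positive-rate check. It is not the contents of pygam/tests/test_pvalue.py: `false_positive_rate` and `noise_z_test` are hypothetical names, and a simple z-test on Gaussian noise stands in for fitting a GAM so the example runs with NumPy alone. With 100 trials at a true rate of 5%, the binomial standard error is about 2.2 percentage points, which is why a tolerance band such as [1%, 10%] is used rather than an exact match.

```python
import math

import numpy as np


def false_positive_rate(p_value_fn, n_trials=100, alpha=0.05, seed=0):
    """Fraction of seeded null simulations whose p-value falls below alpha."""
    rejections = 0
    for trial in range(n_trials):
        rng = np.random.default_rng(seed + trial)  # one reproducible stream per trial
        rejections += p_value_fn(rng) < alpha
    return rejections / n_trials


def noise_z_test(rng, n=200):
    """Stand-in null test: two-sided z-test that the mean of N(0, 1) noise is zero."""
    z = rng.standard_normal(n).mean() * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2.0))


fpr = false_positive_rate(noise_z_test)
print(f"empirical FPR over 100 trials: {fpr:.2f}")
```

In a real calibration test, `noise_z_test` would be replaced by a function that fits the model to pure-noise data and returns the term's p-value.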

Cross-check against mgcv (50 trials × 3 penalties on y = 0.3·sin(x) + N(0,1)
with n=200, x ∈ [0,10]; for each trial R's sp was optimized so
sum(model$edf) matched pyGAM's statistics_["edof"], then
summary.gam()$s.table[,"p-value"] was read off):

| lam | mean \|p_pygam − p_R\| | decision agreement | Pearson r |
| --- | --- | --- | --- |
| 0.6 | 0.053 | 84% | 0.974 |
| 1.0 | 0.073 | 92% | 0.904 |
| 3.0 | 0.092 | 78% | 0.893 |

(harness not included in the PR — happy to share on request)
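
As a rough guide to how the three summary columns can be computed from paired p-values, here is a small sketch. It is not the harness mentioned above: `compare_p_values` is a hypothetical name, the input values are made up for illustration, and "decision agreement" is assumed to mean the share of trials where both implementations land on the same side of α = 0.05.

```python
import numpy as np


def compare_p_values(p_pygam, p_r, alpha=0.05):
    """Summarize agreement between two sets of paired p-values (hypothetical helper)."""
    p_pygam = np.asarray(p_pygam, dtype=float)
    p_r = np.asarray(p_r, dtype=float)
    return {
        # Mean absolute difference between paired p-values.
        "mean_abs_diff": float(np.mean(np.abs(p_pygam - p_r))),
        # Share of trials with the same reject / fail-to-reject decision at alpha.
        "decision_agreement": float(np.mean((p_pygam < alpha) == (p_r < alpha))),
        # Linear correlation between the two sets of p-values.
        "pearson_r": float(np.corrcoef(p_pygam, p_r)[0, 1]),
    }


# Made-up illustrative inputs, not results from this PR.
summary = compare_p_values([0.01, 0.20, 0.04, 0.60], [0.02, 0.30, 0.06, 0.50])
print(summary)
```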

Residual gap to mgcv comes from differences in basis/penalty
parameterization and from mgcv's frequentist-covariance refinement; both
are out of scope for this PR.

Test plan

  • pytest pygam/ — 170 passed, 1 skipped (previously 169 passed, 1 failed, 1 skipped)
  • pytest pygam/tests/test_pvalue.py -v — 8 passed
  • ruff check pygam/pygam.py pygam/tests/test_pvalue.py — clean
  • ruff format --check pygam/pygam.py pygam/tests/test_pvalue.py — clean
  • Spot-checked gam.summary() on the wage dataset — output is sensible

  Replaces _compute_p_value with the Tr test from Wood (2013, Biometrika
  100(1)), as used by mgcv::summary.gam. The rank of the term's covariance
  pseudoinverse is taken as round(edof_term) instead of the matrix rank;
  the statistic is referenced against chi-square with that many df.

  Resolves the over-rejection described in dswah#163: previously, terms with
  estimated smoothing parameters could report p ≈ 0 even when the
  underlying effect was pure noise. With the corrected df, FPR on
  realistic multi-term fits is back near the nominal 5% level, and a
  50-trial comparison against mgcv shows Pearson correlations of
  0.89-0.97 with mean |p_pygam - p_R| of 0.05-0.09 across lam in
  {0.6, 1.0, 3.0}.

  Adds pygam/tests/test_pvalue.py with FPR/power calibration tests.