Commit 01ce8b4

Missing explanations for logistic regression results
1 parent 3f6cc68 commit 01ce8b4

1 file changed: +40 -1 lines changed

content/python_files/prevalence_correction.py

Lines changed: 40 additions & 1 deletion
@@ -375,6 +375,30 @@ def plot_linear_model_parameters(self):
 population_comparator.plot_linear_model_parameters()
 population_comparator.score_table()
 
+# %% [markdown]
+#
+# The above results show that naively fitting a logistic regression model on
+# the observed data yields a model with very good ranking power (as measured
+# by ROC-AUC on the target population), and the feature coefficients seem to
+# be well aligned with the parameters of the data generating process.
+#
+# However, it does not fully recover the true data generating process. In
+# particular, the intercept value is not correctly estimated. As a result, the
+# value of the log-loss is very poor because the probabilistic predictions are
+# not well calibrated.
+#
+# In the following, we will explore several ways to correct this problem.
+
+# %% [markdown]
+#
+# ## Weight-based prevalence correction for logistic regression
+#
+# The first approach we will explore is to use class weights to correct for the
+# prevalence shift in the training data.
+#
+# In scikit-learn, class weights are passed as a constructor parameter to the
+# estimator (one weight per possible class). Here we pass the ratio of each
+# class's prevalence in the target population to its training set prevalence:
 
 # %%
 class_weight_for_prevalence_correction = {
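
For readers skimming the diff, here is a minimal, self-contained sketch of the kind of class-weight correction described above. The prevalence values and the names `prevalence_target`, `prevalence_train`, `X_train` and `y_train` are illustrative assumptions, not values taken from the file:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumed prevalences: the positive class is much rarer in the target
# population than in the (case-enriched) training sample.
prevalence_target = {0: 0.99, 1: 0.01}
prevalence_train = {0: 0.50, 1: 0.50}

# One weight per class: ratio of target prevalence over training prevalence.
class_weight_for_prevalence_correction = {
    label: prevalence_target[label] / prevalence_train[label]
    for label in (0, 1)
}

# Synthetic stand-in for the observed (prevalence-shifted) training data.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1_000, 3))
y_train = (X_train[:, 0] + rng.normal(size=1_000) > 0).astype(int)

model = LogisticRegression(
    class_weight=class_weight_for_prevalence_correction
).fit(X_train, y_train)
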
@@ -392,10 +416,21 @@ def plot_linear_model_parameters(self):
 population_comparator.plot_linear_model_parameters()
 population_comparator.score_table()
 
+# %% [markdown]
+#
+# We can see that passing class weights to the estimator has the desired effect
+# of correcting the prevalence shift in the training data: the ROC-AUC values
+# stay roughly the same while the log-loss is significantly reduced and nearly
+# matches the expected log-loss of the data generating process when evaluated
+# on the target population.
+
 # %% [markdown]
 #
 # Let's check that we can get exactly the same results using `sample_weight` in
-# `fit` instead of `class_weight` in the constructor.
+# `fit` instead of `class_weight` in the constructor. We just repeat the
+# class weight for each data point in the training set based on its class
+# label. If all goes well, this should be strictly equivalent and should
+# therefore converge to exactly the same model.
 
 # %%
 sample_weight_for_prevalence_correction = np.where(
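
Continuing the illustrative sketch above (same assumed `class_weight_for_prevalence_correction`, `X_train` and `y_train`), the per-sample weights can be derived from the class weights roughly like this:

# Map each training sample's class label to the corresponding class weight.
sample_weight_for_prevalence_correction = np.where(
    y_train == 1,
    class_weight_for_prevalence_correction[1],
    class_weight_for_prevalence_correction[0],
)

# Passing per-sample weights to `fit` plays the same role as passing the class
# weights to the constructor, so both models should converge to the same
# coefficients and intercept.
model_sample_weighted = LogisticRegression().fit(
    X_train,
    y_train,
    sample_weight=sample_weight_for_prevalence_correction,
)
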
@@ -416,6 +451,8 @@ def plot_linear_model_parameters(self):
 
 # %% [markdown]
 #
+# ## Post-hoc prevalence correction for logistic regression by shifting the intercept
+#
 # From the above results, it seems that the uncorrected linear model and the
 # weight-corrected linear models only significantly differ by the value of the
 # intercept parameter.
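
For context, the usual intercept shift under a pure prior-probability (prevalence) shift looks roughly as follows; the prevalence values and the `naive_model` name are assumptions made for illustration:

import numpy as np

# Assumed prevalence of the positive class in the training sample and in the
# target population (illustrative values).
prevalence_train = 0.5
prevalence_target = 0.01

# Under a pure prior-probability shift, the Bayes-optimal log-odds on the two
# populations differ only by the difference of the class-prior log-odds, so the
# coefficients can be kept and only the intercept needs to be shifted.
intercept_shift = (
    np.log(prevalence_target / (1 - prevalence_target))
    - np.log(prevalence_train / (1 - prevalence_train))
)

# Applied in place to an (assumed) fitted logistic regression `naive_model`:
# naive_model.intercept_ = naive_model.intercept_ + intercept_shift
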
@@ -455,6 +492,8 @@ def plot_linear_model_parameters(self):
 
 # %% [markdown]
 #
+# ## Generic post-hoc prevalence correction
+#
 # Let's now consider a more generic post-hoc prevalence correction that does
 # not require the base model to be a logistic regression model with an explicit
 # `intercept_` parameter.
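
One common form of such a generic correction reweights the predicted positive-class probabilities directly, so it works with any classifier that exposes `predict_proba`. The sketch below uses assumed prevalence values and is not necessarily the exact implementation in the file:

import numpy as np


def correct_prevalence(proba_pos, prevalence_train, prevalence_target):
    """Reweight positive-class probabilities for a prior-probability shift."""
    w_pos = prevalence_target / prevalence_train
    w_neg = (1 - prevalence_target) / (1 - prevalence_train)
    weighted_pos = w_pos * proba_pos
    weighted_neg = w_neg * (1 - proba_pos)
    return weighted_pos / (weighted_pos + weighted_neg)


# Probabilities predicted by a model trained at 50% prevalence, corrected for a
# target population where the positive class has 1% prevalence.
proba_pos = np.array([0.2, 0.5, 0.9])
print(correct_prevalence(proba_pos, prevalence_train=0.5, prevalence_target=0.01))
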

0 commit comments
