@@ -375,6 +375,30 @@ def plot_linear_model_parameters(self):
population_comparator.plot_linear_model_parameters()
population_comparator.score_table()

# %% [markdown]
#
# The above results show that naively fitting a logistic regression model on
# the observed data yields a model with very good ranking power (as measured
# by ROC-AUC on the target population), and the feature coefficients seem to
# be well aligned with the parameters of the data generating process.
#
# However, it does not fully recover the true data generating process. In
# particular, the intercept value is not correctly estimated. As a result,
# the log-loss is very poor because the probabilistic predictions are not
# well calibrated.
#
# In the following, we will explore several ways to correct this problem.
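
# %% [markdown]
#
# This behavior can be understood with a standard label-shift identity
# (sketched here for reference, using notation that is not in the notebook
# itself). If only the class prevalence differs between the training and
# target populations, Bayes' rule gives, for any input $x$:
#
# $$
# \log \frac{P_t(y=1|x)}{P_t(y=0|x)}
# = \log \frac{P_s(y=1|x)}{P_s(y=0|x)}
# + \log \frac{\pi_t}{1 - \pi_t}
# - \log \frac{\pi_s}{1 - \pi_s},
# $$
#
# where $\pi_s$ and $\pi_t$ are the positive-class prevalences in the
# training and target populations. Since the log-odds of a logistic
# regression model are linear in the features, this constant offset only
# affects the intercept: the coefficients, and hence the ranking and the
# ROC-AUC, are unchanged, while the predicted probabilities are shifted and
# therefore miscalibrated.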

# %% [markdown]
#
# ## Weight-based prevalence correction for logistic regression
#
# The first approach we will explore is to use class weights to correct for
# the prevalence shift in the training data.
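#
# One standard way to define such weights: with positive-class prevalence
# $\pi_s$ in the training set and $\pi_t$ in the target population, each
# class is weighted by
#
# $$w_1 = \frac{\pi_t}{\pi_s}, \qquad w_0 = \frac{1 - \pi_t}{1 - \pi_s},$$
#
# so that the re-weighted training set has exactly the target prevalence.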
#
# In scikit-learn, class weights are passed as a constructor parameter to the
# estimator (one weight per possible class). Here we pass, for each class,
# the ratio of its prevalence in the target population to its prevalence in
# the training set:

# %%
class_weight_for_prevalence_correction = {
@@ -392,10 +416,21 @@ def plot_linear_model_parameters(self):
population_comparator.plot_linear_model_parameters()
population_comparator.score_table()

# %% [markdown]
#
# We can see that passing class weights to the estimator has the desired
# effect of correcting for the prevalence shift in the training data: the
# ROC-AUC value stays roughly the same while the log-loss is significantly
# reduced and nearly matches the expected log-loss of the data generating
# process when evaluated on the target population.

# %% [markdown]
#
# Let's check that we can get exactly the same results using `sample_weight`
# in `fit` instead of `class_weight` in the constructor. We just repeat the
# class weight for each data point in the training set based on its class
# label. If all goes well, this should be strictly equivalent and should
# therefore converge to exactly the same model.

# %%
sample_weight_for_prevalence_correction = np.where(
@@ -416,6 +451,8 @@ def plot_linear_model_parameters(self):

# %% [markdown]
#
# ## Post-hoc prevalence correction for logistic regression by shifting the intercept
#
# From the above results, it seems that the uncorrected and weight-corrected
# linear models differ significantly only in the value of the intercept
# parameter.
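
# %% [markdown]
#
# The correction then amounts to shifting the fitted intercept by the
# difference in log-odds of the two prevalences. Here is a rough,
# self-contained sketch of the idea on synthetic data (the variable names and
# prevalence values are made up for illustration, they are not the notebook's
# own):

# %%
from copy import deepcopy

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1_000, 2))
# Synthetic labels drawn from a logistic data generating process.
y_demo = rng.binomial(1, 1 / (1 + np.exp(-(X_demo @ [1.0, -2.0] + 0.5))))

train_prev = y_demo.mean()  # prevalence the model was fitted on
target_prev = 0.05  # assumed prevalence in the target population

demo_model = LogisticRegression().fit(X_demo, y_demo)
shifted_model = deepcopy(demo_model)
# Shift the decision function by the difference in log-odds between the two
# prevalences; the coefficients, and hence the ranking, are untouched.
shifted_model.intercept_ = demo_model.intercept_ + (
    np.log(target_prev / (1 - target_prev))
    - np.log(train_prev / (1 - train_prev))
)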
@@ -455,6 +492,8 @@ def plot_linear_model_parameters(self):

# %% [markdown]
#
# ## Generic post-hoc prevalence correction
#
# Let's now consider a more generic post-hoc prevalence correction that does
# not require the base model to be a logistic regression model with an
# explicit `intercept_` parameter.
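
# %% [markdown]
#
# A standard way to express such a generic correction is to re-weight the
# predicted positive-class probabilities of any binary classifier by the
# prevalence ratios and renormalize. A minimal sketch (the function name and
# arguments are hypothetical, not part of the notebook):

# %%
def correct_for_prevalence_shift(proba_pos, train_prev, target_prev):
    """Map probabilities calibrated for ``train_prev`` to ``target_prev``.

    Each class's contribution is rescaled by the ratio of its prevalence in
    the target population to its prevalence in the training set, and the
    result is renormalized to sum to one.
    """
    w1 = target_prev / train_prev
    w0 = (1 - target_prev) / (1 - train_prev)
    return (proba_pos * w1) / (proba_pos * w1 + (1 - proba_pos) * w0)


# For a logistic regression model, this is equivalent to the intercept shift
# above, e.g. with train_prev=0.5 and target_prev=0.1:
correct_for_prevalence_shift(0.5, train_prev=0.5, target_prev=0.1)  # 0.1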