Quantile Regression Chapter #859

Draft: wants to merge 46 commits into base: main

Commits (46)
2d348cc  first draft of quantile regression chapter (lona-k, Feb 10, 2025)
f3e30aa  add chapter (lona-k, Feb 10, 2025)
a6836f7  fix chapter title (lona-k, Feb 10, 2025)
51bcca0  fix references (lona-k, Feb 10, 2025)
f5982fa  change plot view (lona-k, Feb 14, 2025)
8cbb910  change packages (lona-k, Feb 16, 2025)
dabeddf  ... (be-marc, Feb 17, 2025)
e81322b  ... (be-marc, Feb 20, 2025)
27b5d43  applied code review changes (lona-k, Mar 18, 2025)
62ce9be  ... (lona-k, Mar 18, 2025)
7739a51  Update quantile_regression.qmd (berndbischl, Apr 10, 2025)
642b5ca  apply suggested changes (lona-k, May 13, 2025)
61c84d7  fix mlr3 loading (lona-k, May 13, 2025)
e47b108  ... (lona-k, May 13, 2025)
cba1139  add quantile regression survey reference (lona-k, May 13, 2025)
c406f46  change chapter order (lona-k, May 13, 2025)
652f437  fix mlr3 installation (lona-k, May 13, 2025)
f52281f  ... (lona-k, May 13, 2025)
ba78287  Merge branch 'main' into quantile_regression_chapter (lona-k, May 13, 2025)
4623b49  ... (lona-k, May 14, 2025)
cd98472  ... (lona-k, May 14, 2025)
f457175  ... (be-marc, May 19, 2025)
9195731  add learner weights (be-marc, May 21, 2025)
d954914  ... (be-marc, May 21, 2025)
971f649  ... (be-marc, May 21, 2025)
108af3a  ... (be-marc, May 21, 2025)
f7984a0  ... (be-marc, May 21, 2025)
7e01d3a  Merge branch 'weights' into quantile_regression_chapter (be-marc, May 21, 2025)
81b61ca  ... (be-marc, May 21, 2025)
02bbf3b  ... (be-marc, May 21, 2025)
9148578  ... (be-marc, May 23, 2025)
1c28b4a  ... (be-marc, May 23, 2025)
865bab9  ... (be-marc, May 23, 2025)
6000026  Merge branch 'main' into weights (be-marc, May 23, 2025)
d52e8ed  ... (be-marc, May 26, 2025)
2ed60bc  ... (be-marc, May 26, 2025)
1d1b409  ... (be-marc, May 26, 2025)
b19464a  ... (be-marc, May 26, 2025)
83d4cb1  ... (be-marc, May 26, 2025)
fa66375  ... (be-marc, May 26, 2025)
0dc1d3e  Merge branch 'weights' into quantile_regression_chapter (be-marc, May 26, 2025)
43ea33d  update solutions (lona-k, May 28, 2025)
275b5f9  grammar fixes (lona-k, May 28, 2025)
acff42a  fix plot (lona-k, May 28, 2025)
c7234f0  change code chunk output options (lona-k, Jun 2, 2025)
81b9aa0  update solutions (lona-k, Jun 2, 2025)
1 change: 1 addition & 0 deletions DESCRIPTION
Original file line number Diff line number Diff line change
@@ -45,6 +45,7 @@ Imports:
rprojroot,
stringi
Remotes:
mlr-org/mlr3,
mlr-org/mlr3extralearners,
mlr-org/mlr3batchmark,
mlr-org/mlr3proba,
5 changes: 3 additions & 2 deletions book/_quarto.yml
@@ -43,8 +43,9 @@ book:
- chapters/chapter11/large-scale_benchmarking.qmd
- chapters/chapter12/model_interpretation.qmd
- chapters/chapter13/beyond_regression_and_classification.qmd
- chapters/chapter14/algorithmic_fairness.qmd
- chapters/chapter15/predsets_valid_inttune.qmd
- chapters/chapter14/quantile_regression.qmd
- chapters/chapter15/algorithmic_fairness.qmd
- chapters/chapter16/predsets_valid_inttune.qmd
- chapters/references.qmd
appendices:
- chapters/appendices/solutions.qmd # online only
11 changes: 11 additions & 0 deletions book/book.bib
@@ -1436,3 +1436,14 @@ @book{hutter2019automated
publisher = {Springer},
keywords = {}
}
@article{yu_quantile_2003,
author = {Yu, Keming and Lu, Zudi and Stander, Julian},
doi = {10.1111/1467-9884.00363},
journal = {Journal of the Royal Statistical Society: Series D (The Statistician)},
number = {3},
pages = {331--350},
title = {Quantile regression: applications and current research areas},
volume = {52},
year = {2003}
}
Binary file added book/chapters/appendices/Rplots.pdf
Binary file not shown.
88 changes: 61 additions & 27 deletions book/chapters/appendices/solutions.qmd
@@ -2018,13 +2018,45 @@ benchmark(design)$aggregate(meas)[, .(learner_id, clust.silhouette)]

We can see that we get the silhouette closest to `1` with `K=2`, so we might use this value for future experiments.

## Solutions to @sec-quantile-regression

1. Manually `$train()` a GBM regression model from `r ref_pkg("mlr3extralearners")` on the california_housing task to predict the 95th percentile of the target variable. Make sure that you split the data and only use the training data for fitting the learner.

We start by loading the packages, creating the task, and making the train-test split.

```{r solutions-108}
library(mlr3)
library(mlr3extralearners)

task = tsk("california_housing")
splits = partition(task)
```

In the next step, we initialize the learner as `"regr.gbm"` and explicitly set the `quantiles` parameter to 0.95. For the learner to be able to predict this quantile, we also need to set the `predict_type` to `"quantiles"`. Lastly, we train the learner using only the training data.

```{r solutions-109}
lrn_gbm = lrn("regr.gbm", predict_type = "quantiles", quantiles = 0.95)
lrn_gbm$train(task, row_ids = splits$train)
```

2. Use the test data to evaluate your learner with the pinball loss.

First, we use the learner from the previous exercise to predict on the test data. Then we calculate the pinball loss on these predictions.

```{r solutions-110}
prds_gbm = lrn_gbm$predict(task, row_ids = splits$test)
score_gbm = prds_gbm$score(msr("regr.pinball", alpha = 0.95, id = "q0.95"))
score_gbm
```
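For intuition, the pinball loss behind `msr("regr.pinball")` can be sketched in a few lines of base R. This is a simplified standalone version, not the mlr3 implementation, and it also shows why minimizing this loss targets a quantile: the constant that minimizes it over a sample is approximately the empirical 95% quantile.

```r
# Pinball (quantile) loss for quantile level alpha: undershooting the
# truth is weighted by alpha, overshooting by (1 - alpha).
pinball = function(q, y, alpha) {
  err = y - q
  mean(ifelse(err >= 0, alpha * err, (alpha - 1) * err))
}

set.seed(42)
y = rnorm(10000)

# minimize the 0.95 pinball loss over a grid of candidate constants
grid = seq(-3, 3, by = 0.001)
q_hat = grid[which.min(sapply(grid, function(q) pinball(q, y, 0.95)))]

c(minimizer = q_hat, empirical_q95 = unname(quantile(y, 0.95)))
```

Both values land near `qnorm(0.95)` (about 1.64), which is why training with `quantiles = 0.95` estimates the conditional 95th percentile rather than the conditional mean.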


## Solutions to @sec-fairness

1. Train a model of your choice on `tsk("adult_train")` and test it on `tsk("adult_test")`, use any measure of your choice to evaluate your predictions. Assume our goal is to achieve parity in false omission rates across the protected 'sex' attribute. Construct a fairness metric that encodes this and evaluate your model. To get a deeper understanding, look at the `r ref("groupwise_metrics")` function to obtain performance in each group.

First, we load the data and take a look at it.

```{r solutions-108}
```{r solutions-111}
library(mlr3)
library(mlr3fairness)
set.seed(8)
@@ -2036,7 +2068,7 @@ tsk_adult_train

We can now train a simple model, e.g., a decision tree and evaluate for accuracy.

```{r solutions-109}
```{r solutions-112}
learner = lrn("classif.rpart")
learner$train(tsk_adult_train)
prediction = learner$predict(tsk_adult_test)
@@ -2046,30 +2078,30 @@ prediction$score()
The *false omission rate parity* metric is available via the key `"fairness.fomr"`.
Note that evaluating our prediction now requires that we also provide the task.

```{r solutions-110}
```{r solutions-113}
msr_1 = msr("fairness.fomr")
prediction$score(msr_1, tsk_adult_test)
```
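To make explicit what this measure computes, here is a base-R sketch with made-up data (a simplified illustration, not the `mlr3fairness` code): the false omission rate is the share of actual positives among the predicted negatives, and the parity metric compares it across the protected groups.

```r
# False omission rate: P(truth = positive | prediction = negative),
# i.e. FN / (FN + TN) within the predicted-negative cases.
fomr = function(truth_pos, pred_neg) {
  sum(truth_pos & pred_neg) / sum(pred_neg)
}

set.seed(1)
truth_pos = runif(200) < 0.3           # hypothetical true labels
pred_neg  = runif(200) < 0.6           # hypothetical negative predictions
group     = rep(c("Female", "Male"), each = 100)

per_group = sapply(split(seq_along(group), group), function(i) {
  fomr(truth_pos[i], pred_neg[i])
})
per_group
abs(diff(per_group))  # discrepancy reported by the parity metric
```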

In addition, we can look at false omission rates in each group.
The `groupwise_metrics` function creates a metric for each group specified in the `pta` column role:

```{r solutions-111}
```{r solutions-114}
tsk_adult_test$col_roles$pta
```

We can then use this metric to evaluate our model again.
This gives us the false omission rates for male and female individuals separately.

```{r solutions-112}
```{r solutions-115}
msr_2 = groupwise_metrics(base_measure = msr("classif.fomr"), task = tsk_adult_test)
prediction$score(msr_2, tsk_adult_test)
```

2. Improve your model by employing pipelines that use pre- or post-processing methods for fairness. Evaluate your model along the two metrics and visualize the resulting metrics. Compare the different models using an appropriate visualization.

First, we again construct the learners from above.
```{r solutions-113}
```{r solutions-116}
library(mlr3pipelines)
lrn_1 = po("reweighing_wts") %>>% lrn("classif.rpart")
lrn_2 = po("learner_cv", lrn("classif.rpart")) %>>%
@@ -2078,7 +2110,7 @@

Then we run the benchmark again. Note that we use three-fold CV this time for comparison.

```{r solutions-114}
```{r solutions-117}
learners = list(learner, lrn_1, lrn_2)
design = benchmark_grid(tsk_adult_train, learners, rsmp("cv", folds = 3L))
bmr = benchmark(design)
@@ -2087,7 +2119,7 @@ bmr$aggregate(msrs(c("classif.acc", "fairness.fomr")))

We can now again visualize the result.

```{r solutions-115}
```{r solutions-118}
library(ggplot2)
fairness_accuracy_tradeoff(bmr, msr("fairness.fomr")) +
scale_color_viridis_d("Learner") +
@@ -2105,12 +2137,12 @@ We can notice two main results:

This can be achieved by adding "race" to the `"pta"` col_role.

```{r solutions-116}
```{r solutions-119}
tsk_adult_train$set_col_roles("race", add_to = "pta")
tsk_adult_train
```

```{r solutions-117}
```{r solutions-120}
tsk_adult_test$set_col_roles("race", add_to = "pta")
prediction$score(msr_1, tsk_adult_test)
```
@@ -2120,12 +2152,12 @@ Note that the metric by default computes the maximum discrepancy between all me

If we now compute the `groupwise_metrics`, we will get a metric for each intersection of the groups.

```{r solutions-118}
```{r solutions-121}
msr_3 = groupwise_metrics(msr("classif.fomr"), tsk_adult_train)
unname(sapply(msr_3, function(x) x$id))
```

```{r solutions-119}
```{r solutions-122}
prediction$score(msr_3, tsk_adult_test)
```

@@ -2143,7 +2175,7 @@ We'll go through them one by one to deepen our understanding:

We can investigate this further by looking at actual counts:

```{r solutions-120}
```{r solutions-123}
table(tsk_adult_test$data(cols = c("race", "sex", "target")))
```

@@ -2155,17 +2187,17 @@ We'll go through them one by one to deepen our understanding:

First, we create a subset containing only the observations with `sex` equal to `"Female"` and `race` equal to `"Black"` or `"White"`.

```{r solutions-121}
```{r solutions-124}
adult_subset = tsk_adult_test$clone()
df = adult_subset$data()
rows = seq_len(nrow(df))[df$race %in% c("Black", "White") & df$sex %in% c("Female")]
adult_subset$filter(rows)
adult_subset$set_col_roles("race", add_to = "pta")
```

And evaluate our measure again:

```{r solutions-122}
prediction$score(msr_3, adult_subset)
```{r solutions-125}
#| eval: false
prediction$score(msr_3, task = adult_subset)
```

We can see that among women, there is an even bigger discrepancy than among men.
@@ -2181,7 +2215,7 @@

We start by loading the packages and creating the task.

```{r solutions-123}
```{r solutions-126}
library(mlr3)
library(mlr3extralearners)
library(mlr3pipelines)
@@ -2192,14 +2226,14 @@ tsk_pima

Below, we see that the task has five features with missing values.

```{r solutions-124}
```{r solutions-127}
tsk_pima$missings()
```

Next, we create the LightGBM classifier, but don't specify the validation data yet.
We handle the missing values using a simple median imputation.

```{r solutions-125}
```{r solutions-128}
lrn_lgbm = lrn("classif.lightgbm",
num_iterations = 1000,
early_stopping_rounds = 10,
@@ -2216,15 +2250,15 @@ The call below sets the `$validate` field of the LightGBM pipeop to `"predefined"`.
Recall that only the graphlearner itself can specify *how* the validation data is generated.
The individual pipeops can either use it (`"predefined"`) or not (`NULL`).

```{r solutions-126}
```{r solutions-129}
set_validate(glrn, validate = 0.3, ids = "classif.lightgbm")
glrn$validate
glrn$graph$pipeops$classif.lightgbm$validate
```

Finally, we train the learner and inspect the validation scores and internally tuned parameters.

```{r solutions-127}
```{r solutions-130}
glrn$train(tsk_pima)

glrn$internal_tuned_values
@@ -2240,7 +2274,7 @@ glrn$internal_valid_scores
We start by setting the number of boosting iterations to an internal tune token, where the maximum number of boosting iterations is 1000 and the aggregation function is the maximum.
Note that the input to the aggregation function is a list of integer values (the early stopped values for the different resampling iterations), so we need to `unlist()` it first before taking the maximum.

```{r solutions-128}
```{r solutions-131}
library(mlr3tuning)

glrn$param_set$set_values(
@@ -2252,14 +2286,14 @@ glrn$param_set$set_values(
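The aggregation function receives a list with one early-stopped value per resampling iteration. A quick standalone check with hypothetical values (independent of mlr3) shows why we need `unlist()` before taking the maximum:

```r
# hypothetical early-stopped iteration counts from a 3-fold CV
early_stopped = list(117L, 243L, 198L)

# max() on a list of integers errors, so flatten it first
aggr = function(x) as.integer(max(unlist(x)))
aggr(early_stopped)  # 243
```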

Now, we change the validation data from `0.3` to `"test"`, where we can omit the `ids` specification as LightGBM is the base learner.

```{r solutions-129}
```{r solutions-132}
set_validate(glrn, validate = "test")
```

Next, we create the autotuner using the configuration given in the instructions.
As the internal validation measures are calculated by `lightgbm` and not `mlr3`, we need to specify whether the metric should be minimized.

```{r solutions-130}
```{r solutions-133}
at_lgbm = auto_tuner(
learner = glrn,
tuner = tnr("internal"),
@@ -2272,7 +2306,7 @@ at_lgbm$id = "at_lgbm"

Finally, we set up the benchmark design, run it, and evaluate the learners in terms of their classification accuracy.

```{r solutions-131}
```{r solutions-134}
design = benchmark_grid(
task = tsk_pima,
learners = list(at_lgbm, lrn("classif.rpart")),
@@ -2286,7 +2320,7 @@ bmr$aggregate(msr("classif.acc"))

3. Consider the code below:

```{r solutions-132}
```{r solutions-135}
branch_lrn = as_learner(
ppl("branch", list(
lrn("classif.ranger"),
@@ -2349,7 +2383,7 @@ Note that we would normally recommend setting the validation data to `"test"` wh

4. Look at the (failing) code below:

```{r solutions-133, error = TRUE}
```{r solutions-136, error = TRUE}
tsk_sonar = tsk("sonar")
glrn = as_learner(
po("pca") %>>% lrn("classif.xgboost", validate = 0.3)
14 changes: 14 additions & 0 deletions book/chapters/chapter1/introduction_and_overview.qmd
@@ -3,6 +3,20 @@ aliases:
- "/introduction_and_overview.html"
---


```{r}
# extra packages that must be installed in the docker image
remotes::install_github("mlr-org/mlr3")
remotes::install_github("mlr-org/mlr3pipelines")
remotes::install_github("mlr-org/mlr3fairness@weights")
remotes::install_github("mlr-org/mlr3learners")
remotes::install_github("mlr-org/mlr3extralearners")
remotes::install_cran("qgam")
remotes::install_github("mlr-org/mlr3batchmark")
remotes::install_cran("iml")
remotes::install_github("mlr-org/mlr3spatiotempcv@task_row_hash")
```

# Introduction and Overview {#sec-introduction}

{{< include ../../common/_setup.qmd >}}
Binary file added book/chapters/chapter10/Rplots.pdf
Binary file not shown.