`xplainfi`

The goal of xplainfi is to collect common feature importance methods under a unified and extensible interface.

It is built around mlr3 as available abstractions for learners, tasks, measures, etc. greatly simplify the implementation of importance measures.

Installation

You can install the development version of xplainfi like so:

# install.packages(pak)
pak::pak("jemus42/xplainfi")

Example: PFI

Here is a basic example on how to calculate PFI for a given learner and task, using repeated cross-validation as resampling strategy and computing PFI within each resampling 5 times on the friedman1 task (see ?mlbench::mlbench.friedman1).

The friedman1 task has the following structure:

$$y = 10 \sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + \varepsilon$$

Where $x_{1,2,3,4,5}$ are named important1 through important5 in the Task, with additional numbered unimportant features without effect on $y$.

library(xplainfi)
library(mlr3learners)
#> Loading required package: mlr3

task = tgen("friedman1")$generate(1000)
learner = lrn("regr.ranger", num.trees = 100)
measure = msr("regr.mse")

pfi = PFI$new(
    task = task,
    learner = learner,
    measure = measure,
    resampling = rsmp("cv", folds = 3),
    n_repeats = 5
)

Compute and print PFI scores:

pfi$compute()
pfi$importance()
#> Key: <feature>
#>          feature   importance
#>           <char>        <num>
#>  1:   important1  7.647695981
#>  2:   important2  8.221675536
#>  3:   important3  1.864822092
#>  4:   important4 13.471577093
#>  5:   important5  2.368559017
#>  6: unimportant1 -0.015980444
#>  7: unimportant2  0.007477719
#>  8: unimportant3  0.004806562
#>  9: unimportant4  0.019497114
#> 10: unimportant5  0.001911799

If it aides interpretation, importances can also be calculates as the ratio rather then the difference between the baseline and post-permutation losses:

pfi$importance(relation = "ratio")
#> Key: <feature>
#>          feature importance
#>           <char>      <num>
#>  1:   important1  2.5799596
#>  2:   important2  2.7087461
#>  3:   important3  1.3877158
#>  4:   important4  3.7800743
#>  5:   important5  1.4901341
#>  6: unimportant1  0.9963189
#>  7: unimportant2  1.0013810
#>  8: unimportant3  1.0014155
#>  9: unimportant4  1.0037671
#> 10: unimportant5  0.9999717

When PFI is computed based on resampling with multiple iterations, and / or multiple permutation iterations, the individual scores can be retrieved as a data.table:

str(pfi$scores())
#> Classes 'data.table' and 'data.frame':   150 obs. of  6 variables:
#>  $ feature          : chr  "important1" "important1" "important1" "important1" ...
#>  $ iter_rsmp        : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ iter_repeat      : int  1 2 3 4 5 1 2 3 4 5 ...
#>  $ regr.mse_baseline: num  5.3 5.3 5.3 5.3 5.3 ...
#>  $ regr.mse_post    : num  13.1 12.7 13.1 14.3 12.8 ...
#>  $ importance       : num  7.79 7.44 7.85 9.02 7.52 ...
#>  - attr(*, ".internal.selfref")=<externalptr>

Where iter_rsmp corresponds to the resampling iteration, i.e., 3 for 3-fold cross-validation, and iter_repeat corresponds to the permutation iteration within each resampling iteration, 5 in this case. While pfi$importance() contains the means across all iterations, pfi$scores() allows you to manually visualize or aggregate them in any way you see fit.

For example:

library(ggplot2)

ggplot(
    pfi$scores(),
    aes(x = importance, y = reorder(feature, importance))
) +
    geom_boxplot(color = "#f44560", fill = alpha("#f44560", 0.4)) +
    labs(
        title = "Permutation Feature Importance on Friedman1",
        subtitle = "Computed over 3-fold CV with 5 permutations per iteration using Random Forest",
        x = "Importance",
        y = "Feature"
    ) +
    theme_minimal(base_size = 16) +
    theme(
        plot.title.position = "plot",
        panel.grid.major.y = element_blank()
    )

If the measure in question needs to be maximized rather than minimized (like $R^2$), the internal importance calculation takes that into account via the $minimize property of the measure and calculates importances such that the intuition “performance improvement” -> “higher importance score” still holds:

pfi = PFI$new(
    task = task,
    learner = learner,
    measure = msr("regr.rsq")
)
#> ℹ No <Resampling> provided
#> Using `resampling = rsmp("holdout")` with default `ratio = 0.67`.

pfi$compute()
pfi$importance()
#> Key: <feature>
#>          feature    importance
#>           <char>         <num>
#>  1:   important1  0.2705871350
#>  2:   important2  0.3035832691
#>  3:   important3  0.0640029372
#>  4:   important4  0.4860720486
#>  5:   important5  0.0720620500
#>  6: unimportant1 -0.0022247186
#>  7: unimportant2  0.0022554009
#>  8: unimportant3 -0.0028430988
#>  9: unimportant4 -0.0008701439
#> 10: unimportant5 -0.0015809452

See vignette("xplainfi") for more examples.

Name		Name	Last commit message	Last commit date
Latest commit History 266 Commits
.github		.github
.vscode		.vscode
R		R
man		man
tests		tests
vignettes		vignettes
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
.lintr		.lintr
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
LICENSE.md		LICENSE.md
Makefile		Makefile
NAMESPACE		NAMESPACE
NEWS.md		NEWS.md
README.Rmd		README.Rmd
README.md		README.md
_pkgdown.yml		_pkgdown.yml
air.toml		air.toml
codecov.yml		codecov.yml
xplainfi.Rproj		xplainfi.Rproj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Licenses found

Uh oh!

Repository files navigation

`xplainfi`

Installation

Example: PFI

About

Licenses found

Uh oh!

Releases

Sponsor this project

Uh oh!

Uh oh!

Contributors 3

Uh oh!

Languages

Uh oh!

License

Licenses found

mlr-org/xplainfi

Folders and files

Latest commit

History

Repository files navigation

xplainfi

Installation

Example: PFI

About

Resources

License

Licenses found

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Sponsor this project

Uh oh!

Uh oh!

Contributors 3

Uh oh!

Languages

`xplainfi`