Open
63 commits
1747641
fix
ReinierKoops Feb 9, 2025
08c6e6f
improve plots (displays discrete stats instead of non-existent contin…
ReinierKoops Mar 13, 2025
732c88c
add more comments, cleanup some code, add type checking
ReinierKoops Mar 14, 2025
98c809d
fix utils
ReinierKoops Mar 14, 2025
72d5133
fix arrayfuncs
ReinierKoops Mar 14, 2025
87637b4
fix exceptions
ReinierKoops Mar 14, 2025
49b3eb2
fix model_interpret
ReinierKoops Mar 14, 2025
ab4c1f6
fix test shap dependence
ReinierKoops Mar 14, 2025
167ed29
fix test model interpret
ReinierKoops Mar 14, 2025
40d97ea
fix arrayfuncs
ReinierKoops Mar 14, 2025
a8b3a53
fix model_interpret
ReinierKoops Mar 14, 2025
8b6a6cb
fix model_interpret
ReinierKoops Mar 14, 2025
0c8d064
fix some more files
ReinierKoops Mar 15, 2025
7864142
add more comments, cleanup some code, add type checking
ReinierKoops Mar 15, 2025
98eaf5c
add more comments, cleanup some code, add type checking
ReinierKoops Mar 15, 2025
ebb7ad0
add more comments, cleanup some code, add type checking
ReinierKoops Mar 15, 2025
ef0c8d4
add more comments, cleanup some code, add type checking
ReinierKoops Mar 15, 2025
faa2bd2
add more comments, cleanup some code, add type checking
ReinierKoops Mar 15, 2025
bb80871
add more comments, cleanup some code, add type checking
ReinierKoops Mar 15, 2025
9bec0ea
add more comments, cleanup some code, add type checking
ReinierKoops Mar 15, 2025
974f226
add more comments, cleanup some code, add type checking
ReinierKoops Mar 15, 2025
8398ab4
added progress bar
ReinierKoops Mar 15, 2025
d875a51
fix some more files
ReinierKoops Mar 15, 2025
261be74
remove enter
ReinierKoops Mar 15, 2025
232361b
merge main
ReinierKoops Mar 15, 2025
008da55
change package versions
ReinierKoops Mar 15, 2025
d416f1a
change package versions
ReinierKoops Mar 15, 2025
ca8396d
change package versions
ReinierKoops Mar 15, 2025
3480545
change package versions
ReinierKoops Mar 15, 2025
f4a1d8a
small cleanup
ReinierKoops Mar 15, 2025
b23b515
small cleanup
ReinierKoops Mar 15, 2025
6dff53e
start adding more extensive testing of plotting & standardizing it
ReinierKoops Mar 16, 2025
24177ef
restructure test suite -- will remove a lot of 'unit'-test content
ReinierKoops Mar 16, 2025
fbe3e1b
changes have become major. Therefore version is major. Things are not…
ReinierKoops Mar 16, 2025
aba197b
relax dependencies to earliest that support numpy v2
ReinierKoops Mar 17, 2025
c4badda
restructure probatus & add tests accordingly
ReinierKoops Mar 17, 2025
8dd16fe
remove permutation plot since integrated in integration test
ReinierKoops Mar 17, 2025
c8fffad
save plots for PRs'
ReinierKoops Mar 17, 2025
058b95d
add backwards compatibility matplotlib
ReinierKoops Mar 17, 2025
c64d525
update API docs
ReinierKoops Mar 17, 2025
066f0fa
update docs
ReinierKoops Mar 17, 2025
df6ee1e
adding multi-class & using newer way of SHAP
ReinierKoops Mar 18, 2025
972fd2d
progressing through adding more multi-class options
ReinierKoops Mar 19, 2025
3180422
progressing through adding more multi-class options
ReinierKoops Mar 19, 2025
83503bc
remove mocking
ReinierKoops Mar 20, 2025
bd4dfcc
multi-class is now working
ReinierKoops Mar 20, 2025
045dd7f
split up Shap recursive feature elimination into separate helper classes
ReinierKoops Mar 21, 2025
db0794c
support for pipelines, updated shap rfe plot, onto the next plots
ReinierKoops Mar 21, 2025
9a023d0
updated SHAP related plots & changed to newer versions of the same pl…
ReinierKoops Mar 21, 2025
59d9b86
plots of interpreter should no longer display twice
ReinierKoops Mar 23, 2025
3492428
fix Multi-class SHAP RFECV and provide two execution modes optimized …
ReinierKoops Mar 26, 2025
b93a681
fix nbs
ReinierKoops Mar 26, 2025
6a3191b
work on improving test (& fixing small bugs)
ReinierKoops Mar 29, 2025
a25f4c3
refactoring
ReinierKoops Mar 30, 2025
8c18470
refactoring
ReinierKoops Apr 1, 2025
e3bcdba
refactoring -- added estimator, data and shap manager classes
ReinierKoops Apr 4, 2025
204f068
minimal shap wrapper created
ReinierKoops Apr 5, 2025
9ebeac6
halfway integrating wrapper classes
ReinierKoops Apr 6, 2025
d8cfd7d
started creating easy classes in order to simplify existing classes …
ReinierKoops Apr 8, 2025
56393cf
-- in the middle of refactoring
ReinierKoops Apr 10, 2025
7bf99ea
more plotting
Apr 10, 2025
ea2ecd0
-- in the middle of refactoring
ReinierKoops Apr 15, 2025
2734999
-- in the middle of refactoring
ReinierKoops Apr 16, 2025
72 changes: 36 additions & 36 deletions .github/workflows/unit_tests.yml
@@ -22,46 +22,46 @@ jobs:
os: windows-latest
python-version: [3.9, "3.10", "3.11", "3.12"]
steps:
- uses: actions/checkout@master
- uses: actions/checkout@master

- name: Get latest CMake and Ninja
uses: lukka/get-cmake@latest
with:
cmakeVersion: latest
ninjaVersion: latest
- name: Get latest CMake and Ninja
uses: lukka/get-cmake@latest
with:
cmakeVersion: latest
ninjaVersion: latest

- name: Install LIBOMP on Macos runners
if: runner.os == 'macOS'
run: |
brew install libomp
- name: Install LIBOMP on Macos runners
if: runner.os == 'macOS'
run: |
brew install libomp

- name: Setup Python
uses: actions/setup-python@master
with:
python-version: ${{ matrix.python-version }}
- name: Setup Python
uses: actions/setup-python@master
with:
python-version: ${{ matrix.python-version }}

- name: Install Python dependencies
run: |
pip3 install --upgrade setuptools pip
pip3 install ".[all]"
- name: Install Python dependencies
run: |
pip3 install --upgrade setuptools pip
pip3 install ".[all]"

- name: Run linters
run: |
pre-commit install
pre-commit run --all-files
- name: Run linters
run: |
pre-commit install
pre-commit run --all-files

- name: Run (unit) tests
env:
TEST_NOTEBOOKS: 0
run: |
pytest --cov=probatus/binning --cov=probatus/metric_volatility --cov=probatus/missing_values --cov=probatus/sample_similarity --cov=probatus/stat_tests --cov=probatus/utils --cov=probatus/interpret/ --ignore==tests/interpret/test_inspector.py --cov-report=xml
pyflakes probatus
- name: Run (unit) tests
env:
TEST_NOTEBOOKS: 0
run: |
pytest --save-plots true
pyflakes probatus

- name: Upload coverage to Codecov
if: github.ref == 'refs/heads/main'
uses: codecov/codecov-action@v1
with:
token: ${{ secrets.CODECOV_TOKEN }}
file: ./coverage.xml
flags: unittests
fail_ci_if_error: false
- name: Upload coverage to Codecov
if: github.ref == 'refs/heads/main'
uses: codecov/codecov-action@v1
with:
token: ${{ secrets.CODECOV_TOKEN }}
file: ./coverage.xml
flags: unittests
fail_ci_if_error: false
2 changes: 1 addition & 1 deletion .gitignore
@@ -4,7 +4,7 @@ notebooks/explore
.idea/
docs/source/.ipynb_checkpoints/
notebooks/.ipynb_checkpoints/

tests/**/*.png

# Created by https://www.gitignore.io/api/macos,python,pycharm,jupyternotebooks,visualstudiocode
# Edit at https://www.gitignore.io/?templates=macos,python,pycharm,jupyternotebooks,visualstudiocode
File renamed without changes.
53 changes: 53 additions & 0 deletions docs/api/core.md
@@ -0,0 +1,53 @@
# Core Components

The core module provides fundamental building blocks and interfaces used throughout the probatus package.

## Overview

The core module establishes the architectural foundation for probatus components through abstract base classes and custom exceptions. This module defines consistent interfaces that ensure all probatus components follow similar patterns for fitting, computing results, and visualization.

### Key Components

#### Abstract Base Classes

##### BaseFitComputeClass
- Defines the standard interface for components that need to be fitted to data and compute results
- Implements a consistent pattern with three essential methods:
    - `fit()`: Trains the component on input data
    - `compute()`: Calculates results based on fitted state
    - `fit_compute()`: Combines fitting and computation in a single step
- Provides error checking to ensure methods are called in the correct sequence

##### BaseFitComputePlotClass
- Extends BaseFitComputeClass with visualization capabilities
- Adds the required `plot()` method for result visualization
- Used as the base class for components that need to generate visual insights
- Ensures consistent interface for all components with plotting functionality

#### Exceptions

##### NotFittedError
- Custom exception raised when a method requiring a fitted state is called prematurely
- Provides clear error messages indicating which operation failed
- Helps prevent incorrect usage of probatus components
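
As a rough illustration of the fit/compute contract and the fitted-state check described above, a minimal sketch might look like the following. The names `SketchFitCompute`, `MeanTracker`, and `_check_if_fitted` are hypothetical and exist only for this example, and the `NotFittedError` defined here is a local stand-in; the actual classes live in `probatus.core.base` and `probatus.core.exceptions` and may differ in detail.

```python
from abc import ABC, abstractmethod


class NotFittedError(Exception):
    """Local stand-in: raised when a fitted state is required but missing."""


class SketchFitCompute(ABC):
    """Hypothetical base class illustrating the fit/compute pattern."""

    fitted = False

    @abstractmethod
    def fit(self, data):
        """Fit the component and return self so calls can be chained."""

    @abstractmethod
    def compute(self):
        """Return results derived from the fitted state."""

    def fit_compute(self, data):
        # Convenience wrapper: fit, then compute, in a single call.
        return self.fit(data).compute()

    def _check_if_fitted(self):
        # Explicit state validation before any computation.
        if not self.fitted:
            raise NotFittedError(
                f"{type(self).__name__} is not fitted; call fit() first."
            )


class MeanTracker(SketchFitCompute):
    """Toy component: remembers the mean of the data it was fitted on."""

    def fit(self, data):
        self.mean_ = sum(data) / len(data)
        self.fitted = True
        return self

    def compute(self):
        self._check_if_fitted()
        return self.mean_
```

In this sketch, `MeanTracker().fit([1, 2, 3]).compute()` chains the two steps, while calling `MeanTracker().compute()` before `fit()` raises `NotFittedError`.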

### Core Design Principles

The core module implements several key design principles that are reflected throughout the probatus package:

1. **Consistent Interfaces**: All analytical components share a common interface pattern, making the library easier to learn and use.

2. **Method Chaining**: Base classes support method chaining (returning `self` from fit methods) for more concise code.

3. **Explicit State Validation**: Components explicitly check their fitted state before performing operations that require prior fitting.

4. **Separation of Concerns**: Clear separation between data fitting, result computation, and visualization.

5. **Extensibility**: Abstract base classes make it straightforward to add new components while maintaining consistent behavior.

These core components serve as the foundation for more specialized classes throughout probatus, ensuring that all parts of the library work together seamlessly.

## Implementation

::: probatus.core.base
::: probatus.core.exceptions
91 changes: 91 additions & 0 deletions docs/api/dataset.md
@@ -0,0 +1,91 @@
# Dataset

The goal of the dataset module is to understand how different two samples are from a multivariate perspective.

One way to measure this is a Resemblance Model. Given two datasets, say X1 and X2, one can analyse how easy it is to recognize which dataset a randomly selected row comes from. The Resemblance Model assigns label 0 to the rows of X1 and label 1 to the rows of X2, and trains a binary classification model to predict which sample a given row comes from.
By looking at the test AUC, one can conclude that the samples have different distributions if the AUC is significantly higher than 0.5. Furthermore, by analysing feature importance one can understand which features have predictive power, that is, which features differ most between the two samples.
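
The idea can be sketched by hand with scikit-learn, without using probatus at all. The helper name `resemblance_auc`, the choice of a random forest, and the synthetic data are assumptions made for this illustration; they are not the library's implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def resemblance_auc(X1, X2, random_state=0):
    """Train a classifier to tell X1 (label 0) from X2 (label 1); return test AUC."""
    X = np.vstack([X1, X2])
    y = np.concatenate([np.zeros(len(X1)), np.ones(len(X2))])
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=random_state
    )
    model = RandomForestClassifier(
        n_estimators=50, max_depth=3, random_state=random_state
    )
    model.fit(X_train, y_train)
    # Column 1 holds the predicted probability that a row comes from X2.
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])


rng = np.random.default_rng(0)
# Two samples drawn from the same distribution: AUC should stay near 0.5.
same = resemblance_auc(rng.normal(size=(500, 3)), rng.normal(size=(500, 3)))
# Second sample shifted by one standard deviation: AUC should rise well above 0.5.
shifted = resemblance_auc(
    rng.normal(size=(500, 3)), rng.normal(loc=1.0, size=(500, 3))
)
```

An AUC close to 0.5 means the classifier cannot tell the samples apart; the further above 0.5 it is, the more the two distributions differ.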

<img src="../img/resemblance_model_schema.png"/>

## Overview

The dataset module provides tools for analyzing and comparing multivariate datasets using machine learning approaches. This module is particularly valuable for:

- Detecting distribution shifts between datasets
- Identifying features that contribute to dataset differences
- Understanding data drift in production environments
- Validating data quality and consistency
- Supporting data monitoring and maintenance

### Key Components

#### BaseResemblanceModel
- Abstract base class that provides the foundation for all resemblance models
- Handles data preparation, model training, and evaluation
- Supports both standalone estimators and sklearn Pipelines
- Features:
    - Flexible scoring metrics
    - Train/test split configuration
    - Comprehensive performance reporting
    - Parallel processing support

#### SHAPImportanceResemblance (Recommended)
- Leverages SHAP (SHapley Additive exPlanations) for feature importance analysis
- Key features:
    - Tree-based model interpretation
    - Detailed feature importance analysis
    - Support for complex feature interactions
    - Multiple visualization options (bar, dot, violin plots)
    - Advanced SHAP configuration options

#### PermutationImportanceResemblance
- Uses permutation feature importance for analysis
- Key features:
    - Direct measurement of feature impact on model performance
    - Configurable number of permutation iterations
    - Simple and interpretable results
    - Robust to feature interactions
    - Efficient computation with parallel processing

### Features
- Multiple analysis approaches:
    - SHAP-based importance (recommended for detailed analysis)
    - Permutation-based importance (faster for initial screening)
- Comprehensive model support:
    - Binary classification models
    - Tree-based algorithms
    - Custom model integration
    - Pipeline compatibility
- Advanced analysis capabilities:
    - Feature importance ranking
    - Distribution shift detection
    - Feature interaction analysis
- Detailed reporting and visualization:
    - Feature importance plots
    - Performance metrics
    - Distribution comparisons

### Use Cases
- Data drift detection
- Feature importance analysis
- Dataset comparison
- Quality assurance
- Production monitoring
- Data validation

The module is designed to provide robust tools for understanding and comparing multivariate datasets, with a focus on interpretability and practical application.

The following features are implemented:

- [BaseResemblanceModel][probatus.dataset.resemblance_modeler.BaseResemblanceModel]:
Abstract base class that provides core functionality for all resemblance models. Handles data preparation, model training, and performance evaluation.

- [SHAPImportanceResemblance (Recommended)][probatus.dataset.resemblance_modeler.SHAPImportanceResemblance]:
This class applies the SHAP library to interpret tree-based resemblance models. It offers multiple visualization options and detailed feature importance analysis.

- [PermutationImportanceResemblance][probatus.dataset.resemblance_modeler.PermutationImportanceResemblance]:
This class applies permutation feature importance to understand which features the model relies on most. The importance indicates how much the test performance drops when a given feature is permuted; the higher the importance, the more likely it is that the feature's distribution differs between X1 and X2.
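
To make the permutation idea concrete, the sketch below uses scikit-learn's `permutation_importance` directly on a hand-built resemblance problem in which only the first feature differs between the two samples. The synthetic data and the random forest are assumptions for illustration, not the class's internals.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X1 = rng.normal(size=(500, 3))
X2 = rng.normal(size=(500, 3))
X2[:, 0] += 2.0  # only feature 0 differs between the two samples

# Label rows of X1 as 0 and rows of X2 as 1, as in the resemblance setup.
X = np.vstack([X1, X2])
y = np.concatenate([np.zeros(len(X1)), np.ones(len(X2))])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

result = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=0
)
# Mean drop in test AUC per feature when that feature's values are shuffled.
importances = result.importances_mean
```

Here the largest drop should belong to feature 0, the only feature whose distribution was shifted.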

## Implementation

::: probatus.dataset.resemblance_modeler
7 changes: 0 additions & 7 deletions docs/api/feature_elimination.md

This file was deleted.

80 changes: 80 additions & 0 deletions docs/api/features.md
@@ -0,0 +1,80 @@
# Feature Elimination

This module focuses on feature elimination and contains the following:

- [ShapRFECV][probatus.feature_elimination.feature_elimination.ShapRFECV]: Performs backwards Recursive Feature Elimination using SHAP feature importance. It supports binary classification, multi-class classification, and regression models, as well as hyperparameter optimization at every feature elimination step. For XGBoost, LightGBM and CatBoost it also supports early stopping of the model fitting process. Early stopping can serve as an alternative regularization technique to hyperparameter optimization of the number of base trees in gradient boosted tree models, which is particularly useful when dealing with large datasets.
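
The backward elimination loop can be sketched roughly as follows. To keep the example dependent on scikit-learn only, it swaps in the model's impurity-based `feature_importances_` where ShapRFECV would use SHAP values, so it is not the library's algorithm. The function name `backward_elimination` and its parameters are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate


def backward_elimination(model, X, y, step=1, min_features=1, cv=3):
    """Drop the least important features round by round; record the CV score."""
    remaining = list(range(X.shape[1]))
    report = []
    while True:
        cv_result = cross_validate(
            model, X[:, remaining], y, cv=cv, scoring="roc_auc",
            return_estimator=True,
        )
        # Average importance over the fold models (stand-in for mean |SHAP|).
        importance = np.mean(
            [est.feature_importances_ for est in cv_result["estimator"]], axis=0
        )
        report.append((list(remaining), float(cv_result["test_score"].mean())))
        if len(remaining) <= min_features:
            break
        n_drop = min(step, len(remaining) - min_features)
        # Positions (within `remaining`) of the least important features.
        drop = set(np.argsort(importance)[:n_drop].tolist())
        remaining = [f for i, f in enumerate(remaining) if i not in drop]
    return report


X, y = make_classification(
    n_samples=300, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
report = backward_elimination(
    RandomForestClassifier(n_estimators=30, random_state=0), X, y
)
```

Each entry of `report` pairs the feature indices used in a round with the mean cross-validated AUC for that round, which is the kind of per-step information a feature selection report is built from.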

## Overview

The feature elimination module provides sophisticated tools for selecting the most important features in machine learning models using SHAP-based importance metrics. This module is particularly valuable for:

- Reducing model complexity while maintaining performance
- Identifying the most impactful features
- Optimizing model training time
- Improving model interpretability

### Key Components

#### ShapRFECV (Recursive Feature Elimination with Cross-Validation)
- Implements backward feature elimination using SHAP importance
- Key features:
    - Supports both classification and regression tasks
    - Integrates with hyperparameter optimization via scikit-learn-compatible search estimators such as GridSearchCV, RandomizedSearchCV, or BayesSearchCV
    - Provides cross-validation at each elimination step
    - Generates detailed feature selection reports
    - Supports custom feature selection strategies
    - Includes advanced methods to select optimal feature subsets:
        - Best performance-based selection
        - Coherent performance with minimal variance
        - Parsimonious selection (fewest features within a performance threshold)
    - Supports sklearn Pipeline objects for seamless preprocessing integration
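
Two of the subset selection strategies listed above, best performance and parsimonious selection, can be illustrated on a small per-round report. The report layout, the field names `n_features` and `val_mean`, the function names, and the scores themselves are made up for this sketch and do not reflect the library's actual report format.

```python
def best_subset(report):
    """Round with the highest mean validation score."""
    return max(report, key=lambda row: row["val_mean"])


def parsimonious_subset(report, threshold=0.01):
    """Smallest feature set whose score is within `threshold` of the best."""
    best_score = best_subset(report)["val_mean"]
    candidates = [
        row for row in report if row["val_mean"] >= best_score - threshold
    ]
    return min(candidates, key=lambda row: row["n_features"])


# Made-up numbers, for illustration only.
report = [
    {"n_features": 10, "val_mean": 0.842},
    {"n_features": 7, "val_mean": 0.851},
    {"n_features": 4, "val_mean": 0.846},
    {"n_features": 2, "val_mean": 0.803},
]

best = best_subset(report)
parsimonious = parsimonious_subset(report, threshold=0.01)
```

With these numbers the best round uses 7 features, while the parsimonious choice accepts a slightly lower score in exchange for only 4 features.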

#### Early Stopping Support
- Directly integrated into ShapRFECV (no separate class needed)
- Optimized for popular gradient boosting frameworks:
    - LightGBM
    - XGBoost
    - CatBoost
- Benefits:
    - Faster training process
    - Prevents overfitting
    - Reduces computational resources
    - Maintains model performance

### Features
- Flexible feature elimination strategies:
    - Fixed number of features per step
    - Percentage-based elimination (adaptive step sizes)
    - Custom elimination criteria
- Comprehensive model support:
    - Binary classification
    - Multi-class classification
    - Regression models
- Advanced optimization capabilities:
    - Hyperparameter tuning integration
    - Cross-validation at each step
    - Early stopping for gradient boosted models
    - Support for sample weighting
    - Optional SHAP variance penalization for more stable feature selection
- Detailed reporting and visualization:
    - Feature importance tracking
    - Performance metrics across steps
    - Selection process visualization
    - Error bars showing model stability
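
The difference between a fixed step and a percentage-based step can be shown with a small schedule calculator. It assumes that an integer `step` removes a fixed count per round and that a float `step` in (0, 1) removes that fraction of the remaining features, rounded down but never fewer than one. The function names are hypothetical and the exact rounding used by the library may differ.

```python
import math


def features_to_remove(n_remaining, step, min_features=1):
    """Number of features to drop in one round under the assumed rules."""
    if isinstance(step, float):
        if not 0 < step < 1:
            raise ValueError("a float step must lie strictly between 0 and 1")
        n_drop = max(1, math.floor(n_remaining * step))
    else:
        if step < 1:
            raise ValueError("an int step must be at least 1")
        n_drop = step
    # Never go below the requested minimum number of features.
    return max(0, min(n_drop, n_remaining - min_features))


def elimination_schedule(n_features, step, min_features=1):
    """Feature counts after each round, starting from n_features."""
    counts = [n_features]
    while counts[-1] > min_features:
        counts.append(
            counts[-1] - features_to_remove(counts[-1], step, min_features)
        )
    return counts


fixed = elimination_schedule(10, 3)        # fixed count per round
adaptive = elimination_schedule(20, 0.5)   # half of the remaining features
```

A percentage-based step removes many features early, when there are plenty of weak candidates, and slows down as the set shrinks, which is why it is described as adaptive.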

### Use Cases
- Large-scale feature selection
- Model optimization
- Feature importance analysis
- Computational resource optimization
- Model interpretability enhancement
- Production model preparation

The module is designed to provide a robust and efficient approach to feature selection using RFECV in combination with SHAP, with a focus on practical application in real-world scenarios.

## Implementation

::: probatus.features.shap_recursive_feature_elimination
::: probatus.features.shap_recursive_feature_elimination_helper
::: probatus.features.shap_early_stopping_recursive_feature_elimination
::: probatus.features.shap_early_stopping_recursive_feature_elimination_helper