LSH-DynED: A Dynamic Ensemble Framework with LSH-Based Undersampling for Evolving Multi-Class Imbalanced Classification

This repository contains the implementation of the LSH-DynED model, a novel, robust, and resilient approach for classifying imbalanced and non-stationary data streams with multiple classes.

Authors:

Soheil Abadifard, Kansas State University (abadifard@k-state.edu)
Fazli Can, Bilkent University (canf@cs.bilkent.edu.tr)

Overview

The classification of imbalanced data streams, where class distributions are unequal and change over time, is a significant challenge in machine learning, especially in multi-class scenarios. LSH-DynED addresses this challenge by integrating Locality Sensitive Hashing with Random Hyperplane Projections (LSH-RHP) into the Dynamic Ensemble Diversification (DynED) framework. This marks the first application of LSH-RHP for undersampling in the context of imbalanced non-stationary data streams.

LSH-DynED undersamples the majority classes using LSH-RHP to create a balanced training set, which in turn improves the prediction accuracy of the ensemble. Our comprehensive experiments on 23 real-world and ten semi-synthetic datasets demonstrate that LSH-DynED outperforms 15 state-of-the-art methods in terms of both Kappa and mG-Mean effectiveness. The model excels in handling large-scale, high-dimensional datasets with significant class imbalances and shows strong adaptation and robustness in real-world scenarios.
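To make the undersampling idea concrete, here is a minimal NumPy sketch of random-hyperplane hashing used to thin out the larger classes. It is an illustration under stated assumptions, not the repository's code (which relies on Faiss): the function names `lsh_rhp_undersample` and `balance`, the round-robin draw across buckets, and the choice to shrink every class to the size of the smallest one are all assumptions made for this example.

```python
import numpy as np


def lsh_rhp_undersample(X, n_keep, n_hyperplanes=5, seed=0):
    """Return indices of about n_keep rows of X, spread across LSH buckets.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    target = min(n_keep, len(X))

    # Random hyperplanes through the origin; each contributes one sign bit.
    planes = rng.standard_normal((X.shape[1], n_hyperplanes))
    bits = (X @ planes) >= 0

    # Pack the sign bits into an integer bucket id in [0, 2**n_hyperplanes).
    codes = bits.astype(np.int64) @ (1 << np.arange(n_hyperplanes))
    buckets = [np.flatnonzero(codes == c) for c in np.unique(codes)]
    for b in buckets:
        rng.shuffle(b)

    # Round-robin over buckets so every occupied region stays represented.
    chosen, depth = [], 0
    while len(chosen) < target:
        for b in buckets:
            if depth < len(b) and len(chosen) < target:
                chosen.append(b[depth])
        depth += 1
    return np.sort(np.array(chosen, dtype=np.int64))


def balance(X, y, n_hyperplanes=5, seed=0):
    """Undersample every class down to the smallest class size (assumption)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.min()
    keep = []
    for c in classes:
        idx = np.flatnonzero(y == c)
        keep.append(idx[lsh_rhp_undersample(X[idx], target, n_hyperplanes, seed)])
    keep = np.sort(np.concatenate(keep))
    return X[keep], y[keep]
```

With five hyperplanes, each instance receives a 5-bit code, so a class is split into at most 32 buckets. Drawing from the buckets in turn is one simple way to keep the reduced class spread over the feature space instead of concentrated in its densest region.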

Key Features:

- Novel Undersampling Technique: First application of Locality Sensitive Hashing with Random Hyperplane Projections (LSH-RHP) for undersampling in multi-class imbalanced non-stationary data streams.
- Dynamic Ensemble Framework: Extends and modifies the DynED framework to handle dynamic imbalance ratios in multi-class imbalanced data stream tasks.
- State-of-the-Art Performance: Outperforms other methods in both Kappa and mG-Mean effectiveness measures on a wide range of datasets.
- Robust and Resilient: Effectively handles concept drift and dynamic changes in class distributions.
- Open Source: The implementation is publicly available to encourage further research and improvements.

How it Works

LSH-DynED operates in three main stages:

1. Prediction and Training: A subset of the ensemble, the "selected components," predicts the label of each incoming data instance via majority voting. These components are then trained on the new data instance.
2. Drift Detection and Adaptation: The ADWIN drift detector monitors the system's performance. If drift is detected, a new component is trained on recent data from a balanced dataset created by our novel undersampling method and added to a pool of "reserved components".
3. Component Selection: This stage updates the ensemble's components to maintain a balance between diversity and accuracy. Components are selected from the combined pool of "selected" and "reserved" components based on their accuracy and a modified Maximal Marginal Relevance (MMR) algorithm.
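The accuracy-versus-diversity trade-off in the component selection stage can be sketched with textbook MMR. This README does not describe how the paper modifies MMR, so the block below shows only the standard greedy form. The function name `mmr_select`, the weight `lam`, and the idea of measuring similarity as pairwise agreement between components' predictions are assumptions for illustration.

```python
import numpy as np


def mmr_select(accuracy, similarity, k, lam=0.5):
    """Greedy textbook MMR over ensemble components (illustrative only).

    accuracy:   per-component accuracy, shape (n,)
    similarity: pairwise similarity between components, shape (n, n),
                e.g. the fraction of recent instances on which two
                components predict the same label (an assumption here)
    k:          number of components to keep
    lam:        1.0 favors accuracy only, 0.0 favors diversity only
    """
    if k <= 0:
        return []
    accuracy = np.asarray(accuracy, dtype=float)
    similarity = np.asarray(similarity, dtype=float)

    # Start from the most accurate component.
    selected = [int(np.argmax(accuracy))]
    candidates = set(range(len(accuracy))) - set(selected)

    while candidates and len(selected) < k:
        def score(i):
            # Penalize a candidate by its closest already-selected component.
            redundancy = similarity[i, selected].max()
            return lam * accuracy[i] - (1 - lam) * redundancy

        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Components 0 and 1 are accurate but nearly identical; 2 is weaker but different.
acc = [0.90, 0.89, 0.70]
sim = [[1.00, 0.95, 0.10],
       [0.95, 1.00, 0.10],
       [0.10, 0.10, 1.00]]
print(mmr_select(acc, sim, k=2, lam=0.5))  # [0, 2]
print(mmr_select(acc, sim, k=2, lam=1.0))  # [0, 1]
```

With `lam=0.5`, the near-duplicate component 1 is passed over in favor of the less accurate but more diverse component 2. With `lam=1.0`, the selection reduces to ranking by accuracy.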

Implementation Details

The proposed method is implemented in Python 3.11.7 and utilizes the following libraries:

- River 0.21.1
- Faiss 1.7.4

The base classifier used is a Hoeffding Tree.

Reproducibility:

For the reproducibility of our results, our implementation is available on GitHub. We have provided all experimental details to make our approach open to new improvements. The baseline methods used for comparison are from the MOA framework, and other implementations are also publicly available.

Baselines

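| Category | Method | Implementation |
|---|---|---|
| General-Purpose Methods (GPM) | OzaBagAdwin (OBA) | MOA Framework |
| General-Purpose Methods (GPM) | Leveraging Bagging (LB) | MOA Framework |
| General-Purpose Methods (GPM) | ARF | MOA Framework |
| General-Purpose Methods (GPM) | SRP | MOA Framework |
| General-Purpose Methods (GPM) | KUE | MOA Framework |
| General-Purpose Methods (GPM) | BELS | GitHub Repository |
| General-Purpose Methods (GPM) | DynED | GitHub Repository |
| Imbalance-Specific Methods (ISM) | HD-VFDT | MOA Framework |
| Imbalance-Specific Methods (ISM) | GH-VFDT | MOA Framework |
| Imbalance-Specific Methods (ISM) | MUOB | MOA Framework |
| Imbalance-Specific Methods (ISM) | MOOB | MOA Framework |
| Imbalance-Specific Methods (ISM) | ARFR | MOA Framework |
| Imbalance-Specific Methods (ISM) | CSARF | MOA Framework |
| Imbalance-Specific Methods (ISM) | CALMID | MOA Framework |
| Imbalance-Specific Methods (ISM) | ROSE | GitHub Repository |
| Imbalance-Specific Methods (ISM) | MicFoal | MOA Framework |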

Usage

To run the LSH-DynED model, follow these steps:

1. Prepare Your Datasets:
- Create a directory (e.g., datasets/).
- Place all your dataset files (e.g., in .arff format) inside this directory. The script iterates through and processes every file in this folder.

2. Clone the Repository:
```
git clone https://github.com/user/LSH-DynED.git
cd LSH-DynED
```
3. Install Dependencies:
```
pip install -r requirements.txt
```
4. Configure the Script:
- Open the main.py file.
- Go to the last lines of the script:
```
if __name__ == "__main__":
    main('Path to dataset Directory')
```
- Change the path passed to main() so that it points to the directory you created in step 1. For example, if your folder is named datasets, the lines should look like this:
```
if __name__ == "__main__":
    main('datasets/')
```
5. Run the Model:
- Execute the script from your terminal:
```
python main.py
```
- The script runs the LSH-DynED model on each dataset in the specified folder.

Output

For each dataset processed (e.g., my_data.arff), the script will generate two new CSV files in the same directory:

- my_data.arff_mgmean.csv: Contains the mG-Mean scores calculated every 500 instances.
- my_data.arff_kappa.csv: Contains the prequential Kappa scores.
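For readers unfamiliar with the two measures, the sketch below computes Cohen's Kappa and a multi-class G-Mean (the geometric mean of per-class recalls) from a confusion matrix. It only illustrates the standard formulas. The helper names are hypothetical, classes with no true instances are skipped in the G-Mean by assumption, and the windowing and prequential bookkeeping done by the script are not reproduced here.

```python
import numpy as np


def confusion(y_true, y_pred, n_classes):
    """Confusion matrix with true classes on rows, predictions on columns."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm


def cohen_kappa(cm):
    """Kappa = (p_o - p_e) / (1 - p_e): agreement corrected for chance."""
    n = cm.sum()
    p_o = np.trace(cm) / n
    p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2
    return float((p_o - p_e) / (1 - p_e)) if p_e < 1 else 0.0


def mg_mean(cm):
    """Geometric mean of per-class recalls over classes that occur."""
    support = cm.sum(axis=1)
    seen = support > 0
    recall = np.diag(cm)[seen] / support[seen]
    return float(np.prod(recall) ** (1 / len(recall)))


y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
cm = confusion(y_true, y_pred, n_classes=3)
print(round(cohen_kappa(cm), 4))  # 0.6
print(round(mg_mean(cm), 4))      # 0.7211
```

Because the G-Mean multiplies the recalls, a single class that is never predicted correctly drives the score to zero, which is why it is informative for imbalanced streams.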

Hyperparameters:

The default hyperparameter values used in our experiments are detailed in the paper and are set for broad applicability without tuning to any specific dataset. The optimal values we determined are as follows:

- Active Components ($S_{slc}$): 10
- Training Samples ($n_{train}$): 20
- Test Samples ($n_{test}$): 50
- Hyperplanes ($n_v$): 5

Experimental Evaluation

We conducted a thorough experimental evaluation on 33 imbalanced datasets, which include 23 real datasets and ten semi-synthetic data streams. The results show that LSH-DynED demonstrates superior performance, especially on datasets with dynamic imbalance ratios.

For a detailed analysis of our results, including performance on specific datasets and comparisons with 15 other methods, please refer to the full paper.

Citation

If you use LSH-DynED in your research, please cite our paper:

@article{Abadifard2025LSHDynED,
  title={LSH-DynED: A Dynamic Ensemble Framework with LSH-Based Undersampling for Evolving Multi-Class Imbalanced Classification},
  author={Soheil Abadifard and Fazli Can},
  year={2025},
  eprint={2506.20041},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2506.20041},
  DOI={10.48550/ARXIV.2506.20041}
}
