Benchmarking Intel® Extension for Scikit-learn (scikit-learn-intelex) vs. CuPy-accelerated methods for machine learning workflows.
This project benchmarks several machine learning algorithms under two acceleration approaches, Intel's scikit-learn-intelex (CPU) and CuPy (GPU, running on NVIDIA CUDA), across multiple datasets. The goal is to determine which optimization, CPU-based (Intelex) or GPU-based (CuPy), performs better for specific algorithms and dataset types. The work is part of a college research project.
- scikit-learn-intelex: Optimizes select scikit-learn estimators using low-level CPU instructions and multi-threading for faster performance.
- CuPy: GPU array library with a NumPy-compatible API; enables GPU acceleration for array computations.
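As a rough illustration of how these two libraries are usually switched on, the sketch below patches scikit-learn through `sklearnex.patch_sklearn()` and copies an array to the GPU with `cupy.asarray`. The helper names `enable_intelex` and `to_gpu` are hypothetical and are not taken from this repository. Both fall back quietly when the library or a usable GPU is missing.

```python
import importlib.util


def enable_intelex() -> bool:
    """Patch scikit-learn with Intelex if available; return True when patched."""
    if importlib.util.find_spec("sklearnex") is None:
        return False  # scikit-learn-intelex is not installed
    try:
        from sklearnex import patch_sklearn

        # Call this before importing any sklearn estimator so that the
        # accelerated implementations are picked up.
        patch_sklearn()
        return True
    except Exception:
        return False


def to_gpu(array):
    """Copy `array` to the GPU with CuPy, or return it unchanged on failure."""
    try:
        import cupy as cp

        return cp.asarray(array)
    except Exception:
        # CuPy missing, or no usable CUDA device/driver: stay on the CPU.
        return array


if __name__ == "__main__":
    print("Intelex patched:", enable_intelex())
    print("Array type:", type(to_gpu([1.0, 2.0, 3.0])).__name__)
```

Note that moving data to the GPU does not by itself accelerate a scikit-learn estimator; only code paths that accept CuPy arrays benefit from the transfer.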
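Install the required dependencies. CuPy wheels are published per CUDA version, so pick the one that matches your installed toolkit (for example cupy-cuda12x for CUDA 12.x):
pip install scikit-learn scikit-learn-intelex cupy-cuda12x xgboost lightgbm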
If you encounter missing dependencies, also install:
pip install pandas matplotlib numpy
Note: Some datasets or algorithms may require additional libraries. The warnings module used to suppress warnings is part of the Python standard library and needs no installation.
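To check an environment before running the benchmarks, a small script along these lines can report which packages are still missing. It is only a sketch: the function name `missing_modules` is hypothetical, and the module list simply mirrors the libraries named in this README, using their import names rather than their pip names.

```python
import importlib.util

# Import names, which differ from some pip names
# (scikit-learn -> sklearn, scikit-learn-intelex -> sklearnex).
REQUIRED_MODULES = [
    "sklearn", "sklearnex", "cupy", "xgboost", "lightgbm",
    "pandas", "matplotlib", "numpy",
]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the import names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("Missing:", ", ".join(missing) if missing else "none")
```

The check only confirms that a module can be located; it does not verify that CuPy can actually reach a CUDA device.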
- Preprocessing: Each dataset was cleaned and preprocessed to fit the requirements of the algorithms and libraries.
- Model Selection: The following ML models were tested:
- LightGBM
- XGBoost
- Logistic Regression
- Random Forest
- AdaBoost
- Multi-Layer Perceptron (MLP)
- Implementation Variants:
- BareBones: Standard scikit-learn implementation.
- Intelex: scikit-learn accelerated with scikit-learn-intelex.
- CuPy: Data arrays and supported models accelerated with CuPy (where possible).
- Benchmarking: Each model was run on all datasets, measuring wall-clock training time.
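The wall-clock measurement described above could be sketched as follows. This is not the repository's actual harness: the name `time_fit`, the `repeats` count, the use of the median, and the optional `sync` hook are all assumptions made for illustration.

```python
import time
from statistics import median


def time_fit(make_model, X, y, repeats=3, sync=None):
    """Return the median wall-clock seconds taken by model.fit(X, y).

    make_model: zero-argument callable returning a fresh, unfitted model.
    sync: optional callable run before stopping the clock, e.g. a GPU
          synchronisation so that asynchronous kernels are included.
    """
    timings = []
    for _ in range(repeats):
        model = make_model()  # fresh model so runs do not share state
        start = time.perf_counter()
        model.fit(X, y)
        if sync is not None:
            sync()
        timings.append(time.perf_counter() - start)
    return median(timings)
```

With scikit-learn installed, a call might look like `time_fit(lambda: LogisticRegression(max_iter=1000), X, y)`. For GPU runs, passing something like `sync=cupy.cuda.Stream.null.synchronize` is one way to make sure queued GPU work has finished before the timer stops.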
- CPU: Intel Core i5 (12th Gen)
- GPU: NVIDIA GeForce GTX 1660 Super
- Storage: SSD
- RAM: 32 GB
Name | Size | Features | Notes |
---|---|---|---|
Wiretap | 7 GB | 115 numeric | Largest |
Loan Prediction | 20 MB | 12 categorical | Medium |
Student Performance | 41 KB | 10 numeric | Smallest |
Diabetes | 636 KB | 10 numeric | Small |
Cyberbullying Detection | 3.67 MB | 7 text/cat. | Medium, text-heavy |
- Small datasets (Student Performance, Diabetes): Most models execute in under 1 second (except MLP).
- Medium datasets (Loan Prediction, Cyberbullying): Performance varies widely by model and implementation. Some models (MLP) are outliers in runtime.
- Large datasets (Wiretap): Training can take several minutes, especially with Random Forest, AdaBoost, and MLP.
- LightGBM: Fastest across all datasets.
- XGBoost: Consistently fast, second only to LightGBM.
- Logistic Regression: Efficient on small data, but can spike in runtime for certain medium datasets.
- Random Forest & AdaBoost: Slow on large datasets; can be impractical for real-time use.
- MLP: Slowest, especially on large/medium datasets.
- Intelex & CuPy: Both can provide major speed-ups, but the effect depends on the dataset and model.
- For small datasets: Acceleration is noticeable, but even non-accelerated models are fast.
- For large datasets: Acceleration is crucial for practical runtimes, but not all models benefit equally.
- Note: Some models or array conversions can cause CuPy to worsen performance (e.g., Logistic Regression on Loan Prediction).
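One way to make such comparisons explicit is to express each accelerated run as a speed-up ratio over the plain scikit-learn baseline and flag the cases where acceleration made things worse. The sketch below assumes timings are stored as dictionaries keyed by model name; the function names `speedups` and `regressions` are hypothetical.

```python
def speedups(baseline_seconds, accelerated_seconds):
    """Map each model name to baseline_time / accelerated_time.

    Values above 1.0 mean the accelerated variant was faster; values below
    1.0 mean it was slower than the plain scikit-learn run.
    """
    return {
        model: baseline_seconds[model] / accelerated_seconds[model]
        for model in baseline_seconds
        if model in accelerated_seconds and accelerated_seconds[model] > 0
    }


def regressions(ratios):
    """Return the models whose accelerated run was slower (ratio < 1.0)."""
    return sorted(model for model, ratio in ratios.items() if ratio < 1.0)
```

Feeding in the BareBones timings as the baseline and the Intelex or CuPy timings as the accelerated set gives one ratio per model, and `regressions` lists the models where the accelerated variant lost.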
- LightGBM is the most robust and efficient overall.
- CuPy and Intelex accelerations are not universally beneficial, so always profile for your use case.
- Dataset size and model choice both impact execution time, but model/implementation often matters more than raw size.
Feel free to fork this repository, create a new branch, and submit a pull request with your changes. Please document your changes well.
This project is licensed under the MIT License - see the LICENSE file for details.
Diabetes Dataset (Version V1). (n.d.). [Dataset]. Kaggle. https://www.kaggle.com/datasets/asinow/diabetes-dataset/data
Maree, A. (2025). Student Performance Prediction (Version V2) [Dataset]. Kaggle. https://www.kaggle.com/datasets/amrmaree/student-performance-prediction
Mirsky, Y., Doitshman, T., Elovici, Y., & Shabtai, A. (2018). Kitsune: An Ensemble of Autoencoders for Online Network Intrusion Detection. In The Network and Distributed System Security Symposium (NDSS) 2018.
CyberBullying Detection Dataset (Version V2). (n.d.). [Dataset]. Kaggle. https://www.kaggle.com/datasets/sayankr007/cyber-bullying-data-for-multi-label-classification?select=final_hateXplain.csv
Star, Ethical. (2025). Loan Prediction [Dataset]. Kaggle. https://www.kaggle.com/datasets/ethicalstar/loan-prediction/code