In the digital age, spam detection is of paramount importance. This project aims to conduct an in-depth analysis of spam detection methods and their performance.
The dataset, named 'Spambase', is sourced from the UCI Machine Learning Repository. It contains various attributes of emails, including word frequencies, character frequencies, and more.
Hopkins, Mark, Reeber, Erik, Forman, George, and Suermondt, Jaap. (1999). Spambase. UCI Machine Learning Repository. https://doi.org/10.24432/C53G6X.
Before model training, the dataset underwent:
- Visualization: To understand the distribution of various features.
- Feature Selection: To identify the most relevant features.
- Transformation and Preprocessing: Including scaling and handling missing values.
- Feature Engineering: Creating new features to provide more information to the models.
Several machine learning algorithms were experimented with:
- Decision Trees: Providing clear visualization of decisions.
- Random Forests: An ensemble method aggregating predictions from multiple decision trees.
- Naïve Bayes: Suited for high-dimensional datasets.
- Multilayer ANNs: Neural networks capturing complex relationships in the data.
Models were evaluated using various metrics:
- Accuracy: Correctly predicted instances ratio.
- Precision: Correctly predicted positive observations to total predicted positives ratio.
- Recall: Correctly predicted positive observations to all actual positives ratio.
- F1-Score: Weighted average of Precision and Recall.
The project successfully demonstrated the capabilities of various machine learning algorithms in spam detection. The Random Forest algorithm emerged as the most balanced in terms of precision and recall. Further tuning and regular retraining are recommended for real-world deployment.
- Hopkins, Mark, Reeber, Erik, Forman, George, and Suermondt, Jaap. (1999). Spambase. UCI Machine Learning Repository. https://doi.org/10.24432/C53G6X.