This repository contains the Python code and data to reproduce the results presented in the paper "A pipeline and comparative study of 12 machine learning classifiers for text classification".
The following steps are required to run the code:
- Python 3.6.x is required; the code explicitly checks the version before it continues.
- A Jupyter notebook server is required.
- The Enron spam corpus is the dataset used for this paper; the compressed tar archives containing the spam emails are included.
- Antivirus (AV) applications may flag some emails as malicious, as a virus, or as a scam. This is expected; restore any affected files where necessary.
- Ensure all pip dependencies listed in requirements.txt are installed.
- Run through the steps laid out in the notebook.
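
The Python version requirement listed above can be illustrated with a small check like the one below. This is only a minimal sketch and not necessarily the check used in the notebook: the names `REQUIRED`, `is_supported`, and `require_python` are hypothetical, and the sketch assumes that "3.6.x" means any patch release of Python 3.6.

```python
import sys

# Assumption: "3.6.x" means major version 3, minor version 6, any patch release.
REQUIRED = (3, 6)


def is_supported(version_info=sys.version_info, required=REQUIRED):
    """Return True when the major.minor part of version_info equals required."""
    return tuple(version_info[:2]) == tuple(required)


def require_python(version_info=sys.version_info, required=REQUIRED):
    """Raise RuntimeError when the interpreter is not the required version."""
    if not is_supported(version_info, required):
        found = ".".join(str(part) for part in version_info[:3])
        raise RuntimeError(
            "Python {}.{}.x is required, found {}".format(
                required[0], required[1], found
            )
        )
```

Calling `require_python()` in the first notebook cell would stop execution early on an unsupported interpreter instead of failing later with a less obvious error.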