This repository implements an automated ML pipeline for analyzing malicious URLs. The code is refactored for ease of use, modularity, and automation. The project includes:
- Automated Data Preprocessing:
  Loads, validates, and preprocesses the malicious URL dataset.
- URL Feature Extraction:
  Modularized functions to extract enhanced features from each URL (see the sketch below).
- Multi-Model Training & Evaluation:
  Trains and evaluates several machine learning algorithms (LDA, Logistic Regression, SVM, Random Forest) concurrently.
- Automated Visualization & Reporting:
  Creates and saves plots for feature correlations, model confusion matrices, and more.
- Energy & CO₂ Tracking:
  Optionally tracks and reports energy and CO₂ emissions using CodeCarbon.
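The feature-extraction code itself lives in the repository; as a rough illustration only, a minimal sketch of the kind of lexical URL features such a pipeline typically derives might look like the following (the function and feature names here are illustrative, not the repository's actual API):

```python
from urllib.parse import urlparse
import re

def extract_url_features(url: str) -> dict:
    """Illustrative lexical features for a single URL (not the repo's exact feature set)."""
    parsed = urlparse(url if "://" in url else "http://" + url)
    return {
        "url_length": len(url),
        "hostname_length": len(parsed.netloc),
        "path_length": len(parsed.path),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_special_chars": len(re.findall(r"[?&=@_%-]", url)),
        "has_ip_address": int(bool(re.match(r"^\d{1,3}(\.\d{1,3}){3}$", parsed.netloc))),
        "uses_https": int(parsed.scheme == "https"),
    }

# Example:
# extract_url_features("http://192.168.0.1/login?user=admin")
```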
Make sure you have Conda installed before proceeding. The following packages are required:
- Python 3.8 (or later)
- pandas
- numpy
- matplotlib
- seaborn
- scikit-learn
- joblib (for HPC only)
- codecarbon (installed via pip)
Clone this repository to your local machine:
git clone https://github.com/dfromond3/Malicious-URL-HPC.git
cd Malicious-URL-HPC
Create a new Conda environment with the desired Python version:
conda create -n ml_env python=3.10
conda activate ml_env
Using Conda for most packages:
conda install numpy scipy scikit-learn matplotlib pandas seaborn -c conda-forge
Using pip for CodeCarbon:
pip install codecarbon
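CodeCarbon is only needed if you want the optional energy and CO₂ reporting. As a rough sketch of how such tracking is typically wired in, using CodeCarbon's standard start/stop pattern (not necessarily the exact code in intensive_ml_model.py):

```python
from codecarbon import EmissionsTracker

tracker = EmissionsTracker(project_name="malicious-url-pipeline")  # writes emissions.csv by default
tracker.start()
try:
    run_training()  # placeholder for the pipeline's training/evaluation work
finally:
    emissions_kg = tracker.stop()  # estimated CO₂-equivalent emissions in kg
    print(f"Estimated emissions: {emissions_kg:.6f} kg CO₂eq")
```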
Download the Kaggle Dataset and save it as malicious_phish.csv in the same folder as the Python script.
To run the script, execute:
python intensive_ml_model.py
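If the script cannot find or parse the dataset, a quick sanity check of the CSV (outside the repository's code) can help. The Kaggle file typically has `url` and `type` columns, but verify against your copy:

```python
import pandas as pd

df = pd.read_csv("malicious_phish.csv")
print(df.shape)
print(df.columns.tolist())        # expected something like ['url', 'type']
print(df["type"].value_counts())  # class balance across the URL categories
```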
Clone this repository to your desired HPC directory:
git clone https://github.com/dfromond3/Malicious-URL-HPC.git
cd Malicious-URL-HPC
Create a new Conda environment with the desired Python version. Because Conda environments can be large, you may want to relocate ~/.conda to larger project storage and symlink it back to your home directory (for example):
mv ~/.conda /storage/ice1/1/7/dfromond3/.conda
ln -s /storage/ice1/1/7/dfromond3/.conda ~/.conda
module load anaconda3
conda create --name hpc_env python=3.8 -y
conda activate hpc_env
Using Conda for most packages:
conda install -c conda-forge pandas numpy matplotlib seaborn scikit-learn joblib -y
Using pip for CodeCarbon:
pip install codecarbon
Download the Kaggle Dataset and save it as malicious_phish.csv in the same folder as the Python script.
First, ensure a job has been allocated. To request an interactive job, execute one of the following (for example):
salloc -N1 --ntasks-per-node=4 -t1:00:00
salloc -N2 --ntasks-per-node=4 -t1:00:00
salloc --gres=gpu:H100:1 --ntasks-per-node=3
salloc --gres=gpu:H100:2 --ntasks-per-node=1
For documentation on how to start a job, please refer to this webpage.
To run the script after the job has started:
srun python intensive_ml_model_hpc.py
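The HPC script trains the four models concurrently; the exact implementation is in intensive_ml_model_hpc.py. As an illustration only, a minimal sketch of that idea using scikit-learn with joblib (model settings here are assumptions, not the script's actual configuration) could look like:

```python
from joblib import Parallel, delayed
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

MODELS = {
    "LDA": LinearDiscriminantAnalysis(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVM": LinearSVC(),  # linear-kernel SVM variant, used here for scalability
    "RandomForest": RandomForestClassifier(n_estimators=100, n_jobs=1),
}

def fit_and_score(name, model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    return name, accuracy_score(y_test, model.predict(X_test))

def train_all(X_train, y_train, X_test, y_test):
    # Train the four models in parallel worker processes;
    # each result is a (model_name, test_accuracy) pair.
    results = Parallel(n_jobs=len(MODELS))(
        delayed(fit_and_score)(name, model, X_train, y_train, X_test, y_test)
        for name, model in MODELS.items()
    )
    return dict(results)
```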