Authors: Ci Xu, Danish Karlin Isa, Jia Quan (Joseph) Lim, Lik Hang (Alex) Wong
This project is part of the coursework for DSCI 591 Capstone Project, a Master of Data Science course at the University of British Columbia. This project is done in partnership with UBC Cybersecurity (the Partner).
Currently, the Partner manually reviews suspicious emails reported by users – a process made difficult by the high volume of reported emails and the time sensitivity of addressing these threats.
This project aims to support the Partner in automating the existing review-to-recall workflow. Our solution will enhance threat detection speed, reduce manual workload, and strengthen UBC’s cybersecurity.
The deliverables include a containerised web service consisting of a trained machine learning (ML) pipeline that classifies reported emails, a dashboard for performance monitoring, and comprehensive documentation.
PhishSense-v2 contains an email classifier with a stacked architecture implemented using scikit-learn’s StackingClassifier. It comprises four XGBClassifier models whose predictions are combined by a final SVC to produce the final classification. The subject, body and extracted features (header and body) are passed as input, and the output is the probability of the email being benign and the probability of it being malicious.
flowchart TD
A1[Text] --> A2[Subject]
A1 --> A3[Body]
B1[Non-text features] --> B2[Header]
B1 --> B3[Body]
A2 --> X1[XGBClassifier-1]
A3 --> X2[XGBClassifier-2]
B2 --> X3[XGBClassifier-3]
B3 --> X4[XGBClassifier-4]
X1 --> SVC[SVC]
X2 --> SVC
X3 --> SVC
X4 --> SVC
SVC --> Output[Predicted probabilities]
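The stacked architecture above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the project's actual training code: the column names and toy data are hypothetical, and GradientBoostingClassifier is used as a stand-in for XGBClassifier so that the sketch needs only scikit-learn, NumPy and pandas.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def branch(transformer, columns):
    """Base learner that only sees its own slice of the input frame."""
    select = ColumnTransformer([("slice", transformer, columns)])
    # Stand-in booster (assumption); xgboost.XGBClassifier() would take
    # this place to match the architecture described in this README.
    booster = GradientBoostingClassifier(n_estimators=20, random_state=0)
    return make_pipeline(select, booster)


stack = StackingClassifier(
    estimators=[
        ("subject", branch(TfidfVectorizer(), "subject")),
        ("body", branch(TfidfVectorizer(), "body")),
        ("header_features", branch("passthrough", ["n_received", "spf_pass"])),
        ("body_features", branch("passthrough", ["n_links", "n_typos"])),
    ],
    # probability=True lets the final SVC return class probabilities.
    final_estimator=SVC(probability=True, random_state=0),
)

# Synthetic toy data with hypothetical column names, only to exercise the sketch.
rng = np.random.default_rng(0)
y = np.tile([0, 1], 30)  # 0 = benign, 1 = malicious
emails = pd.DataFrame({
    "subject": np.where(y == 1, "urgent verify your account", "minutes from team meeting"),
    "body": np.where(y == 1, "click the link to reset your password",
                     "see the attached agenda for monday"),
    "n_received": rng.integers(1, 6, size=y.size),
    "spf_pass": rng.integers(0, 2, size=y.size),
    "n_links": rng.integers(0, 4, size=y.size) + 3 * y,
    "n_typos": rng.integers(0, 3, size=y.size),
})

stack.fit(emails, y)
# One row per email; columns follow stack.classes_ (benign, malicious).
proba = stack.predict_proba(emails.head(3))
print(proba.round(3))
```

Wrapping each base learner in its own ColumnTransformer is one way to route the subject, body, header features and body features to separate models, since StackingClassifier otherwise passes the full input to every estimator.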
The following diagram describes the machine learning pipeline in PhishSense-v2.
flowchart TD
A["Email (in eml file format)"] --> B
B["Extract metadata from email"] --> C
C["Generate feature set from metadata"] -->D
D["Predict label with probabilities from feature set"]
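The first two stages of this pipeline can be illustrated with Python's standard email package. The sketch below is an assumption-laden example rather than the project's implementation: the function names and the three features are hypothetical and far simpler than the feature set PhishSense-v2 generates.

```python
import re
from email import message_from_bytes, policy


def extract_metadata(raw: bytes) -> dict:
    """Stage 1: parse the raw eml bytes into a small metadata dict."""
    msg = message_from_bytes(raw, policy=policy.default)
    body_part = msg.get_body(preferencelist=("plain", "html"))
    return {
        "subject": str(msg["Subject"] or ""),
        "sender": str(msg["From"] or ""),
        "received": msg.get_all("Received", []),
        "body": body_part.get_content() if body_part is not None else "",
    }


def generate_features(metadata: dict) -> dict:
    """Stage 2: derive numeric features (hypothetical examples) from metadata."""
    return {
        "subject_length": len(metadata["subject"]),
        "received_hops": len(metadata["received"]),
        "url_count": len(re.findall(r"https?://", metadata["body"])),
    }


# A tiny hand-written message standing in for a reported eml file.
SAMPLE = (
    b"From: Alice <alice@example.com>\r\n"
    b"Subject: Verify your account\r\n"
    b"Received: from mail.example.com by mx.example.org; Mon, 2 Jun 2025 10:00:00 +0000\r\n"
    b"Received: from localhost by mail.example.com; Mon, 2 Jun 2025 09:59:58 +0000\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Please sign in at http://example.com/login today.\r\n"
)

features = generate_features(extract_metadata(SAMPLE))
print(features)
```

In the real pipeline, a feature set like this would then be passed to the trained model to obtain the predicted label and probabilities.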
The following diagram describes how PhishSense-v2 can be integrated into the current workflow.
flowchart TD
A["User reports a suspicious email"] --> B
B["IT ticketing system receives the report"] --> C
C["IT ticketing system automatically makes an HTTP POST request to PhishSense-v2 with reported email attached"] -->D
D["PhishSense-v2 generates predicted label with associated probabilities in JSON format"] --> E
E["PhishSense-v2 makes an HTTP POST request to the IT ticketing system with the predictions in JSON format attached"] --> F
F["IT ticketing system receives the prediction data and automatically populates the predicted label and associated probabilities"]
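The callback step of this workflow, where predictions are posted back to the IT ticketing system, might look like the sketch below. The endpoint URL, the ticket_id field and the label field are all hypothetical, since the ticketing system's API is not described here. The request is only constructed, not sent.

```python
import json
from urllib.request import Request

# Hypothetical endpoint: the real ticketing-system URL is not given in this README.
TICKETING_URL = "https://ticketing.example.com/api/predictions"


def build_callback(ticket_id: str, filename: str, benign: float, malicious: float) -> Request:
    """Prepare (but do not send) the POST that carries predictions back."""
    payload = {
        "ticket_id": ticket_id,  # hypothetical field
        "filename": filename,
        "prediction": {
            # Hypothetical label rule: pick the class with the higher probability.
            "label": "malicious" if malicious >= benign else "benign",
            "probability": {"benign": benign, "malicious": malicious},
        },
    }
    return Request(
        TICKETING_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_callback("TICKET-123", "0_message.eml", 0.7422, 0.2577)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the prepared request to urllib.request.urlopen would send it once a real endpoint is available.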
- Miniforge for environment and package management
- Quarto with TinyTeX and Chromium or Google Chrome for rendering reports
- LanguageTool and OpenJDK for generating features from emails
- All other dependencies listed in environment.yml
The repository is structured as follows. Please create any missing folders to ensure that script outputs can be saved.
- data/: Contains raw and processed datasets used for training and evaluation. This also includes the train-test split of the dataset.
- img/: Stores images used for documentation or reporting.
- model/: Contains the saved, trained PhishSense-v2 model.
- notebooks/: Contains Jupyter notebooks used for exploratory data analysis, prototyping, and documenting experiments throughout the project.
- report/: Contains project documentation, such as the proposal and final report.
- results/: Holds outputs from scripts, such as model selection, cross-validation and model evaluation results.
- scripts/: Contains utility scripts for data processing, feature extraction, model training, prediction, and other automation tasks used throughout the project.

  | Script | Description |
  | --- | --- |
  | base_classifier_selection.py | Selects and evaluates different base classifiers to determine which models perform best as components in the stacking ensemble. |
  | build_features_df.py | Extracts and processes features from email data, generating a DataFrame used for model training and prediction. |
  | build_metadata_df.py | Extracts metadata from raw email files and compiles it into a structured DataFrame for downstream processing. |
  | generate_email_list.py | Scans the dataset directory and generates a list of email file paths for use in model training and model testing. |
  | generate_phishsense_metrics.py | Computes and outputs evaluation metrics for the current PhishSense-1 model. |
  | get_BERT_results.py | Evaluates BERT models and saves the results. |
  | model_architecture_selection.py | Compares different stacking architectures to identify the most effective structure for the email classifier. |
  | PhishSense-v2_results.py | Aggregates and summarizes model evaluation results, such as performance metrics and confusion matrices. |
  | predict_email.py | Loads a trained model and predicts whether a given email is benign or malicious, outputting the results. |
  | stacking_final_estimator_selection.py | Selects and evaluates the final estimator (meta-classifier) used in the stacking ensemble. |
  | train_model.py | Trains the PhishSense-v2 model using the prepared dataset and saves the trained model for later use. |

- src/: Contains the source code for the PhishSense-v2 application, including core modules for data processing, feature extraction, model training, prediction, and API service implementation.

  | Source Code | Description |
  | --- | --- |
  | app.py | Main entry point for the PhishSense-v2 API service; handles HTTP requests and routes predictions. |
  | content_features.py | Extracts and processes features from the email content. |
  | data_validation.py | Validates input data and ensures correct formatting before processing. |
  | feature_generation.py | Combines and generates the full feature set required for model prediction. |
  | header_features.py | Extracts relevant features from email headers. |
  | metadata_extraction.py | Parses and extracts metadata from raw email files. |
  | model_evaluation.py | Provides utilities for evaluating model performance and generating metrics. |
  | prediction.py | Loads the trained model and performs predictions on processed email data. |
The instructions in this section assume that this project will be run on the provided virtual machine (VM).
They also assume that the dataset of eml files resides on the VM at /data/dataset/ with the following directory structure:
/
└── data
    └── dataset
        ├── CEO_Fraud_-_Gift_Cards
        ├── CEO_Fraud_-_Payroll_Update
        ├── CEO_Fraud_-_Wire_Transfers
        ├── Legitimate_Email_Confirmed
        ├── Phishing_-_3rd_Party
        ├── Phishing_-_Outbound
        ├── Phishing_-_UBC
        ├── Phishing_UBC_-_Outbound
        ├── Reply_Chain_Attack
        ├── Self-Phishing
        ├── Spam_-_False_Positives
        ├── Spam_-_Inbound
        ├── Spam_-_Outbound
        └── Spearphishing

WARNING: We advise against resetting the repository to its default state, as regenerating the required datasets and results is extremely time-consuming.
NOTE: There is a known issue with rendering the proposal and final report: Quarto is unable to render the PDF versions on the VM. To render the PDF versions, you may have to do so on your own Windows or macOS device.
To reset the repository to its default state, run the following commands:
make remove_proposal
make remove_train_test_split
make remove_sampled_dataset
make remove_full_dataset
make remove_cached_dataset # this removes cached features that require an Internet and LanguageTool server connection
make remove_results
make remove_model
make remove_final-report
- Clone this GitHub repository onto the VM.
- Open a command-line interface (CLI) and navigate to the root directory for this project.
- Install and activate the environment using the following CLI commands:
  conda env create -f environment.yml
  conda activate phishsense-v2
- Create a folder named dependencies in the root directory.
- Download and install the latest snapshot of LanguageTool according to the installation instructions.
  - Unzip the downloaded folder inside the dependencies folder.
  - During installation of the fastText language identification model, download the lid.176.bin model.
  - Remember to create the server.properties file as per the installation instructions.
- Once LanguageTool is installed, navigate to the directory that LanguageTool is installed in and start the LanguageTool HTTP server using the following CLI commands:
  cd PATH_TO_LANGUAGETOOL_DIRECTORY
  java -cp languagetool-server.jar org.languagetool.server.HTTPServer --config server.properties --port 8081 --allow-origin
  Note that PhishSense-v2 sends HTTP requests to the LanguageTool server using localhost at port 8081. If you wish to use a different port number, or to host the LanguageTool server on a host other than localhost, you must specify the URL within scripts/build_features_df.py.
  We recommend using screen to spin up a persistent instance of the LanguageTool HTTP server.
- Split the dataset into train, test and downsampled train sets:
  make train_test_split
- Generate the data required for training:
  make full_dataset
  NOTE: This step requires an Internet connection and a running LanguageTool HTTP server. It is time-consuming because it checks the name servers for every email. We recommend prefixing the command with nohup so that the script keeps running even when the CLI is closed or the user is logged out.
- Once the data is successfully generated, generate the model object:
  make model
- The model will be available in the model subdirectory as a pickle file (phishsense-v2.pkl).
Run the following CLI commands:
conda activate phishsense-v2
python scripts/predict_email.py --email_path PATH_TO_EML_FILE --model_path PATH_TO_MODEL_PKL --show y
Note that you must have generated phishsense-v2.pkl before running the commands above.
Run the following CLI command:
make proposal
The proposal will be available in the report/proposal/ subdirectory as an HTML file and a PDF file.
- Split the dataset into train, test and downsampled train sets (you may skip this if you did it when generating the model):
  make train_test_split
- Generate the required data:
  make sampled_dataset
  make full_dataset # you may skip this if you did it when generating the model
  NOTE: This step requires an Internet connection and a running LanguageTool HTTP server. It is time-consuming because it checks the name servers for every email. We recommend prefixing each command with nohup so that the script keeps running even when the CLI is closed or the user is logged out.
- Generate the model object for PhishSense-v2:
  make model
- Generate the results required for the final report:
  make results
- Render the final report:
  make final_report

The final report will be available in the report/final-report/ subdirectory as an HTML file and a PDF file.
While viewing the proposal and final report in PDF is possible, we recommend viewing the proposal and final report in HTML format:
- Open this repository with all the generated results in VSCode.
- Install the Live Preview extension in VSCode.
- Right click on the html file and select Show Preview.
- You may either view the report in the web browser within VSCode, or copy the link and view it in your web browser.
The following instructions describe how to host the PhishSense-v2 API.
If the container image is not present, run the following command in the terminal from the root directory of this repository:
sudo TMPDIR=/tmp podman build --network=host --tag PhishSense-v2 .
There are two options for running the service from the root directory of this repository.
- Run the following CLI command:
  sudo podman run --network=host -v ./model:/app/model:ro localhost/PhishSense-v2
  Ensure that the required volume mounts are passed as arguments. The mounted volumes are directories that exist on the VM.
- Run the following CLI command:
  sudo podman-compose up
POST /predict
The API accepts two input formats:
- An eml file, or multiple eml files passed with repeated -F flags:
  curl -X POST http://localhost:5000/predict -F "email_file=@PATH_TO_EML_FILE" # for a single eml file
  curl -X POST http://localhost:5000/predict \
    -F "email_file=@PATH_TO_EML_FILE_1" \
    -F "email_file=@PATH_TO_EML_FILE_2" # for multiple eml files
- A zip file that contains multiple eml files:
  curl -X POST http://localhost:5000/predict -F "email_zip=@PATH_TO_ZIP_FILE"
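The multi-file upload can also be reproduced from Python using only the standard library. The sketch below builds the same kind of multipart/form-data request that curl's -F flag produces. The helper name is hypothetical, and the URL and the email_file field name are taken from the curl examples above. The request is only constructed here, so nothing is sent until it is passed to urllib.request.urlopen.

```python
import uuid
from urllib.request import Request


def build_predict_request(files, url="http://localhost:5000/predict", field="email_file"):
    """Build a multipart/form-data POST from (filename, bytes) pairs.

    Hypothetical helper: each pair becomes one form part under the same
    field name, mirroring repeated `-F "email_file=@..."` flags in curl.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, content in files:
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{field}"; filename="{name}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode("utf-8")
        parts.append(header + content + b"\r\n")
    # The closing boundary marks the end of the form data.
    body = b"".join(parts) + f"--{boundary}--\r\n".encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


# Placeholder contents; in practice these would be read from eml files on disk.
req = build_predict_request([
    ("first.eml", b"Subject: one\r\n\r\nhello\r\n"),
    ("second.eml", b"Subject: two\r\n\r\nworld\r\n"),
])
print(req.get_method(), req.full_url, len(req.data), "bytes")
```

With the service running, `json.load(urllib.request.urlopen(req))` would be one way to read the predictions.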
The response is a JSON array with one entry per email. Each entry contains the probability that the email is benign and the probability that it is malicious (the two sum to 1, up to rounding).
[
{
"filename":"0_message.eml",
"prediction":{
"probability":{
"benign":0.7422,
"malicious":0.2577
}
}
},
...
]
© 2025 The University of British Columbia.