To train models, run:

`/bin/bash ./train.sh`

To test models, run:

`/bin/bash ./test.sh`

To use your own train/test data, manually modify the CSV files at `./Data/train.csv` and `./Data/test.csv`.
- Increased use of online employment websites has led to a rise in fraudulent job postings.
- Fraudulent job postings have two main goals:
  - Acquire confidential personal information.
  - Solicit unlawful payments.
- GOAL: Develop a Natural Language Processing (NLP) model that can detect fake job postings based on the textual description of the jobs.
- Utilize different kinds of machine learning models and algorithms to identify patterns or anomalies.
- Employment Scam Aegean Dataset (EMSCAD, http://emscad.samos.aegean.gr/)
  - Published by the University of the Aegean's Laboratory of Information & Communication Systems Security.
  - Publicly available dataset containing 17,880 real-life online job postings.
  - 17,014 legitimate and 866 fraudulent job postings (see the snippet below).
  - EMSCAD records were manually annotated and classified into the two categories between 2012 and 2014.
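A quick way to load the raw export and check this class balance (a minimal sketch; the label column name `fraudulent` and the `t`/`f` values follow the target mapping described below and are assumptions about the CSV header):

```python
import pandas as pd

# Load the raw EMSCAD export (path taken from the repo layout below).
df = pd.read_csv("./Data/emscad_v1.csv")

# The label column name is an assumption; adjust to the actual header.
print(df["fraudulent"].value_counts())
# Expected roughly: f -> 17014, t -> 866
```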
- Based on HTML/text features.
- Concatenate string and nominal features into a single text field if one does not already exist (see the sketch after this list).
- Remove:
  - Stop words
  - Punctuation
  - etc.
- Keep:
  - Numbers
  - Text
  - etc.
- Target: Fraudulent
  - T == 1
  - F == 0
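A minimal sketch of this cleaning pipeline, assuming illustrative column names such as `title` and `description` (the repo's actual logic lives in `Code/00_Data_processing.py`):

```python
import string

import pandas as pd
from nltk.corpus import stopwords  # requires: nltk.download("stopwords")

df = pd.read_csv("./Data/emscad_v1.csv")

# Concatenate string/nominal fields into one text column if it does not
# already exist (these field names are illustrative, not confirmed).
text_cols = ["title", "company_profile", "description", "requirements"]
if "text" not in df.columns:
    df["text"] = df[text_cols].fillna("").astype(str).agg(" ".join, axis=1)

stop = set(stopwords.words("english"))

def clean(text: str) -> str:
    # Strip punctuation, keep numbers and words, drop stop words.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in stop)

df["text"] = df["text"].map(clean)

# Map the target: t == 1 (fraudulent), f == 0 (legitimate).
df["fraudulent"] = df["fraudulent"].map({"t": 1, "f": 0})
```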
Two approaches for Bag of Words to generate features:
- Custom top-n-words model
  - Uses from the top 50 up to the top 1,200 words by frequency.
- CountVectorizer
  - n-grams (1, 2, 3, 4)
  - Includes 1-, 2-, 3-, and 4-grams respectively, as well as combinations of different grams.

The features are then fed to Logistic Regression and Random Forest models.
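A hedged sketch of the CountVectorizer route on the processed split (the `text` and `fraudulent` column names are assumptions; `ngram_range` and `max_features` are illustrative, not the tuned values):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

train = pd.read_csv("./Data/train.csv")
test = pd.read_csv("./Data/test.csv")

# Bag of words over unigrams and bigrams; ngram_range=(1, 4) would also
# include the 3- and 4-grams described above.
vec = CountVectorizer(ngram_range=(1, 2), max_features=1200)
X_train = vec.fit_transform(train["text"])
X_test = vec.transform(test["text"])

for model in (LogisticRegression(max_iter=1000), RandomForestClassifier()):
    model.fit(X_train, train["fraudulent"])
    print(type(model).__name__)
    print(classification_report(test["fraudulent"], model.predict(X_test)))
```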
- A typical 3-layer LSTM model (sketched below).
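A minimal Keras sketch of such a stacked 3-layer LSTM (all hyperparameters are illustrative placeholders, not the tuned values recorded in `Data/Model_Compare_LSTM.xlsx`):

```python
import tensorflow as tf

vocab_size, max_len = 20000, 200  # illustrative values

model = tf.keras.Sequential([
    tf.keras.Input(shape=(max_len,)),
    tf.keras.layers.Embedding(vocab_size, 128),
    tf.keras.layers.LSTM(64, return_sequences=True),  # layer 1
    tf.keras.layers.LSTM(64, return_sequences=True),  # layer 2
    tf.keras.layers.LSTM(64),                         # layer 3
    tf.keras.layers.Dense(1, activation="sigmoid"),   # fraudulent vs. not
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```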
- Bidirectional Encoder Representations from Transformers (BERT)
  - Fine-tunes the pre-trained model from Google on our own dataset (sketched below).
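A sketch of the fine-tuning setup with Hugging Face `transformers` (the checkpoint name `bert-base-uncased`, the example posting, and all parameters here are assumptions; the repo's actual training code is in `Code/06_BERT.py`):

```python
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # legitimate vs. fraudulent
)

# Tokenize one posting; fine-tuning would update all weights on our labels.
inputs = tokenizer("Earn $5000/week from home, no experience needed!",
                   truncation=True, padding="max_length", max_length=128,
                   return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits)  # raw scores for the two classes
```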
- Traditional models work well on this case.
- Neural network models are good in some respects:
  - BERT reduces false positives.
  - LSTM reduces false negatives (illustrated below).
  - Neither can achieve both in a single model.
- A more complex structure is needed to capture more relationships, along with more data to train the model.
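The false-positive/false-negative trade-off above can be read off a confusion matrix; a small illustration with scikit-learn (the labels below are made up):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1]  # hypothetical predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
# Fewer FP -> higher precision (the BERT pattern);
# fewer FN -> higher recall (the LSTM pattern).
print(f"FP={fp}, FN={fn}")
```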
.
├── README.md
├── train.sh                     #Shell script for training
├── test.sh                      #Shell script for testing
├── Code/
│   ├── 00_Data_processing.py    #Process the raw data into usable form and do the train-test split
│   ├── 01_RandomForest.py
│   ├── 02_LogisticRegression.py
│   ├── 03_Bow.py
│   ├── 04_FeatureEnginnering.py
│   ├── 05_LSTM.py
│   ├── 06_BERT.py
│   ├── test.py
│   ├── train.py
│   └── Jupyter Notebooks/       #Raw Jupyter notebooks used to compose and tune models
│       ├── BERT.ipynb
│       ├── BoWs_feature.ipynb
│       ├── LSTM.ipynb
│       └── Traditional_Model.ipynb
├── Data/
│   ├── emscad_v1.csv            #Raw data
│   ├── text.csv                 #Processed usable data
│   ├── train.csv                #Train data after train-test split
│   ├── test.csv                 #Test data after train-test split
│   ├── train_balanced.csv       #Balanced train data
│   ├── test_balanced.csv        #Balanced test data
│   ├── Model_Compare_BERT.xlsx  #Records of BERT hyperparameter tuning
│   └── Model_Compare_LSTM.xlsx  #Records of LSTM hyperparameter tuning
├── Model/                       #Pre-trained models and temporary model storage
│   ├── BERT_MODEL               #Pre-trained and tuned BERT model
│   ├── LSTM_MODEL               #Pre-trained and tuned LSTM model
│   └── ...                      #Other temporary models generated during training
└── Asset




