Project Goal: Design an anomaly detection system capable of automatically catching fraudulent transactions.
Dataset: Synthetic Financial Datasets For Fraud Detection, generated by the PaySim mobile money simulator (https://www.kaggle.com/ntnu-testimon/paysim1)
- step - maps to a unit of time in the real world; here, 1 step is 1 hour.
- type - transaction type: CASH_IN, CASH_OUT, DEBIT, PAYMENT or TRANSFER
- amount - amount of the transaction in local currency
- nameOrig - customer who started the transaction
- oldbalanceOrg - initial balance before the transaction
- newbalanceOrig - customer's balance after the transaction.
- nameDest - recipient ID of the transaction.
- oldbalanceDest - initial recipient balance before the transaction.
- newbalanceDest - recipient's balance after the transaction.
- isFraud - marks a transaction as fraudulent (1) or non-fraudulent (0)
- isFlaggedFraud - flags illegal attempts to transfer more than 200.000 in a single transaction.
- Address of transactions
- Credit limit of card
- Salary
- How often person travels
An unsupervised approach uses the Random Cut Forest (RCF) algorithm to identify anomalies in the credit card transactions.
When using Random Cut Forest, an anomaly score with low values indicates that the data point is considered “normal” whereas high values indicate the presence of an anomaly. The definitions of “low” and “high” depend on the application, but common practice suggests that scores beyond three standard deviations from the mean score are considered anomalous.
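The three-standard-deviation rule of thumb can be sketched in a few lines. This is a minimal illustration using only the standard library: the function name `flag_anomalies` and the score values are made up for the example, and only the upper tail is checked because high scores are the ones that indicate anomalies.

```python
import statistics

def flag_anomalies(scores, num_std=3.0):
    """Return (cutoff, flags); flags[i] is True when scores[i] > mean + num_std * std."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)  # population standard deviation of the scores
    cutoff = mean + num_std * std
    return cutoff, [s > cutoff for s in scores]

# Hypothetical anomaly scores: mostly low values plus one clear outlier.
scores = [1.0] * 50 + [1.2] * 49 + [9.0]
cutoff, flags = flag_anomalies(scores)
flagged = [i for i, is_anomaly in enumerate(flags) if is_anomaly]
print(f"cutoff={cutoff:.3f}, flagged indices={flagged}")
```

The multiplier is a parameter here because the right cutoff depends on the application; three is only the common starting point.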
The RCF algorithm in Amazon SageMaker works by first obtaining a random sample of the training data and partitioning it into equal-sized subsamples, one per tree. Each subsample is organized into a binary tree by randomly subdividing bounding boxes until each leaf represents a bounding box containing a single data point. The anomaly score assigned to an input data point is inversely proportional to its average depth across the forest, so a point that is isolated after only a few cuts receives a high score.
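The idea of random cuts and average depth can be illustrated with a toy sketch. This is not SageMaker's implementation: the function names, the 1 / average-depth score, and the synthetic 2-D data are assumptions made for the example. It only shows why a point far from the rest tends to be isolated after fewer cuts.

```python
import random

def isolation_depth(points, target, rng):
    """Count the random axis-aligned cuts needed to separate `target` from `points`."""
    box = list(points)
    depth = 0
    while len(box) > 1:
        dims = len(target)
        lows = [min(p[d] for p in box) for d in range(dims)]
        highs = [max(p[d] for p in box) for d in range(dims)]
        ranges = [hi - lo for lo, hi in zip(lows, highs)]
        if sum(ranges) == 0:
            break  # every remaining point is identical; nothing left to cut
        # Pick a dimension with probability proportional to the box's side length.
        d = rng.choices(range(dims), weights=ranges)[0]
        cut = rng.uniform(lows[d], highs[d])
        # Keep only the side of the cut that contains the target.
        kept = [p for p in box if (p[d] <= cut) == (target[d] <= cut)]
        if len(kept) == len(box):
            continue  # cut landed on the box edge and separated nothing; redraw
        box = kept
        depth += 1
    return depth

def toy_anomaly_score(data, target, num_trees=50, sample_size=64, seed=0):
    """Illustrative score: 1 / (average isolation depth over random subsamples)."""
    rng = random.Random(seed)
    depths = []
    for _ in range(num_trees):
        sample = rng.sample(data, min(sample_size, len(data)))
        depths.append(isolation_depth(sample + [target], target, rng))
    avg_depth = sum(depths) / len(depths)
    return 1.0 / avg_depth if avg_depth > 0 else float("inf")

# Synthetic 2-D data: a cluster around the origin.
data_rng = random.Random(42)
data = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]

inlier_score = toy_anomaly_score(data, (0.0, 0.0))
outlier_score = toy_anomaly_score(data, (10.0, 10.0))
print(f"inlier score={inlier_score:.3f}, outlier score={outlier_score:.3f}")
```

The point at (10, 10) sits far from the cluster, so most random cuts fall in the empty space between them and isolate it quickly, which gives it the higher score.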
A supervised approach uses XGBoost and a Linear Learner model, with hyperparameter tuning to improve the models further. Given any input transaction, the model predicts whether it is fraud or not fraud.
Fraudulent transactions only occur in CASH_OUT and TRANSFER. Consider skipping / dropping PAYMENT.
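A quick way to check this observation and act on it with pandas is sketched below. The small DataFrame is a hand-made stand-in for the PaySim table, so its values are illustrative only; with the real data, `df` would be loaded from the downloaded CSV instead.

```python
import pandas as pd

# Tiny stand-in for the PaySim table (illustrative values, not real data).
df = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "CASH_IN", "DEBIT", "TRANSFER"],
    "amount": [120.0, 5000.0, 5000.0, 300.0, 45.0, 800.0],
    "isFraud": [0, 1, 1, 0, 0, 0],
})

# Count fraudulent rows per transaction type.
fraud_by_type = df.groupby("type")["isFraud"].sum()
fraud_types = set(fraud_by_type[fraud_by_type > 0].index)
print(fraud_by_type)

# Drop PAYMENT rows, since no fraud occurs in them.
no_payment = df[df["type"] != "PAYMENT"].reset_index(drop=True)
print(no_payment["type"].tolist())
```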
Dropping these fields, as they don't contain data that will help the model:
- nameOrig
- nameDest
- isFlaggedFraud might not be accurate - consider making a new column for >$200K
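The column changes listed above might look like the following in pandas. This is a sketch on a made-up three-row DataFrame; the new column name `isLargeAmount` is hypothetical, and the 200,000 threshold is taken from the isFlaggedFraud description.

```python
import pandas as pd

# Tiny stand-in for the PaySim table (illustrative values, not real data).
df = pd.DataFrame({
    "nameOrig": ["C1", "C2", "C3"],
    "nameDest": ["M1", "C9", "C8"],
    "amount": [150.0, 250000.0, 200000.0],
    "isFlaggedFraud": [0, 0, 0],
})

LARGE_AMOUNT = 200_000  # threshold from the isFlaggedFraud description

# Hypothetical replacement flag: 1 when the amount exceeds the threshold.
df["isLargeAmount"] = (df["amount"] > LARGE_AMOUNT).astype(int)

# Drop the identifier columns and the original flag.
df = df.drop(columns=["nameOrig", "nameDest", "isFlaggedFraud"])
print(df)
```

Computing the flag directly from `amount` avoids relying on how the original isFlaggedFraud values were assigned.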
Overwriting the type column with numeric values: TRANSFER = 0, CASH_OUT = 1, PAYMENT = 2
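The encoding can be sketched with a dictionary and `Series.map`. The sample rows are made up, and the name `TYPE_CODES` is just a label for this example. Any type missing from the mapping (CASH_IN here) becomes NaN, so such rows would need to be filtered out or given their own codes first.

```python
import pandas as pd

# Mapping from the notes; the constant name is only for this example.
TYPE_CODES = {"TRANSFER": 0, "CASH_OUT": 1, "PAYMENT": 2}

# Illustrative rows, including one type that has no code assigned.
df = pd.DataFrame({"type": ["TRANSFER", "CASH_OUT", "PAYMENT", "CASH_IN"]})

# Overwrite the column; unmapped types turn into NaN.
df["type"] = df["type"].map(TYPE_CODES)
unmapped_rows = int(df["type"].isna().sum())
print(df)
print(f"rows without a code: {unmapped_rows}")
```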
The goal is to predict fraudulent transactions.