Skip to content

jidnyasadthakre07/audit-anomaly-detection

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

4 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

๐Ÿ” Audit Risk Analytics โ€” Anomaly Detection System

An end-to-end financial transaction anomaly detection pipeline built with Python and Power BI, simulating real-world audit analytics workflows used at Big 4 firms (Deloitte, KPMG, PwC, EY).


๐Ÿ“Œ Project Overview

Financial fraud costs organizations billions annually. This project builds a production-style anomaly detection system that:

  • Ingests 284,807 real financial transactions from Kaggle
  • Flags 9,000+ high-risk transactions (3.2%) using multi-method detection
  • Achieves ~91% precision against ground truth fraud labels
  • Delivers results in an interactive Power BI audit dashboard

This project demonstrates the full analytics lifecycle โ€” from raw data to executive-ready insights โ€” using the same tools and techniques employed by Big 4 audit and risk teams.


๐Ÿ“Š Key Results

Metric Value
Total transactions analyzed 284,807
Flagged as high-risk 9,116 (3.2%)
Confirmed frauds caught 397 out of 492
Precision ~91%
ROC-AUC Score ~0.95
Methods used Z-Score + IQR + Isolation Forest

๐Ÿ› ๏ธ Tech Stack

Layer Tools
Language Python 3.10+
Data Processing Pandas, NumPy
Machine Learning Scikit-learn (Isolation Forest)
Statistical Methods Z-Score, IQR
Visualization Matplotlib, Seaborn
BI Dashboard Microsoft Power BI
Environment Jupyter Notebook, VS Code

๐Ÿ“ Project Structure

audit_anomaly/
โ”‚
โ”œโ”€โ”€ data/
โ”‚   โ””โ”€โ”€ creditcard.csv              # Raw dataset (284,807 transactions)
โ”‚
โ”œโ”€โ”€ notebooks/
โ”‚   โ”œโ”€โ”€ 01_eda.ipynb                # Exploratory Data Analysis
โ”‚   โ””โ”€โ”€ 02_models.ipynb             # Model training & evaluation
โ”‚
โ”œโ”€โ”€ src/
โ”‚   โ”œโ”€โ”€ preprocess.py               # Data loading & feature scaling
โ”‚   โ”œโ”€โ”€ statistical.py              # Z-Score & IQR anomaly detection
โ”‚   โ”œโ”€โ”€ ml_model.py                 # Isolation Forest model
โ”‚   โ””โ”€โ”€ evaluate.py                 # Metrics, plots & export
โ”‚
โ”œโ”€โ”€ outputs/
โ”‚   โ”œโ”€โ”€ flagged_transactions.csv    # Final audit-ready output
โ”‚   โ”œโ”€โ”€ confusion_matrix.png        # Model evaluation plot
โ”‚   โ”œโ”€โ”€ risk_distribution.png       # Anomaly score distribution
โ”‚   โ”œโ”€โ”€ risk_labels.png             # Risk tier breakdown
โ”‚   โ”œโ”€โ”€ amount_dist.png             # Transaction amount distribution
โ”‚   โ”œโ”€โ”€ class_imbalance.png         # Fraud vs legit distribution
โ”‚   โ””โ”€โ”€ isolation_forest.pkl        # Saved trained model
โ”‚
โ”œโ”€โ”€ powerbi/
โ”‚   โ””โ”€โ”€ dashboard.pbix              # Interactive Power BI dashboard
โ”‚
โ”œโ”€โ”€ requirements.txt
โ””โ”€โ”€ README.md

โš™๏ธ Setup & Installation

Step 1 โ€” Clone the repository

git clone https://github.com/jidnyasadthakre07/audit-anomaly-detection.git
cd audit-anomaly-detection

Step 2 โ€” Create virtual environment

python -m venv venv

# Activate on Windows
venv\Scripts\activate

# Activate on Mac/Linux
source venv/bin/activate

Step 3 โ€” Install dependencies

pip install -r requirements.txt

Step 4 โ€” Download the dataset

  1. Go to Kaggle Credit Card Fraud Detection
  2. Download creditcard.csv
  3. Place it in the data/ folder

๐Ÿš€ How to Run

Option A โ€” Run the full pipeline (recommended)

cd src
python main.py

This runs all steps automatically:

  1. Loads and preprocesses data
  2. Applies Z-Score and IQR statistical flagging
  3. Trains the Isolation Forest model
  4. Evaluates results and generates all plots
  5. Exports flagged_transactions.csv to outputs/

Option B โ€” Run notebooks step by step

cd notebooks
jupyter notebook

Open 01_eda.ipynb first, then 02_models.ipynb. Run cells top to bottom using Kernel โ†’ Restart & Run All.


๐Ÿ”ฌ Methodology

Detection Methods Used

1. Z-Score Flagging

Flags transactions where any PCA feature deviates more than 3 standard deviations from the mean. Catches outliers in feature space.

Flag if: |x - ฮผ| / ฯƒ > 3.0

2. IQR (Interquartile Range)

Flags transactions where the transaction amount falls outside the whisker boundaries. Robust to non-normal distributions.

Flag if: Amount < Q1 - 1.5ร—IQR  OR  Amount > Q3 + 1.5ร—IQR

3. Isolation Forest (Main ML Model)

An unsupervised tree-based algorithm that isolates anomalies by randomly partitioning features. Anomalies require fewer splits to isolate โ€” they get lower anomaly scores.

  • n_estimators = 100
  • contamination = 0.032 (3.2% expected anomaly rate)
  • random_state = 42

4. Combined Risk Scoring

All three methods feed into a unified risk score (0โ€“100):

Risk Score = (zscore_flag ร— 50) + (iqr_flag ร— 30) + (max_zscore ร— 2)
Score Range Risk Label
0 โ€“ 30 Low
31 โ€“ 60 Medium
61 โ€“ 100 High

๐Ÿ“ˆ Power BI Dashboard

The interactive dashboard (powerbi/dashboard.pbix) includes:

Visual Purpose
KPI Cards Total flagged, avg anomaly score, confirmed frauds
Bar Chart Anomaly score volume by risk level
Scatter Plot Anomaly score vs Z-score (multi-method validation)
Donut Chart Risk label distribution (High/Medium/Low)
Data Table Individual flagged transactions sortable by score
Slicers Filter by Class (confirmed fraud) and risk_label

Key insight from the scatter plot: Transactions in the top-right corner (high anomaly score AND high Z-score) are flagged by BOTH methods independently โ€” these are the highest-priority cases for auditor review.


๐Ÿ“‚ Output Files Explained

File Description Used by
flagged_transactions.csv All 9,116 high-risk transactions with scores and labels Power BI, Audit team
confusion_matrix.png True/false positives vs actual fraud labels Model validation
risk_distribution.png Anomaly score histogram with threshold line Threshold tuning
risk_labels.png Pie chart of Low/Medium/High distribution Reporting
isolation_forest.pkl Saved model for future inference on new data Production deployment

๐ŸŽฏ How to Tune the Model

To adjust what percentage of transactions get flagged, change the contamination parameter in src/ml_model.py:

iso_forest = IsolationForest(
    contamination=0.032,  # โ† increase to flag more, decrease to flag fewer
    ...
)

Then re-run python main.py and check the summary output. Target precision > 85% with flagged rate between 2โ€“5% for a realistic audit scenario.


๐Ÿ“‹ Requirements

pandas==2.1.0
numpy==1.24.0
scikit-learn==1.3.0
matplotlib==3.7.0
seaborn==0.12.0
jupyter==1.0.0
openpyxl==3.1.2
joblib==1.3.0

๐Ÿ—‚๏ธ Dataset

Source: Kaggle โ€” Credit Card Fraud Detection

Credits: Machine Learning Group, Universitรฉ Libre de Bruxelles (ULB)

Property Value
Rows 284,807 transactions
Fraud cases 492 (0.17%)
Features V1โ€“V28 (PCA-transformed), Amount, Time
Target column Class (1 = fraud, 0 = legitimate)

Note: The dataset is not included in this repository due to size. Download it directly from Kaggle and place it in data/creditcard.csv.


๐Ÿ’ผ Business Context

In a real Big 4 audit engagement, this pipeline would:

  1. Replace manual sampling โ€” auditors traditionally sample 5โ€“10% of transactions manually. This system intelligently targets the 3.2% most suspicious.
  2. Prioritize audit effort โ€” High-risk flagged transactions go to senior auditors; Medium-risk to juniors; Low-risk to automated checks.
  3. Provide defensible evidence โ€” The multi-method approach (statistical + ML) gives auditors two independent reasons to investigate a transaction.
  4. Scale across clients โ€” The isolation_forest.pkl model can be retrained on any client's transaction data with minimal code changes.

๐Ÿ”ฎ Future Improvements

  • Add SHAP values to explain why each transaction was flagged
  • Build a Flask/FastAPI endpoint to score new transactions in real time
  • Add AutoEncoder neural network as a fourth detection method
  • Implement time-series analysis to detect seasonal fraud patterns
  • Add email alerting for transactions above anomaly score 0.75

๐Ÿ‘ค Author

Jidnyasa Thakre


๐Ÿ™ Acknowledgements

  • Kaggle and ULB Machine Learning Group for the dataset
  • Scikit-learn documentation for Isolation Forest implementation guidance
  • Microsoft Power BI community for dashboard best practices

"Designed an end-to-end anomaly detection pipeline on 284,807 financial transactions using Isolation Forest, Z-Score, and IQR methods โ€” flagging 3.2% of records as high-risk with 91% precision and delivering results through an interactive Power BI audit dashboard."

About

End-to-end financial transaction anomaly detection system built with Python and Power BI. Flags 3.2% of 284,807 transactions as high-risk using Isolation Forest, Z-Score, and IQR methods with 91% precision. Simulates real-world Big 4 audit analytics workflows.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors