This project focuses on detecting fraudulent insurance claims using machine learning techniques. The dataset contains various policyholder details, claim information, and incident reports. The goal is to preprocess the data and build a model that classifies claims as fraudulent or non-fraudulent.
- The dataset consists of multiple categorical and numerical features related to insurance claims.
- The target variable (`fraud_reported`) indicates whether a claim is fraudulent (1) or not (0).
- Some features include policy details, incident descriptions, and claim amounts.
Several preprocessing steps were applied to clean and prepare the dataset for modeling:
- Handling Missing Values
  - Replaced `?` with `NaN` and imputed missing categorical values with `'Unknown'`.
- Feature Engineering
  - Converted `fraud_reported` from categorical (Y/N) to binary (1/0).
  - Converted `policy_bind_date` and `incident_date` into `datetime` format.
  - Dropped unnecessary columns (`policy_number`, `insured_zip`, `incident_location`, etc.).
  - Applied one-hot encoding to categorical features.
- Data Export
  - The cleaned dataset is saved as `insurance_claims_preprocessed.csv`.
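The steps above could be sketched roughly as follows with pandas. This is an illustration, not the actual contents of `preprocess.py`. The inline sample rows, the extra columns (`incident_type`, `collision_type`, `total_claim_amount`), and the function name `preprocess` are assumptions made for the example. The dates and target are converted before imputation so that `'Unknown'` is only written into genuinely categorical columns.

```python
import io

import numpy as np
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype

# Hypothetical sample standing in for data/insurance_claims.csv.
# The values (and any column not named in this README) are made up.
RAW_CSV = """policy_number,policy_bind_date,incident_date,insured_zip,incident_location,incident_type,collision_type,total_claim_amount,fraud_reported
1001,2014-10-17,2015-01-25,466132,9935 4th Drive,Single Vehicle Collision,Side Collision,71610,Y
1002,2006-06-27,2015-01-21,468176,6608 MLK Hwy,Vehicle Theft,?,5070,N
1003,2000-09-06,2015-02-22,430632,7121 Francis Lane,Multi-vehicle Collision,Rear Collision,34650,N
"""


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps described in this README to a raw claims frame."""
    df = df.copy()

    # Missing values are encoded as '?' in the raw data; turn them into NaN.
    df = df.replace("?", np.nan)

    # Target: categorical Y/N -> binary 1/0.
    df["fraud_reported"] = df["fraud_reported"].map({"Y": 1, "N": 0})

    # Parse the two date columns into datetime.
    for col in ("policy_bind_date", "incident_date"):
        df[col] = pd.to_datetime(df[col])

    # Drop identifier-like columns (the README lists these three, then "etc.").
    df = df.drop(columns=["policy_number", "insured_zip", "incident_location"])

    # Whatever is neither numeric nor datetime is treated as categorical.
    cat_cols = [
        c for c in df.columns
        if not is_numeric_dtype(df[c]) and not is_datetime64_any_dtype(df[c])
    ]

    # Impute missing categorical values, then one-hot encode.
    df[cat_cols] = df[cat_cols].fillna("Unknown")
    return pd.get_dummies(df, columns=cat_cols)


if __name__ == "__main__":
    clean = preprocess(pd.read_csv(io.StringIO(RAW_CSV)))
    print(clean.shape)
    # To export, as in the Data Export step:
    # clean.to_csv("data/insurance_claims_preprocessed.csv", index=False)
```

Note that `pd.get_dummies` creates one column per category value, so a missing `collision_type` ends up as its own `collision_type_Unknown` indicator instead of being dropped.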
📁 Fraud-Detection-Insurance
│── 📂 data
│ ├── insurance_claims.csv # Raw dataset
│ ├── insurance_claims_preprocessed.csv # Cleaned dataset
│── 📂 src
│ ├── preprocess.py # Data preprocessing script
│ ├── model_train.py # Model training and evaluation
│── README.md # Project documentation
The project aims to train a machine learning model that classifies insurance claims as fraudulent or non-fraudulent. The model will be evaluated on metrics such as accuracy, precision, recall, and F1-score.
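To make the four metrics concrete, here is a minimal dependency-free sketch that computes them from binary labels, treating fraud (1) as the positive class. The function name `classification_metrics` and the example labels are illustrative assumptions. This README does not say how `model_train.py` computes its scores, and in practice scikit-learn's `sklearn.metrics` functions (`accuracy_score`, `precision_score`, `recall_score`, `f1_score`) would likely be used instead.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud correctly flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate, not flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate, wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud that was missed

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration: 3 fraudulent claims out of 8.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

In this example one fraudulent claim is missed and one legitimate claim is wrongly flagged. Accuracy counts both errors equally, while precision and recall separate the two kinds of mistake, which is why all four metrics are reported together.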
- Applying feature selection and dimensionality reduction to improve model performance.
- Implementing more advanced approaches such as ensemble methods and deep learning.
- Deploying the model using Flask or FastAPI for real-time fraud detection.