A comprehensive machine learning system for detecting credit card fraud using advanced algorithms and real-time prediction capabilities.
- Multiple ML Models: Random Forest, XGBoost, Logistic Regression, LightGBM, Gradient Boosting, and Neural Networks
- Real-time API: FastAPI-based REST API for real-time fraud predictions
- Comprehensive Evaluation: Advanced metrics including fraud-specific measures and cost analysis
- Robust Error Handling: Centralized error management and logging
- Automated Testing: Unit, integration, and performance tests
- Production Deployment: Docker and Kubernetes deployment configurations
- Monitoring: Prometheus metrics and Grafana dashboards
- Batch Processing: Support for both real-time and batch predictions
- Installation
- Quick Start
- Project Structure
- Usage
- API Documentation
- Model Training
- Evaluation
- Deployment
- Testing
- Configuration
- Contributing
- License
- Python 3.8+
- Docker and Docker Compose (for containerized deployment)
- kubectl (for Kubernetes deployment)
- Clone the repository

  ```bash
  git clone https://github.com/widgetwalker/credit-card-fraud-detection.git
  cd credit-card-fraud-detection
  ```

- Create a virtual environment

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Install the package in development mode

  ```bash
  pip install -e .
  ```
Ensure your data is in CSV format with the following structure:

```csv
Time,V1,V2,V3,...,V28,Amount,Class
0,-1.3598071336738,-0.0727811733098497,2.53634673796948,...,0.133558376740292,-0.0210530535080104,149.62,0
```

Train models:

```bash
python scripts/train_model.py \
  --data-path data/creditcard.csv \
  --models random_forest xgboost logistic_regression \
  --use-ensemble \
  --cross-validation \
  --output-dir models/trained_models
```

Evaluate a trained model:

```bash
python scripts/evaluate_model.py \
  --data-path data/test_data.csv \
  --model-path models/trained_models/random_forest_*.pkl \
  --preprocessor-path models/trained_models/preprocessor_*.pkl \
  --generate-plots \
  --threshold-optimization
```

Generate predictions for new transactions:

```bash
python scripts/predict.py \
  --data-path data/new_transactions.csv \
  --model-path models/trained_models/best_model.pkl \
  --preprocessor-path models/trained_models/preprocessor.pkl \
  --output-path predictions/output.csv
```

Start the API server:

```bash
python -m src.api.main
```

Or using Docker:
```bash
docker-compose up -d
```

```
credit-card-fraud-detection/
├── src/
│   ├── data/
│   │   ├── __init__.py
│   │   └── preprocessing.py     # Data preprocessing and feature engineering
│   ├── models/
│   │   ├── __init__.py
│   │   └── fraud_models.py      # ML model implementations
│   ├── evaluation/
│   │   ├── __init__.py
│   │   └── metrics.py           # Evaluation metrics and model comparison
│   ├── api/
│   │   ├── __init__.py
│   │   └── main.py              # FastAPI application
│   └── utils/
│       ├── __init__.py
│       ├── config.py            # Configuration management
│       └── error_handling.py    # Error handling utilities
├── scripts/
│   ├── train_model.py           # Model training script
│   ├── evaluate_model.py        # Model evaluation script
│   ├── predict.py               # Prediction script
│   └── deploy.py                # Deployment script
├── tests/
│   ├── conftest.py              # Test configuration
│   ├── unit/                    # Unit tests
│   ├── integration/             # Integration tests
│   └── performance/             # Performance tests
├── config/
│   ├── model_config.yaml        # Model configuration
│   └── deployment_config.yaml   # Deployment configuration
├── deployment/
│   ├── docker/                  # Docker deployment files
│   │   ├── Dockerfile
│   │   ├── docker-compose.yml
│   │   └── nginx.conf
│   └── kubernetes/              # Kubernetes deployment files
│       ├── deployment.yaml
│       ├── service.yaml
│       ├── ingress.yaml
│       └── hpa.yaml
├── requirements.txt             # Python dependencies
├── setup.py                     # Package setup
└── README.md                    # This file
```
The system includes comprehensive data preprocessing capabilities:

```python
from src.data.preprocessing import DataPreprocessor

preprocessor = DataPreprocessor(
    handle_missing='median',
    scale_features=True,
    encode_categorical=True,
    feature_selection='mutual_info',
    k_best=20
)

X_processed = preprocessor.fit_transform(X, y)
```

Train multiple models with a single command:
```python
from src.models.fraud_models import ModelFactory

# Create and train a model
model = ModelFactory.create_model('xgboost')
model.train(X_train, y_train)

# Evaluate the model
results = model.evaluate(X_test, y_test)
print(f"AUC: {results['auc_score']}")
```

Comprehensive evaluation with fraud-specific metrics:
```python
from src.evaluation.metrics import FraudDetectionMetrics

evaluator = FraudDetectionMetrics(cost_matrix=[1, 50, 10, 0])
metrics = evaluator.evaluate_basic_metrics(y_true, y_pred_proba)
fraud_metrics = evaluator.evaluate_fraud_specific_metrics(y_true, y_pred_proba)
```

Check service health:

```bash
curl http://localhost:8000/health
```

Make a single prediction:

```bash
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{
    "transaction": {
      "Time": 0,
      "V1": -1.3598071336738,
      "V2": -0.0727811733098497,
      ...
    },
    "threshold": 0.5
  }'
```

Make a batch prediction:

```bash
curl -X POST "http://localhost:8000/predict/batch" \
  -H "Content-Type: application/json" \
  -d '{
    "transactions": [
      {"Time": 0, "V1": -1.36, ...},
      {"Time": 1, "V1": 0.5, ...}
    ],
    "threshold": 0.5
  }'
```

The API provides the following endpoints:
- `GET /health` - Check if the service is healthy
- `GET /model/info` - Get model information and parameters
- `POST /predict` - Single transaction prediction
- `POST /predict/batch` - Batch prediction for multiple transactions
- `POST /model/reload` - Reload the model (admin endpoint)
Full API documentation is available at http://localhost:8000/docs when the server is running.
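The endpoints above can also be called from Python. The sketch below is a minimal client, assuming only the request shape shown in the curl examples; the `build_payload` and `predict` helpers are illustrative, not part of the project's code:

```python
import json
import urllib.request

# Base URL as used in the curl examples above.
API_URL = "http://localhost:8000/predict"

def build_payload(transaction: dict, threshold: float = 0.5) -> dict:
    """Wrap a feature dict in the request body expected by /predict."""
    return {"transaction": transaction, "threshold": threshold}

def predict(transaction: dict, threshold: float = 0.5) -> dict:
    """POST a single transaction and return the decoded JSON response.
    Requires the API server to be running on localhost:8000."""
    body = json.dumps(build_payload(transaction, threshold)).encode()
    req = urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Build (but do not send) a request body for a partial transaction.
payload = build_payload({"Time": 0, "V1": -1.36, "V2": -0.07}, threshold=0.5)
print(json.dumps(payload))
```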
- Random Forest: Ensemble method with good interpretability
- XGBoost: Gradient boosting with excellent performance
- Logistic Regression: Simple baseline model
- LightGBM: Fast gradient boosting
- Gradient Boosting: Scikit-learn implementation
- Neural Network: Deep learning approach
- Data Loading: Load and validate training data
- Preprocessing: Handle missing values, scaling, feature selection
- Model Training: Train multiple models with cross-validation
- Evaluation: Comprehensive evaluation with multiple metrics
- Ensemble Creation: Combine best models for improved performance
- Model Saving: Save trained models and preprocessing pipeline
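The steps above can be sketched end-to-end with plain scikit-learn on synthetic data. This is an illustration of the pipeline's shape, not the actual `scripts/train_model.py`; the model choices, split sizes, and synthetic data are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic imbalanced data standing in for creditcard.csv (~1% positives).
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.99], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Train candidate models with cross-validation on the training split.
models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=3, scoring="roc_auc")
    print(f"{name}: CV AUC = {scores.mean():.3f}")

# Combine the candidates into a soft-voting ensemble, then evaluate held out.
ensemble = VotingClassifier(list(models.items()), voting="soft")
ensemble.fit(X_train, y_train)
auc = roc_auc_score(y_test, ensemble.predict_proba(X_test)[:, 1])
print(f"ensemble test AUC = {auc:.3f}")
```

The real pipeline additionally persists the fitted models and preprocessor (e.g. via pickle) to `--output-dir`.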
```bash
python scripts/train_model.py --help
```

Key options:

- `--models`: Models to train (multiple choices)
- `--use-ensemble`: Create ensemble model
- `--cross-validation`: Perform cross-validation
- `--test-size`: Test set size (default: 0.2)
- `--output-dir`: Directory to save models
- Accuracy, Precision, Recall, F1-Score
- ROC-AUC and PR-AUC
- Confusion Matrix
- Fraud Recall: Recall for fraud cases
- False Alarm Rate: False positive rate
- Cost-Weighted Error: Business cost analysis
- Precision-Recall Balance: Optimized threshold finding
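A cost-weighted error can be computed from the confusion matrix. The sketch below is a plausible formulation with illustrative costs; the exact semantics of the `cost_matrix` passed to `FraudDetectionMetrics` are defined in `src/evaluation/metrics.py`, not here:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def cost_weighted_error(y_true, y_pred, fn_cost=50.0, fp_cost=10.0):
    """Average business cost per transaction: a missed fraud (FN) is
    typically far more expensive than a false alarm (FP). The cost
    values are illustrative, not the project's configured matrix."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return (fn * fn_cost + fp * fp_cost) / len(y_true)

y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1, 0])  # one false alarm, one missed fraud
print(cost_weighted_error(y_true, y_pred))  # (1*50 + 1*10) / 6 = 10.0
```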
- Threshold Optimization: Find optimal threshold
- Cost Analysis: Business impact analysis
- Model Comparison: Compare multiple models
- Cross-Validation: Robust performance estimation
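One common way to find an optimal threshold is to maximize F1 along the precision-recall curve. The helper below is a standalone sketch of that idea, not the project's `--threshold-optimization` implementation:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def optimal_threshold(y_true, y_scores):
    """Return the probability threshold that maximizes F1 on the
    precision-recall curve (one way to balance precision and recall)."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    # precision/recall have one more entry than thresholds; drop the last.
    p, r = precision[:-1], recall[:-1]
    f1 = 2 * p * r / np.clip(p + r, 1e-12, None)
    return thresholds[int(np.argmax(f1))]

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_scores = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.2, 0.9])
t = optimal_threshold(y_true, y_scores)
print(f"optimal threshold = {t:.2f}")  # 0.35 for this toy data
```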
```bash
python scripts/evaluate_model.py --help
```

Key options:

- `--generate-plots`: Create evaluation plots
- `--threshold-optimization`: Find optimal threshold
- `--cost-matrix`: Custom cost matrix for analysis
- Build and start services

  ```bash
  docker-compose up -d
  ```

- Check service status

  ```bash
  docker-compose ps
  ```

- View logs

  ```bash
  docker-compose logs -f
  ```

- Deploy to Kubernetes

  ```bash
  python scripts/deploy.py --environment production --deployment-type kubernetes
  ```

- Check deployment status

  ```bash
  kubectl get pods -n fraud-detection
  ```

- Access the service

  ```bash
  kubectl get svc -n fraud-detection
  ```

```bash
python scripts/deploy.py --help
```

Key options:

- `--environment`: Deployment environment
- `--deployment-type`: Docker or Kubernetes
- `--skip-tests`: Skip pre-deployment tests
- `--dry-run`: Perform dry run
```bash
# Run all tests
python -m pytest tests/ -v

# Run unit tests only
python -m pytest tests/unit/ -v

# Run integration tests
python -m pytest tests/integration/ -v

# Run performance tests
python -m pytest tests/performance/ -v

# Run with coverage
python -m pytest tests/ --cov=src --cov-report=html
```

- Unit Tests: Test individual components
- Integration Tests: Test component interactions
- Performance Tests: Test system performance and scalability
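Tests follow standard pytest conventions (plain functions named `test_*` with bare assertions). The sketch below is hypothetical: the helper under test is defined inline rather than imported from `src/`, and its scaling constants are illustrative:

```python
# Hypothetical tests/unit/test_preprocessing.py-style example.
def normalize_amount(amount: float, mean: float = 88.35, std: float = 250.12) -> float:
    """Standard-scale a transaction amount (illustrative constants)."""
    return (amount - mean) / std

def test_normalize_amount_centers_the_mean():
    # An amount equal to the mean should map to (approximately) zero.
    assert abs(normalize_amount(88.35)) < 1e-9

def test_normalize_amount_scales_by_std():
    # One standard deviation above the mean should map to ~1.0.
    assert abs(normalize_amount(88.35 + 250.12) - 1.0) < 1e-9
```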
Edit config/model_config.yaml to customize:
- Model parameters
- Preprocessing options
- Feature engineering settings
- Evaluation metrics
Edit config/deployment_config.yaml to customize:
- Environment settings
- Security configurations
- Monitoring options
- Resource allocations
- Install development dependencies

  ```bash
  pip install -r requirements-dev.txt
  ```

- Install pre-commit hooks

  ```bash
  pre-commit install
  ```

- Run code formatting

  ```bash
  black src/ tests/
  isort src/ tests/
  ```

- Run linting

  ```bash
  flake8 src/ tests/
  mypy src/
  ```
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Typical performance on standard hardware:
- Training Time: 2-5 minutes for 100K samples
- Prediction Speed: ~1000 predictions/second
- Memory Usage: ~500MB for ensemble model
- API Latency: <100ms for single predictions
- Model Selection: Use XGBoost or LightGBM for best performance
- Feature Engineering: Apply domain-specific feature engineering
- Ensemble Methods: Combine multiple models for better accuracy
- Hardware: Use GPU for neural network training
- Caching: Enable Redis caching for frequently accessed data
- API Authentication: Token-based authentication
- Rate Limiting: Prevent API abuse
- Input Validation: Comprehensive input validation
- Error Sanitization: Secure error messages
- HTTPS: SSL/TLS encryption
- CORS: Cross-origin resource sharing protection
- Never commit sensitive data (API keys, passwords)
- Use environment variables for configuration
- Regularly update dependencies
- Monitor for security vulnerabilities
- Implement proper logging and monitoring
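Keeping configuration in environment variables can look like the sketch below. This is a minimal illustration, not the project's `src/utils/config.py`; the variable names and defaults are assumptions:

```python
import os

def load_settings(env=os.environ):
    """Read configuration from environment variables with safe defaults,
    so secrets never live in the repository. Names are illustrative."""
    return {
        "api_token": env.get("FRAUD_API_TOKEN", ""),  # required in production
        "model_path": env.get("MODEL_PATH", "models/trained_models/best_model.pkl"),
        "log_level": env.get("LOG_LEVEL", "INFO"),
        "rate_limit": int(env.get("RATE_LIMIT_PER_MINUTE", "60")),
    }

settings = load_settings({"FRAUD_API_TOKEN": "secret", "RATE_LIMIT_PER_MINUTE": "120"})
print(settings["rate_limit"])  # 120
```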
- Application Metrics: Request count, latency, errors
- Model Metrics: Prediction confidence, model drift
- System Metrics: CPU, memory, disk usage
- Business Metrics: Fraud detection rate, false positive rate
Access Grafana dashboards at http://localhost:3000 (admin/admin)
Configure alerts for:
- High error rates
- Model performance degradation
- System resource exhaustion
- Unusual prediction patterns
- Check the FAQ
- Review the documentation
- Open an issue
- Contact the development team
When reporting issues, please include:
- System information (OS, Python version)
- Error messages and stack traces
- Steps to reproduce the issue
- Expected vs actual behavior
Q: What data format is required?
A: CSV format with numerical features and a binary target column.

Q: How do I handle missing values?
A: The preprocessing pipeline automatically handles missing values based on configuration.

Q: Can I use custom models?
A: Yes, extend the base FraudDetectionModel class to add custom models.

Q: How do I deploy to cloud providers?
A: Use the Kubernetes deployment files and adapt them to your cloud provider.

Q: What are the minimum system requirements?
A: 2GB RAM, 2 CPU cores, 10GB disk space for basic deployment.
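A custom model might look like the sketch below. The real base class interface lives in `src/models/fraud_models.py`; here we only assume the `train`/`evaluate` methods seen in the usage example earlier and wrap a scikit-learn estimator:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import GaussianNB

class NaiveBayesFraudModel:
    """Custom model following the train/evaluate interface used in the
    usage examples above. The actual FraudDetectionModel base class may
    require additional methods."""

    def __init__(self):
        self.estimator = GaussianNB()

    def train(self, X_train, y_train):
        self.estimator.fit(X_train, y_train)
        return self

    def predict_proba(self, X):
        # Probability of the positive (fraud) class.
        return self.estimator.predict_proba(X)[:, 1]

    def evaluate(self, X_test, y_test):
        scores = self.predict_proba(X_test)
        return {"auc_score": roc_auc_score(y_test, scores)}

# Tiny smoke test on cleanly separable toy data.
X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y = np.array([0, 0, 0, 1, 1, 1])
model = NaiveBayesFraudModel().train(X, y)
print(model.evaluate(X, y)["auc_score"])  # 1.0: the classes rank perfectly
```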
This project is licensed under the MIT License - see the LICENSE file for details.
- Scikit-learn team for the excellent ML library
- FastAPI team for the modern web framework
- The open-source community for various tools and libraries
- Real-time streaming predictions
- Advanced model interpretability
- AutoML capabilities
- Multi-model serving
- Advanced anomaly detection
- Graph-based fraud detection
- Model quantization
- GPU acceleration
- Distributed training
- Edge deployment support
Made with ❤️ by the Fraud Detection Team