This document provides a comprehensive overview of the Vehicle-Insurance-DataPipeline-MLops- project, an end-to-end MLOps solution for building, deploying, and monitoring machine learning models for vehicle insurance prediction.
- Project Overview
- Architecture
- Prerequisites & Dependencies
- Installation & Setup
- Project Structure
- Pipeline Flow
- Stage-wise Pipeline Details
- Code Structure
- Usage
- Testing
- Deployment
- Monitoring & Logging
- Contributing
- License
## Project Overview

The Vehicle Insurance Data Pipeline is a production-ready MLOps framework that automates the machine learning lifecycle for predicting vehicle insurance claims. It applies best practices in data engineering, model training, evaluation, and deployment using cloud-native technologies.

Key features:
- Modular Pipeline: Each stage is independent and testable.
- Cloud Integration: Seamless integration with AWS S3 for storage and deployment.
- Data Validation: Automated schema validation and data quality checks.
- Model Versioning: Version control for models and artifacts.
- Scalable Architecture: Designed for production workloads.
- Monitoring: Comprehensive logging and error handling.
This project predicts vehicle insurance claim probabilities based on historical data, enabling insurers to assess risk and set premiums more accurately.
Technology stack:

- Programming Language: Python 3.8+
- ML Framework: Scikit-learn, Pandas, NumPy
- Cloud Storage: AWS S3
- Database: MongoDB
- Containerization: Docker
- Version Control: Git
- CI/CD: GitHub Actions (Automated Build & Deploy)
- Web Framework: FastAPI with Uvicorn
- Deployment: AWS EC2 with Docker
## Architecture

The project follows a layered architecture:
┌─────────────────┐
│ Application │ (app.py, demo.py)
├─────────────────┤
│ Pipeline │ (training_pipeline.py, prediction_pipeline.py)
├─────────────────┤
│ Components │ (data_ingestion, validation, etc.)
├─────────────────┤
│ Services │ (cloud_storage, data_access, etc.)
├─────────────────┤
│ Utilities │ (logger, exception, utils)
├─────────────────┤
│ Configuration │ (config/, constants/)
└─────────────────┘
Data flow:
- Raw data → MongoDB
- Ingestion → Validation → Transformation
- Training → Evaluation → Deployment
- Artifacts → S3 Storage
## Prerequisites & Dependencies

- Python 3.8 or higher
- Docker (for containerized deployment)
- AWS CLI configured with appropriate permissions
- MongoDB instance (local or cloud)
See `requirements.txt` for the full list. Key packages:
- pandas, numpy, scikit-learn
- boto3 (AWS SDK)
- pymongo (MongoDB driver)
- imbalanced-learn (SMOTEENN)
- pyyaml (configuration)
## Installation & Setup

1. Clone the repository:
   `git clone <repository-url>`
   `cd Vehicle-Insurance-DataPipeline-MLops-`
2. Create a virtual environment:
   `python -m venv venv`
   `source venv/bin/activate` (on Windows: `venv\Scripts\activate`)
3. Install dependencies:
   `pip install -r requirements.txt`
4. Configure the environment:
   - Copy `.env.example` to `.env`
   - Set AWS credentials and the MongoDB URI
   - Update `config/model.yaml` and `config/schema.yaml`
5. Run setup:
   `python setup.py install`
## Project Structure

| File | Description |
|---|---|
| app.py | FastAPI application entry point; serves prediction API on port 5000. |
| demo.py | Training pipeline entry point; runs full ML pipeline. |
| Dockerfile | Docker image build instructions for containerized deployment. |
| requirements.txt | List of Python dependencies (scikit-learn==1.7.2, boto3, etc.). |
| pyproject.toml | Python packaging and dependency configuration. |
| setup.py | Script for installing the project as a package. |
| template.py | Template for creating new modules/scripts. |
| LICENSE | MIT License file. |
| README.md | Main project documentation and usage instructions. |
| crashcourse.txt | Learning notes and crash course summary. |
| project_flow.txt | Describes the project workflow. |
| workflow.txt | Workflow steps in text format. |
| Folder | Description |
|---|---|
| src/ | Main source code, organized into submodules (see below for details). |
| config/ | Configuration files (model.yaml, schema.yaml). |
| templates/ | HTML templates for FastAPI web interface. |
| static/ | Static CSS/JS files for web interface. |
| .github/ | GitHub Actions CI/CD workflows (awes.yaml). |
| notebook/ | Jupyter notebooks and data files for experimentation. |
| artifact/ | (Local only) Output artifacts from pipeline runs, organized by timestamp. |
| logs/ | (Local only) Log files generated by the application. |
The `src/` directory contains the following submodules:

- cloud_storage/: Utilities for interacting with cloud storage (e.g., AWS S3).
- components/: Core pipeline components:
  - `data_ingestion.py`: Loads data from MongoDB to the feature store.
  - `data_validation.py`: Validates data schema and integrity.
  - `data_transformation.py`: Transforms and preprocesses data.
  - `model_trainer.py`: Trains machine learning models.
  - `model_evaluation.py`: Evaluates trained models.
  - `model_pusher.py`: Pushes models to deployment targets.
- configuration/: Configuration management (e.g., MongoDB connection).
- constants/: Project-wide constants (e.g., file paths, column names).
- data_access/: Data access layer for reading/writing data.
- entity/: Entity definitions for configuration and artifacts.
- exception/: Custom exception classes.
- logger/: Logging utilities.
- pipline/: Pipeline orchestration (training and prediction).
- utils/: Utility functions (e.g., file I/O, helpers).
- test_utilities/: Development and testing scripts:
  - `test_aws_connection.py`: Tests AWS connection and credentials.
  - `check_s3_bucket.py`: Checks S3 bucket status.
  - `check_s3_versions.py`: Lists S3 object versions.
## Pipeline Flow

flowchart TD
A[Start: app.py/demo.py] --> B[data_ingestion.py\n(Data Ingestion)]
B --> C[data_validation.py\n(Data Validation)]
C --> D[data_transformation.py\n(Data Transformation)]
D --> E[model_trainer.py\n(Model Training)]
E --> F[model_evaluation.py\n(Model Evaluation)]
F --> G[model_pusher.py\n(Model Pushing)]
G --> H[Artifacts Saved to S3]
H --> I[End]
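The flowchart maps onto the orchestration in `src/pipline/training_pipeline.py`. Below is a minimal sketch of how the stages might be chained; the class names mirror the component files, but the constructor and method signatures are assumptions, not the project's exact API:

```python
# Hedged sketch of stage chaining, in the spirit of TrainPipeline.run_pipeline().
# Constructor and method signatures are assumptions.
from src.components.data_ingestion import DataIngestion
from src.components.data_validation import DataValidation
from src.components.data_transformation import DataTransformation
from src.components.model_trainer import ModelTrainer
from src.components.model_evaluation import ModelEvaluation
from src.components.model_pusher import ModelPusher

def run_pipeline() -> None:
    # Each stage consumes the previous stage's artifact and emits its own.
    ingestion = DataIngestion().initiate_data_ingestion()
    validation = DataValidation(ingestion).initiate_data_validation()
    transformation = DataTransformation(ingestion, validation).initiate_data_transformation()
    trainer = ModelTrainer(transformation).initiate_model_trainer()
    evaluation = ModelEvaluation(trainer).initiate_model_evaluation()
    if evaluation.is_model_accepted:
        ModelPusher(evaluation).initiate_model_pusher()
```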
## Stage-wise Pipeline Details

### Configuration & Setup

- Purpose: Sets up all configuration, constants, and logging for the pipeline.
- Key Steps:
- Loads configuration files (YAML, TOML, etc.).
- Sets up logging and exception handling.
- Defines constants and environment variables.
- Artifacts Produced:
- Configuration objects
- Log files
### Data Ingestion

- Purpose: Loads raw data from MongoDB (or other sources) and saves it to a feature store (CSV or DataFrame).
- Key Steps:
- Connects to MongoDB using configuration.
- Exports the specified collection as a DataFrame.
- Saves the raw data to a local feature store directory.
- Splits the data into train and test sets.
- Artifacts Produced:
- Raw data file
- Train and test data files
- DataIngestionArtifact (metadata about the ingestion)
- Technical Details: Uses the `Proj1Data` class for MongoDB interaction; supports configurable collection names (see the sketch below).
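For illustration, a minimal sketch of the kind of MongoDB-to-DataFrame export `Proj1Data` performs, assuming the `MONGODB_URL` environment variable used elsewhere in this document; error handling is omitted:

```python
# Illustrative only: export a MongoDB collection to a pandas DataFrame.
import os

import pandas as pd
from pymongo import MongoClient

def export_collection_as_dataframe(database_name: str, collection_name: str) -> pd.DataFrame:
    client = MongoClient(os.environ["MONGODB_URL"])
    records = client[database_name][collection_name].find()
    df = pd.DataFrame(list(records))
    # Drop MongoDB's internal _id before writing to the feature store.
    if "_id" in df.columns:
        df = df.drop(columns=["_id"])
    return df
```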
### Data Validation

- Purpose: Ensures the ingested data meets schema and quality requirements.
- Key Steps:
- Loads the schema definition (from YAML).
- Checks number and names of columns.
- Validates data types and missing values.
- Optionally, checks for data drift or anomalies.
- Artifacts Produced:
- Validation report (JSON/YAML)
- DataValidationArtifact (validation results and paths)
- Technical Details: Schema validation against predefined YAML config, raises exceptions on failure.
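A minimal sketch of schema-based column validation; the exact layout of `config/schema.yaml` is an assumption (here, a `columns` list of single-key name-to-dtype mappings):

```python
# Hedged sketch: validate DataFrame columns against a YAML schema.
import pandas as pd
import yaml

def validate_columns(df: pd.DataFrame, schema_path: str = "config/schema.yaml") -> bool:
    with open(schema_path) as f:
        schema = yaml.safe_load(f)
    # Assumed schema layout: columns is a list of {name: dtype} mappings.
    expected = {list(col.keys())[0] for col in schema["columns"]}
    missing = expected - set(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    return True
```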
### Data Transformation

- Purpose: Prepares data for modeling by applying feature engineering and preprocessing.
- Key Steps:
- Loads train and test data.
- Applies transformations (e.g., scaling, encoding, imputation).
- Handles class imbalance (e.g., SMOTEENN).
- Saves transformed arrays and transformation objects (e.g., pipeline, scaler).
- Artifacts Produced:
- Transformed train and test arrays
- Preprocessing pipeline object
- DataTransformationArtifact (metadata and paths)
- Technical Details: Uses Scikit-learn pipelines for reproducibility, supports numerical and categorical features.
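A sketch of the preprocessing-plus-resampling pattern this stage describes, combining a Scikit-learn `ColumnTransformer` with `SMOTEENN` from imbalanced-learn; the column lists are placeholders that would come from `config/schema.yaml`:

```python
# Hedged sketch: fit a preprocessing object, then rebalance the training set.
from imblearn.combine import SMOTEENN
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def transform_and_balance(X_train, y_train, numeric_cols, categorical_cols):
    # The fitted preprocessor is saved as an artifact and reused at inference time.
    preprocessor = ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])
    X_t = preprocessor.fit_transform(X_train)
    # SMOTEENN oversamples the minority class, then cleans noisy samples.
    X_res, y_res = SMOTEENN().fit_resample(X_t, y_train)
    return preprocessor, X_res, y_res
```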
### Model Training

- Purpose: Trains a machine learning model using the transformed data.
- Key Steps:
- Loads transformed data and pipeline.
- Trains a classifier (e.g., RandomForest).
- Evaluates on validation/test set (accuracy, F1, etc.).
- Saves the trained model and metrics.
- Artifacts Produced:
- Trained model file
- Training metrics report
- ModelTrainerArtifact (model path, metrics)
- Technical Details: Configurable hyperparameters, supports multiple algorithms via estimator classes.
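A minimal training sketch; `RandomForestClassifier` is the example named above, and the hyperparameter values here are placeholders that would normally come from `config/model.yaml`:

```python
# Hedged sketch: train a classifier and compute the metrics this stage reports.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

def train_model(X_train, y_train, X_test, y_test):
    # Placeholder hyperparameters; real values come from configuration.
    model = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    metrics = {"accuracy": accuracy_score(y_test, preds), "f1": f1_score(y_test, preds)}
    return model, metrics
```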
### Model Evaluation

- Purpose: Compares the newly trained model with the best/production model.
- Key Steps:
- Loads the new and best (previous) models.
- Evaluates both on a holdout/test set.
- Compares metrics (e.g., F1 score).
- Decides if the new model should be promoted.
- Artifacts Produced:
- Evaluation report (comparison)
- ModelEvaluationArtifact (accept/reject decision)
- Technical Details: Uses S3Estimator for loading production models, threshold-based acceptance.
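The threshold-based acceptance might look like the following sketch; the threshold value and the `None`-means-no-production-model convention are assumptions:

```python
# Hedged sketch of F1-based, threshold-gated model promotion.
from sklearn.metrics import f1_score

def should_promote(new_model, prod_model, X_test, y_test, threshold: float = 0.02) -> bool:
    new_f1 = f1_score(y_test, new_model.predict(X_test))
    if prod_model is None:
        return True  # nothing in production yet, accept the first model
    prod_f1 = f1_score(y_test, prod_model.predict(X_test))
    # Promote only if the new model beats production F1 by at least `threshold`.
    return (new_f1 - prod_f1) > threshold
```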
### Model Pusher

- Purpose: Deploys/promotes the accepted model to a production location (e.g., S3 bucket).
- Key Steps:
- Uploads the model and related artifacts to cloud storage (S3).
- Updates model registry or deployment pointer.
- Artifacts Produced:
- ModelPusherArtifact (deployment metadata)
- Model and pipeline objects in S3/production
- Technical Details: Integrates with SimpleStorageService for S3 operations, supports versioning.
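At its core the push is an S3 upload; the project's `SimpleStorageService` presumably wraps calls of this kind. A minimal boto3 sketch (bucket and key names are placeholders):

```python
# Hedged sketch: upload a serialized model to S3 with boto3.
import boto3

def push_model_to_s3(local_path: str, bucket: str, key: str) -> None:
    # With bucket versioning enabled, S3 retains prior versions of the same key.
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, key)

# Example call (placeholder names):
# push_model_to_s3("artifact/model_trainer/model.pkl", "my-model-registry", "model-registry/model.pkl")
```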
### Monitoring & Logging

- Purpose: Tracks pipeline execution, errors, and performance for debugging and auditing.
- Key Steps:
- Logs each stage's start, end, and errors.
- Stores logs in the `logs/` directory.
- Optionally sends alerts or metrics to monitoring tools.
- Artifacts Produced:
- Log files
- Monitoring reports (if implemented)
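A minimal sketch of a timestamped file logger in the spirit of `src/logger` (the file-naming pattern is an assumption):

```python
# Hedged sketch: structured, timestamped logging into the logs/ directory.
import logging
import os
from datetime import datetime

LOG_DIR = "logs"
os.makedirs(LOG_DIR, exist_ok=True)
log_file = os.path.join(LOG_DIR, f"{datetime.now():%m_%d_%Y_%H_%M_%S}.log")

logging.basicConfig(
    filename=log_file,
    format="[ %(asctime)s ] %(name)s - %(levelname)s - %(message)s",
    level=logging.INFO,
)
logging.info("Data ingestion stage started")
```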
### Artifact Storage

- Purpose: Stores all intermediate and final outputs for reproducibility and traceability.
- Key Steps:
- Organizes artifacts by timestamp and stage.
- Stores raw, processed, and model files.
- Maintains metadata for each run.
- Artifacts Produced:
- All pipeline artifacts (data, models, reports)
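A sketch of the timestamped layout described above; the run-id format and stage folder names are assumptions:

```python
# Hedged sketch: create a timestamped artifact directory tree for one run.
import os
from datetime import datetime

run_id = datetime.now().strftime("%m_%d_%Y_%H_%M_%S")
for stage in ("data_ingestion", "data_transformation", "model_trainer"):
    os.makedirs(os.path.join("artifact", run_id, stage), exist_ok=True)
```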
### Utilities

- Purpose: Provides reusable functions, constants, and helpers for all pipeline stages.
- Key Steps:
- File I/O, YAML/JSON reading, saving/loading models.
- Common error handling and utility functions.
- Artifacts Produced:
- Utility modules
- Helper scripts
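Helpers of the kind `main_utils.py` provides might look like the following sketch (the project may serialize with dill rather than joblib; treat these as illustrative):

```python
# Hedged sketch: YAML reading and object save/load helpers.
import joblib
import yaml

def read_yaml_file(path: str) -> dict:
    with open(path) as f:
        return yaml.safe_load(f)

def save_object(path: str, obj) -> None:
    joblib.dump(obj, path)

def load_object(path: str):
    return joblib.load(path)
```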
## Code Structure

src/
├── cloud_storage/
│ └── aws_storage.py # AWS S3 utilities
├── components/
│ ├── data_ingestion.py # Data loading from MongoDB
│ ├── data_validation.py # Schema and quality validation
│ ├── data_transformation.py # Feature engineering
│ ├── model_trainer.py # ML model training
│ ├── model_evaluation.py # Model comparison
│ └── model_pusher.py # Model deployment
├── configuration/
│ └── mongo_db_connection.py # DB connection setup
├── constants/
│ └── __init__.py # Project constants
├── data_access/
│ └── proj1_data.py # Data access layer
├── entity/
│ ├── artifact_entity.py # Artifact definitions
│ ├── config_entity.py # Config entities
│ ├── estimator.py # Model estimators
│ └── s3_estimator.py # S3-based estimators
├── exception/
│ └── __init__.py # Custom exceptions
├── logger/
│ └── __init__.py # Logging setup
├── pipline/
│ ├── training_pipeline.py # Training orchestration
│ └── prediction_pipeline.py # Prediction pipeline
├── utils/
│ └── main_utils.py # Utility functions
└── test_utilities/
├── __init__.py # Package init
├── test_aws_connection.py # AWS connection testing
├── check_s3_bucket.py # S3 bucket status check
└── check_s3_versions.py # S3 versioning check
## Usage

# Run training pipeline
python demo.py
# Run FastAPI application (local)
python app.py
# Access at http://localhost:5000
# Test utilities (from src/test_utilities/)
python src/test_utilities/test_aws_connection.py
python src/test_utilities/check_s3_bucket.py

To run the training pipeline programmatically:

from src.pipline.training_pipeline import TrainPipeline

pipeline = TrainPipeline()
pipeline.run_pipeline()

## Testing

- Unit tests for each component in `tests/` (if available).
- Integration tests for the end-to-end pipeline.
- Run tests: `python -m pytest`
## Deployment

# Build Docker image
docker build -t vehicle-insurance-mlops .
# Run container locally
docker run -p 5000:5000 \
-e MONGODB_URL="your-mongodb-url" \
-e AWS_ACCESS_KEY_ID="your-key" \
-e AWS_SECRET_ACCESS_KEY="your-secret" \
-e AWS_DEFAULT_REGION="us-east-1" \
vehicle-insurance-mlops

This project uses GitHub Actions for automated deployment:
1. CI Job (Continuous Integration):
   - Builds the Docker image
   - Pushes it to AWS ECR
2. CD Job (Continuous Deployment):
   - Runs on a self-hosted EC2 runner
   - Pulls the latest image from ECR
   - Deploys the container on port 5000
Required GitHub Secrets:
- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `AWS_DEFAULT_REGION`
- `ECR_REPO`
- `MONGODB_URL`
## Monitoring & Logging

- Logs are stored in the `logs/` directory
- Structured logging with timestamps
- Error tracking and alerting (optional)
## Contributing

- Fork the repository
- Create a feature branch
- Make changes and add tests
- Submit a pull request
## License

This project is licensed under the MIT License; see the LICENSE file for details.
For more information, contact the development team or refer to the README.md.