A comprehensive end-to-end machine learning pipeline for network security and phishing detection, implementing modular components for data processing, model training, evaluation, and deployment.
- Overview
- Pipeline Architecture
- Components
- Configuration Files
- Installation
- Usage
- Deployment
- Project Structure
This project implements a production-ready machine learning pipeline for network security analysis. The system automatically processes network data, trains models, evaluates performance, and deploys the best-performing models to cloud storage (AWS/Azure).
Key Features:
- Modular component-based architecture
- Automated data validation and transformation
- Model evaluation with acceptance criteria
- Cloud deployment integration
- Comprehensive logging and artifact management
- MongoDB integration for data storage
The ML pipeline consists of six main components that work sequentially:
1. **Data Ingestion**
   - Configuration: `Data Ingestion Config`
   - Purpose: Fetches network security data from MongoDB database
   - Outputs: `Data Ingestion Artifacts` containing raw data for processing
2. **Data Validation**
   - Configuration: `Data Validation Config`
   - Purpose: Validates data quality, schema, and integrity
   - Checks: Missing values, data drift, schema compliance
   - Outputs: `Data Validation Artifacts` with validation reports
3. **Data Transformation**
   - Configuration: `Data Transformation Config`
   - Purpose: Preprocesses and transforms data for model training
   - Operations: Feature engineering, encoding, scaling, splitting
   - Outputs: `Data Transformation Artifacts` with processed datasets
4. **Model Trainer**
   - Configuration: `Model Trainer Config`
   - Purpose: Trains machine learning models on processed data
   - Outputs: `Model Trainer Artifacts` containing trained models and metrics
5. **Model Evaluation**
   - Configuration: `Model Evaluation Config`
   - Purpose: Evaluates model performance against acceptance criteria
   - Decision Point: Determines if model is accepted or rejected
   - Outputs: `Model Evaluation Artifacts` with performance metrics
6. **Model Pusher**
   - Configuration: `Model Pusher Config`
   - Purpose: Deploys accepted models to cloud storage
   - Deployment: Pushes models to AWS S3 or Azure Blob Storage
   - Outputs: Deployed model accessible for inference
`from networksecurity.components.data_ingestion import DataIngestion`

- Connects to MongoDB database
- Extracts network security dataset
- Stores raw data for validation
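A minimal sketch of what this step could look like, assuming the `MONGO_DB_URL` environment variable from the installation section; the database name, collection name, and artifact path are hypothetical:

```python
# Hypothetical sketch of the ingestion step, not the actual DataIngestion
# implementation; database, collection, and artifact path are assumptions.
import os

import pandas as pd
from pymongo import MongoClient

def fetch_raw_data(database: str, collection: str) -> pd.DataFrame:
    client = MongoClient(os.environ["MONGO_DB_URL"])
    records = list(client[database][collection].find())
    df = pd.DataFrame(records)
    # MongoDB adds an "_id" field that is not a model feature
    return df.drop(columns=["_id"], errors="ignore")

raw_df = fetch_raw_data("network_security", "phishing_data")
os.makedirs("Artifacts/data_ingestion", exist_ok=True)
raw_df.to_csv("Artifacts/data_ingestion/raw.csv", index=False)
```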
`from networksecurity.components.data_validation import DataValidation`

- Validates data schema
- Checks for missing values
- Detects data drift
- Generates validation reports
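As an illustration of these checks, the sketch below builds a small validation report; the Kolmogorov-Smirnov drift test and the 0.05 threshold are assumptions rather than the exact logic in `data_validation.py`:

```python
# Illustrative validation report; the KS drift test and 0.05 threshold are
# assumptions, not necessarily what data_validation.py implements.
import pandas as pd
from scipy.stats import ks_2samp

def validate(reference: pd.DataFrame, current: pd.DataFrame, expected_columns: list) -> dict:
    report = {
        "schema_ok": list(current.columns) == expected_columns,
        "missing_values": current.isnull().sum().to_dict(),
        "drifted_columns": [],
    }
    for col in current.select_dtypes("number").columns:
        # a small p-value suggests the feature distribution has shifted
        if ks_2samp(reference[col], current[col]).pvalue < 0.05:
            report["drifted_columns"].append(col)
    return report
```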
`from networksecurity.components.data_transformation import DataTransformation`

- Feature engineering
- Categorical encoding
- Feature scaling
- Train-test splitting
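A minimal scikit-learn sketch of these operations; the target column name `Result`, the median imputation, and the 80/20 split are illustrative assumptions, and categorical encoding is omitted for brevity:

```python
# Sketch of the preprocessing step; column names, imputation strategy, and
# split ratio are assumptions for illustration.
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def transform(df: pd.DataFrame, target: str = "Result"):
    X, y = df.drop(columns=[target]), df[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    preprocessor = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),  # fill missing values
        ("scaler", StandardScaler()),                   # standardize features
    ])
    X_train = preprocessor.fit_transform(X_train)
    X_test = preprocessor.transform(X_test)             # fit only on training data
    return X_train, X_test, y_train, y_test, preprocessor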
`from networksecurity.components.model_trainer import ModelTrainer`

- Trains ML models
- Performs hyperparameter tuning
- Generates training metrics
- Saves trained models
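For example, training with grid-search tuning might look like the sketch below; the RandomForest model, parameter grid, scoring metric, and artifact path are assumptions:

```python
# Sketch of model training with hyperparameter tuning; the algorithm, grid,
# scoring metric, and artifact path are illustrative assumptions.
import os
import pickle

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def train(X_train, y_train):
    grid = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid={"n_estimators": [100, 200], "max_depth": [None, 10]},
        scoring="f1",
        cv=5,
    )
    grid.fit(X_train, y_train)

    os.makedirs("Artifacts/model_trainer", exist_ok=True)
    with open("Artifacts/model_trainer/model.pkl", "wb") as f:
        pickle.dump(grid.best_estimator_, f)  # persist the best model
    return grid.best_estimator_, grid.best_score_
```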
`from networksecurity.components.model_evaluation import ModelEvaluation`

- Evaluates model performance
- Compares against baseline
- Applies acceptance criteria
- Determines deployment eligibility
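The accept/reject decision could be expressed as below; the F1 metric and the minimum-improvement threshold are assumptions, not the project's exact acceptance criteria:

```python
# Sketch of the accept/reject decision; metric and threshold are assumptions.
from sklearn.metrics import f1_score

def evaluate(candidate, baseline, X_test, y_test, min_improvement: float = 0.02) -> dict:
    candidate_f1 = f1_score(y_test, candidate.predict(X_test))
    baseline_f1 = f1_score(y_test, baseline.predict(X_test)) if baseline is not None else 0.0
    return {
        "candidate_f1": candidate_f1,
        "baseline_f1": baseline_f1,
        # accept only if the candidate beats the current model by the margin
        "is_model_accepted": candidate_f1 >= baseline_f1 + min_improvement,
    }
```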
`from networksecurity.components.model_pusher import ModelPusher`

- Deploys accepted models
- Uploads to cloud storage
- Manages model versioning
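A sketch of the push step using boto3 for the AWS S3 path; the bucket name and the timestamp-based versioning scheme are assumptions:

```python
# Sketch of deploying an accepted model to S3; bucket name and key layout
# (timestamped folders as a simple versioning scheme) are assumptions.
from datetime import datetime

import boto3

def push_model(local_path: str, bucket: str = "networksecurity-models") -> str:
    key = f"models/{datetime.now():%Y%m%d_%H%M%S}/model.pkl"
    boto3.client("s3").upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"
```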
Each component is configured through dedicated configuration files:
- `Data Ingestion Config` - Database connection, collection settings
- `Data Validation Config` - Validation rules, drift thresholds
- `Data Transformation Config` - Preprocessing parameters
- `Model Trainer Config` - Model hyperparameters, algorithms
- `Model Evaluation Config` - Acceptance criteria, metrics
- `Model Pusher Config` - Cloud credentials, deployment settings
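Each config is typically a small entity object; below is a hypothetical shape for the ingestion config (the real fields live in `networksecurity/entity/config_entity.py` and may differ):

```python
# Hypothetical config entity; the actual fields in config_entity.py may differ.
from dataclasses import dataclass

@dataclass
class DataIngestionConfig:
    database_name: str = "network_security"
    collection_name: str = "phishing_data"
    raw_data_dir: str = "Artifacts/data_ingestion"
    train_test_split_ratio: float = 0.2
```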
- Python 3.8+
- MongoDB
- AWS/Azure account (for deployment)
- Clone the repository

  ```bash
  git clone <repository-url>
  cd networksecurity
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Install the package

  ```bash
  pip install -e .
  ```

- Configure environment variables

  ```bash
  # MongoDB
  MONGO_DB_URL=<your-mongodb-url>

  # AWS (Optional)
  AWS_ACCESS_KEY_ID=<your-access-key>
  AWS_SECRET_ACCESS_KEY=<your-secret-key>
  AWS_REGION=us-east-1
  AWS_ECR_LOGIN_URI=<your-ecr-uri>
  ECR_REPOSITORY_NAME=networkssecurity
  ```

Execute the entire ML pipeline from data ingestion to model deployment:
```bash
python main.py
```

Or run it programmatically:

```python
from networksecurity.pipeline.training_pipeline import TrainingPipeline

# Initialize and run pipeline
pipeline = TrainingPipeline()
pipeline.run_pipeline()
```

Start the Flask web application:

```bash
python app.py
```

Upload data to MongoDB:

```bash
python push_data.py
```

Build Docker image:

```bash
docker build -t networksecurity .
```

Run container:

```bash
docker run -p 8080:8080 networksecurity
```

Setup Docker on EC2:
```bash
# Update system
sudo apt-get update -y
sudo apt-get upgrade

# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Add user to docker group
sudo usermod -aG docker ubuntu
newgrp docker
```

Set up the following secrets in your GitHub repository:

- `AWS_ACCESS_KEY_ID` - Your AWS access key
- `AWS_SECRET_ACCESS_KEY` - Your AWS secret key
- `AWS_REGION` - AWS region (e.g., us-east-1)
- `AWS_ECR_LOGIN_URI` - ECR login URI
- `ECR_REPOSITORY_NAME` - ECR repository name
```
networksecurity/
├── networksecurity/           # Main package
│   ├── components/            # Pipeline components
│   │   ├── data_ingestion.py
│   │   ├── data_validation.py
│   │   ├── data_transformation.py
│   │   ├── model_trainer.py
│   │   ├── model_evaluation.py
│   │   └── model_pusher.py
│   ├── entity/                # Configuration entities
│   │   ├── config_entity.py
│   │   └── artifact_entity.py
│   ├── pipeline/              # Pipeline orchestration
│   │   └── training_pipeline.py
│   ├── exception/             # Custom exceptions
│   ├── logging/               # Logging utilities
│   ├── constant/              # Constants
│   ├── utils/                 # Utility functions
│   └── cloud/                 # Cloud integration
├── Artifacts/                 # Generated artifacts
├── logs/                      # Application logs
├── final_model/               # Deployed models
├── templates/                 # Web UI templates
├── main.py                    # Pipeline execution
├── app.py                     # Flask application
├── push_data.py               # Data upload script
├── setup.py                   # Package setup
├── requirements.txt           # Dependencies
├── Dockerfile                 # Docker configuration
└── README.md                  # Documentation
```
- Data Ingestion: Fetch data from MongoDB
- Data Validation: Validate data quality and schema
- Data Transformation: Preprocess and transform features
- Model Training: Train ML models with hyperparameter tuning
- Model Evaluation: Evaluate and validate model performance
- Decision: Accept or reject model based on criteria
- Model Deployment: Push accepted models to cloud (AWS/Azure)
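Conceptually, this flow is chained inside `TrainingPipeline.run_pipeline()`; the sketch below uses assumed `initiate_*` method names and constructor wiring, so the real signatures in `training_pipeline.py` may differ:

```python
# Conceptual flow only; method names and constructor arguments are assumptions,
# the real orchestration lives in networksecurity/pipeline/training_pipeline.py.
from networksecurity.components.data_ingestion import DataIngestion
from networksecurity.components.data_validation import DataValidation
from networksecurity.components.data_transformation import DataTransformation
from networksecurity.components.model_trainer import ModelTrainer
from networksecurity.components.model_evaluation import ModelEvaluation
from networksecurity.components.model_pusher import ModelPusher

def run_pipeline():
    ingestion = DataIngestion().initiate_data_ingestion()
    validation = DataValidation(ingestion).initiate_data_validation()
    transformation = DataTransformation(validation).initiate_data_transformation()
    trainer = ModelTrainer(transformation).initiate_model_trainer()
    evaluation = ModelEvaluation(trainer).initiate_model_evaluation()
    if evaluation.is_model_accepted:  # decision point: deploy only accepted models
        ModelPusher(evaluation).initiate_model_pusher()
```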
All components generate artifacts stored in the Artifacts/ directory:
- Raw data files
- Validation reports
- Transformed datasets
- Trained models
- Evaluation metrics
- Deployment logs
- Python 3.8+ - Core programming language
- MongoDB - Data storage
- Scikit-learn - Machine learning
- Pandas - Data manipulation
- Flask - Web application
- Docker - Containerization
- AWS/Azure - Cloud deployment
- GitHub Actions - CI/CD
Comprehensive logging is implemented throughout the pipeline:
- Component-level logs
- Error tracking
- Artifact generation logs
- All logs stored in the `logs/` directory
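A minimal logging setup consistent with this description; the timestamped file naming is an assumption, and the project's own helper lives in `networksecurity/logging/`:

```python
# Possible logging setup; the file naming scheme is an assumption.
import logging
import os
from datetime import datetime

LOG_FILE = f"{datetime.now():%m_%d_%Y_%H_%M_%S}.log"
os.makedirs("logs", exist_ok=True)

logging.basicConfig(
    filename=os.path.join("logs", LOG_FILE),
    format="[%(asctime)s] %(name)s - %(levelname)s - %(message)s",
    level=logging.INFO,
)
```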
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is part of a network security ML initiative.
Harsh Raj
- Email: [email protected]
Note: Ensure all cloud credentials and database connections are properly configured before running the pipeline. The model pusher component will only deploy models that pass the evaluation criteria.
