Harshraj112/networkSecurity

Network Security ML Pipeline

A comprehensive end-to-end machine learning pipeline for network security and phishing detection, implementing modular components for data processing, model training, evaluation, and deployment.

ML Pipeline Workflow

🎯 Overview

This project implements a production-ready machine learning pipeline for network security analysis. The system automatically processes network data, trains models, evaluates performance, and deploys the best-performing models to cloud storage (AWS/Azure).

Key Features:

  • Modular component-based architecture
  • Automated data validation and transformation
  • Model evaluation with acceptance criteria
  • Cloud deployment integration
  • Comprehensive logging and artifact management
  • MongoDB integration for data storage

πŸ—οΈ Pipeline Architecture

The ML pipeline consists of six main components that work sequentially:

1. Data Ingestion Component

  • Configuration: Data Ingestion Config
  • Purpose: Fetches network security data from MongoDB database
  • Outputs: Data Ingestion Artifacts containing raw data for processing

2. Data Validation Component

  • Configuration: Data Validation Config
  • Purpose: Validates data quality, schema, and integrity
  • Checks: Missing values, data drift, schema compliance
  • Outputs: Data Validation Artifacts with validation reports

3. Data Transformation Component

  • Configuration: Data Transformation Config
  • Purpose: Preprocesses and transforms data for model training
  • Operations: Feature engineering, encoding, scaling, splitting
  • Outputs: Data Transformation Artifacts with processed datasets

4. Model Trainer Component

  • Configuration: Model Trainer Config
  • Purpose: Trains machine learning models on processed data
  • Outputs: Model Trainer Artifacts containing trained models and metrics

5. Model Evaluation Component

  • Configuration: Model Evaluation Config
  • Purpose: Evaluates model performance against acceptance criteria
  • Decision Point: Determines if model is accepted or rejected
  • Outputs: Model Evaluation Artifacts with performance metrics

6. Model Pusher Component

  • Configuration: Model Pusher Config
  • Purpose: Deploys accepted models to cloud storage
  • Deployment: Pushes models to AWS S3 or Azure Blob Storage
  • Outputs: Deployed model accessible for inference
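The sequential hand-off between these components can be sketched with dataclass artifacts. This is a minimal, self-contained illustration of the pattern; the names and fields are assumptions, not the package's actual entities (those live in `networksecurity/entity/artifact_entity.py`):

```python
from dataclasses import dataclass

# Illustrative artifact types standing in for the real entity classes.
@dataclass
class DataIngestionArtifact:
    raw_data_path: str

@dataclass
class ModelEvaluationArtifact:
    is_accepted: bool
    f1_score: float

def ingest() -> DataIngestionArtifact:
    # In the real component, this pulls records from MongoDB.
    return DataIngestionArtifact(raw_data_path="Artifacts/raw.csv")

def evaluate(min_f1: float = 0.8) -> ModelEvaluationArtifact:
    # In the real component, this compares the new model to a baseline.
    new_model_f1 = 0.91  # placeholder metric
    return ModelEvaluationArtifact(is_accepted=new_model_f1 >= min_f1,
                                   f1_score=new_model_f1)

def run_pipeline() -> ModelEvaluationArtifact:
    ingestion = ingest()
    # Validation, transformation, and training would run here,
    # each consuming the previous stage's artifact.
    return evaluate()

result = run_pipeline()
print(result.is_accepted)  # the pusher only runs when this is True
```

Each stage consumes the previous stage's artifact, which is what lets the components run independently yet in a fixed order.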

πŸ”§ Components

Data Ingestion

from networksecurity.components.data_ingestion import DataIngestion
  • Connects to MongoDB database
  • Extracts network security dataset
  • Stores raw data for validation

Data Validation

from networksecurity.components.data_validation import DataValidation
  • Validates data schema
  • Checks for missing values
  • Detects data drift
  • Generates validation reports
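A minimal stand-in for the missing-value and schema checks, assuming a pandas DataFrame and a hypothetical schema dict (the real component also runs drift detection against a reference dataset):

```python
import pandas as pd

# Hypothetical schema: expected columns for the phishing dataset.
schema = {"url_length": "float64", "has_ip_address": "int64"}

df = pd.DataFrame({
    "url_length": [54.0, 72.0, None],
    "has_ip_address": [0, 1, 0],
})

# Count missing values per column and check column names against the schema.
report = {
    "missing_values": df.isna().sum().to_dict(),
    "schema_ok": list(df.columns) == list(schema),
}
print(report)
```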

Data Transformation

from networksecurity.components.data_transformation import DataTransformation
  • Feature engineering
  • Categorical encoding
  • Feature scaling
  • Train-test splitting
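The scaling and splitting steps can be sketched with scikit-learn; the toy matrix below stands in for the real transformed dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy feature matrix standing in for the network security features.
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0], [4.0, 500.0]])
y = np.array([0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Fit the scaler on training data only to avoid leaking test statistics.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
```

Fitting the scaler before the split (or on the full dataset) would leak test-set statistics into training, which is why the split comes first.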

Model Trainer

from networksecurity.components.model_trainer import ModelTrainer
  • Trains ML models
  • Performs hyperparameter tuning
  • Generates training metrics
  • Saves trained models

Model Evaluation

from networksecurity.components.model_evaluation import ModelEvaluation
  • Evaluates model performance
  • Compares against baseline
  • Applies acceptance criteria
  • Determines deployment eligibility
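The accept/reject decision can be expressed as a small predicate. This is a hypothetical acceptance rule (the repository's actual thresholds live in the Model Evaluation Config):

```python
def should_accept(new_f1: float, baseline_f1: float,
                  min_f1: float = 0.8, min_improvement: float = 0.02) -> bool:
    """Accept only if the new model clears an absolute F1 floor AND
    beats the current baseline by a minimum margin."""
    return new_f1 >= min_f1 and (new_f1 - baseline_f1) >= min_improvement

print(should_accept(0.91, 0.85))  # True: above the floor, improved enough
print(should_accept(0.86, 0.85))  # False: improvement below the margin
```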

Model Pusher

from networksecurity.components.model_pusher import ModelPusher
  • Deploys accepted models
  • Uploads to cloud storage
  • Manages model versioning
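The deployment step can be illustrated with a local copy as a stand-in for the cloud upload (the real component would use boto3 or the Azure SDK against S3/Blob Storage):

```python
import shutil
import tempfile
from pathlib import Path

def push_model(model_path: str, target_dir: str) -> str:
    """Local stand-in for the cloud upload: copy the accepted model
    into a deployment directory and return the deployed path."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    return shutil.copy2(model_path, target / Path(model_path).name)

# Demo with temporary paths instead of final_model/ and a real bucket.
with tempfile.TemporaryDirectory() as tmp:
    model = Path(tmp) / "model.pkl"
    model.write_bytes(b"serialized-model")
    deployed = push_model(str(model), str(Path(tmp) / "final_model"))
    print(Path(deployed).exists())  # True
```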

βš™οΈ Configuration Files

Each component is configured through dedicated configuration files:

  • πŸ“„ Data Ingestion Config - Database connection, collection settings
  • πŸ“„ Data Validation Config - Validation rules, drift thresholds
  • πŸ“„ Data Transformation Config - Preprocessing parameters
  • πŸ“„ Model Trainer Config - Model hyperparameters, algorithms
  • πŸ“„ Model Evaluation Config - Acceptance criteria, metrics
  • πŸ“„ Model Pusher Config - Cloud credentials, deployment settings
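These configs typically follow the dataclass pattern used in `entity/config_entity.py`. The field names below are illustrative assumptions, not the repository's exact API:

```python
import os
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class TrainingPipelineConfig:
    # Each run gets its own timestamped directory under Artifacts/.
    artifact_dir: str = field(
        default_factory=lambda: os.path.join(
            "Artifacts", datetime.now().strftime("%m_%d_%Y_%H_%M_%S")))

@dataclass
class DataIngestionConfig:
    # Hypothetical database and collection names.
    database_name: str = "network_security"
    collection_name: str = "phishing_data"

cfg = DataIngestionConfig()
print(cfg.collection_name)
```

Keeping configuration in dataclasses gives every component a typed, self-documenting contract instead of loose dictionaries.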

πŸ“¦ Installation

Prerequisites

  • Python 3.8+
  • MongoDB
  • AWS/Azure account (for deployment)

Setup

  1. Clone the repository
git clone <repository-url>
cd networksecurity
  2. Install dependencies
pip install -r requirements.txt
  3. Install the package
pip install -e .
  4. Configure environment variables
# MongoDB
MONGO_DB_URL=<your-mongodb-url>

# AWS (Optional)
AWS_ACCESS_KEY_ID=<your-access-key>
AWS_SECRET_ACCESS_KEY=<your-secret-key>
AWS_REGION=us-east-1
AWS_ECR_LOGIN_URI=<your-ecr-uri>
ECR_REPOSITORY_NAME=networkssecurity
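A small helper for reading these variables fails fast with a clear message instead of a late connection error. This is a sketch, not code from the repository:

```python
import os

def get_required_env(name: str) -> str:
    """Return an environment variable's value, or fail fast if unset."""
    value = os.getenv(name)
    if value is None:
        raise EnvironmentError(f"Required environment variable {name} is not set")
    return value

# Example (assumes MONGO_DB_URL has been exported as shown above):
# mongo_url = get_required_env("MONGO_DB_URL")
```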

πŸš€ Usage

Running the Complete Pipeline

Execute the entire ML pipeline from data ingestion to model deployment:

python main.py

Running the Pipeline Programmatically

from networksecurity.pipeline.training_pipeline import TrainingPipeline

# Initialize and run pipeline
pipeline = TrainingPipeline()
pipeline.run_pipeline()

Making Predictions

Start the Flask app and submit data through the web UI:

python app.py

Pushing Data to MongoDB

python push_data.py

🌐 Deployment

Docker Setup

Build Docker image:

docker build -t networksecurity .

Run container:

docker run -p 8080:8080 networksecurity

EC2 Deployment

Setup Docker on EC2:

# Update system
sudo apt-get update -y
sudo apt-get upgrade -y

# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Add user to docker group
sudo usermod -aG docker ubuntu
newgrp docker

GitHub Secrets Configuration

Set up the following secrets in your GitHub repository:

  • AWS_ACCESS_KEY_ID - Your AWS access key
  • AWS_SECRET_ACCESS_KEY - Your AWS secret key
  • AWS_REGION - AWS region (e.g., us-east-1)
  • AWS_ECR_LOGIN_URI - ECR login URI
  • ECR_REPOSITORY_NAME - ECR repository name

πŸ“ Project Structure

networksecurity/
β”œβ”€β”€ networksecurity/               # Main package
β”‚   β”œβ”€β”€ components/               # Pipeline components
β”‚   β”‚   β”œβ”€β”€ data_ingestion.py
β”‚   β”‚   β”œβ”€β”€ data_validation.py
β”‚   β”‚   β”œβ”€β”€ data_transformation.py
β”‚   β”‚   β”œβ”€β”€ model_trainer.py
β”‚   β”‚   β”œβ”€β”€ model_evaluation.py
β”‚   β”‚   └── model_pusher.py
β”‚   β”œβ”€β”€ entity/                   # Configuration entities
β”‚   β”‚   β”œβ”€β”€ config_entity.py
β”‚   β”‚   └── artifact_entity.py
β”‚   β”œβ”€β”€ pipeline/                 # Pipeline orchestration
β”‚   β”‚   └── training_pipeline.py
β”‚   β”œβ”€β”€ exception/                # Custom exceptions
β”‚   β”œβ”€β”€ logging/                  # Logging utilities
β”‚   β”œβ”€β”€ constant/                 # Constants
β”‚   β”œβ”€β”€ utils/                    # Utility functions
β”‚   └── cloud/                    # Cloud integration
β”œβ”€β”€ Artifacts/                    # Generated artifacts
β”œβ”€β”€ logs/                         # Application logs
β”œβ”€β”€ final_model/                  # Deployed models
β”œβ”€β”€ templates/                    # Web UI templates
β”œβ”€β”€ main.py                       # Pipeline execution
β”œβ”€β”€ app.py                        # Flask application
β”œβ”€β”€ push_data.py                  # Data upload script
β”œβ”€β”€ setup.py                      # Package setup
β”œβ”€β”€ requirements.txt              # Dependencies
β”œβ”€β”€ Dockerfile                    # Docker configuration
└── README.md                     # Documentation

πŸ”„ Workflow

  1. Data Ingestion: Fetch data from MongoDB
  2. Data Validation: Validate data quality and schema
  3. Data Transformation: Preprocess and transform features
  4. Model Training: Train ML models with hyperparameter tuning
  5. Model Evaluation: Evaluate and validate model performance
  6. Decision: Accept or reject model based on criteria
  7. Model Deployment: Push accepted models to cloud (AWS/Azure)

πŸ“Š Artifacts

All components generate artifacts stored in the Artifacts/ directory:

  • Raw data files
  • Validation reports
  • Transformed datasets
  • Trained models
  • Evaluation metrics
  • Deployment logs

πŸ› οΈ Technologies Used

  • Python 3.8+ - Core programming language
  • MongoDB - Data storage
  • Scikit-learn - Machine learning
  • Pandas - Data manipulation
  • Flask - Web application
  • Docker - Containerization
  • AWS/Azure - Cloud deployment
  • GitHub Actions - CI/CD

πŸ“ Logging

Comprehensive logging is implemented throughout the pipeline:

  • Component-level logs
  • Error tracking
  • Artifact generation logs
  • All logs stored in logs/ directory

🀝 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to the branch
  5. Create a Pull Request

πŸ“„ License

This project is part of a network security ML initiative.

πŸ‘€ Author

Harsh Raj


Note: Ensure all cloud credentials and database connections are properly configured before running the pipeline. The model pusher component will only deploy models that pass the evaluation criteria.
