melisasvr/AI-Driven-Malware-Analysis-and-Classification

πŸ›‘οΈ AI-Driven Malware Analysis and Classification

A deep learning-based malware analysis tool that uses PyTorch to classify malware families through static analysis and behavioral pattern recognition.

✨ Features

  • Deep Learning Classification: Hybrid CNN architecture combining static features and byte sequences
  • Multi-Family Detection: Classifies samples into 5 classes: four malware families (Trojan, Ransomware, Worm, Spyware) plus Benign
  • PE File Analysis: Extracts features from Windows PE executables
  • Entropy Analysis: Detects packed/encrypted malware
  • String Analysis: Identifies suspicious keywords and patterns
  • Interactive Dashboard: HTML dashboard with interactive Chart.js charts
  • High Accuracy: Reaches about 93% validation accuracy on the synthetic mock dataset
  • No External APIs: Fully offline analysis

🚀 Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager

Required Dependencies

pip install torch numpy pandas scikit-learn

Optional Dependencies

# For disassembly analysis (optional)
pip install capstone

⚡ Quick Start

1. Generate Mock Dataset

  • Create a synthetic malware dataset for testing:
python mock_dataset_generator.py --samples 100

This creates:

dataset/
├── trojan/       (100 samples)
├── ransomware/   (100 samples)
├── worm/         (100 samples)
├── spyware/      (100 samples)
└── benign/       (100 samples)

2. Train the Model

  • Train the classifier on the generated dataset:
python main.py

Expected output:

============================================================
AI-Driven Malware Analysis Tool (PyTorch Edition)
============================================================
Using device: cpu

📂 Loading dataset...
   ✓ Loaded 100 trojan samples
   ✓ Loaded 100 ransomware samples
   ...

Starting training...
Epoch [30/30] | Train Acc: 85.50% | Val Acc: 93.00%

✓ Training complete! Model saved as 'malware_classifier'
✓ Dashboard generated: index.html

3. View Results

Open index.html in your browser to see:

  • Performance metrics (Accuracy, Precision, Recall, F1-Score)
  • Training history charts
  • Confusion matrix
  • Per-class performance
  • Entropy analysis

4. Analyze New Files

from main import MalwareAnalyzer

# Load trained model
analyzer = MalwareAnalyzer()
analyzer.classifier.load_model("malware_classifier")

# Analyze a file
report = analyzer.analyze_file("suspicious_file.exe")

# Export report
analyzer.export_report(report, "analysis_report.json")

πŸ“ Project Structure

AI-Driven-Malware-Analysis/
│
├── main.py                      # Main malware analyzer
├── mock_dataset_generator.py    # Generate synthetic malware samples
├── dashboard_generator.py       # Auto-generate HTML dashboard
├── index.html                   # Interactive dashboard (auto-generated)
├── README.md                    # This file
│
├── dataset/                     # Training dataset
│   ├── trojan/
│   ├── ransomware/
│   ├── worm/
│   ├── spyware/
│   └── benign/
│
├── test_samples/                # Test samples for analysis
│
├── malware_classifier_model.pth         # Trained model weights
└── malware_classifier_preprocessors.pkl # Scalers and encoders

💻 Usage

Training a Model

from main import MalwareAnalyzer
from pathlib import Path

analyzer = MalwareAnalyzer()

# Load dataset
features_list = []
labels = []

for family_dir in Path("dataset").iterdir():
    if family_dir.is_dir():
        for file_path in family_dir.glob("*"):
            features = analyzer.pe_analyzer.extract_features(file_path)
            if features:
                features_list.append(features)
                labels.append(family_dir.name)

# Train
analyzer.classifier.train(features_list, labels, epochs=30)

# Save
analyzer.classifier.save_model("my_model")

Analyzing Files

# Single file analysis
report = analyzer.analyze_file("sample.exe")

print(f"Predicted: {report['predicted_class']}")
print(f"Confidence: {report['confidence']:.2%}")
print(f"Entropy: {report['entropy']:.2f}")

Batch Analysis

# Analyze entire directory
results = analyzer.batch_analyze("malware_samples/", recursive=True)

# Export all results
for i, report in enumerate(results):
    analyzer.export_report(report, f"report_{i}.json")

🧠 Model Architecture

Hybrid CNN Architecture

  • The classifier uses a hybrid deep learning architecture:

Static Features Branch:

  • Input: 7 numerical features (file size, entropy, string counts, etc.)
  • Architecture: Dense layers (64 → 32 neurons)
  • Activation: ReLU with Dropout (0.3)

Byte Sequence Branch:

  • Input: 256-byte histogram
  • Architecture: 1D CNN (64 → 32 filters)
  • Pooling: MaxPooling + AdaptiveMaxPooling

Merged Classification:

  • Combined features: 64 dimensions
  • Output: 5 classes (softmax)
  • Optimizer: Adam
  • Loss: Cross-Entropy
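
The two branches and the merge step described above could be sketched in PyTorch roughly as follows. This is an illustrative reconstruction, not the code in main.py: the class name `HybridMalwareNet`, the kernel sizes, the padding, and the exact placement of dropout are assumptions, since the README only gives layer widths. The model returns raw logits because `nn.CrossEntropyLoss` applies the softmax internally.

```python
import torch
import torch.nn as nn


class HybridMalwareNet(nn.Module):
    """Hypothetical sketch of the hybrid architecture (names and kernel sizes assumed)."""

    def __init__(self, n_static: int = 7, n_classes: int = 5):
        super().__init__()
        # Static features branch: dense 64 -> 32 with ReLU and dropout 0.3
        self.static_branch = nn.Sequential(
            nn.Linear(n_static, 64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, 32),
            nn.ReLU(),
        )
        # Byte histogram branch: 1D CNN with 64 -> 32 filters
        self.byte_branch = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),  # collapse the length axis to one value per filter
        )
        # Merged classifier: 32 + 32 = 64 combined features -> 5 classes
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, static_features: torch.Tensor, byte_hist: torch.Tensor) -> torch.Tensor:
        s = self.static_branch(static_features)                   # (batch, 32)
        b = self.byte_branch(byte_hist.unsqueeze(1)).squeeze(-1)  # (batch, 32)
        return self.classifier(torch.cat([s, b], dim=1))          # (batch, 5) logits


model = HybridMalwareNet()
logits = model(torch.zeros(4, 7), torch.zeros(4, 256))
print(logits.shape)
```

Training such a model with the listed settings would pair it with `torch.optim.Adam(model.parameters())` and `nn.CrossEntropyLoss()`.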

Feature Extraction

  • PE Header Parsing: DOS/PE headers, sections, characteristics
  • Entropy Calculation: Shannon entropy for packing detection
  • String Analysis: ASCII/Unicode strings with suspicious keyword matching
  • Byte Histogram: Frequency distribution of byte values
  • N-gram Analysis: Bigram patterns
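
As a rough illustration of three of these features, the sketch below computes a normalized byte histogram, the Shannon entropy derived from it, and printable ASCII strings matched against a keyword list. The function names and the keyword list are made up for this example and are not taken from main.py. Entropy is measured in bits per byte, so it ranges from 0 (a single repeated byte) to 8 (all byte values equally likely); values near 8 are what usually hint at packed or encrypted content.

```python
import math
import re
from collections import Counter
from typing import List

# Illustrative keywords only; the project's actual list is not shown in this README.
SUSPICIOUS_KEYWORDS = (b"createremotethread", b"encrypt", b"keylog")


def byte_histogram(data: bytes) -> List[float]:
    """Relative frequency of each of the 256 possible byte values."""
    counts = Counter(data)
    total = len(data) or 1  # avoid division by zero on empty input
    return [counts.get(value, 0) / total for value in range(256)]


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 to 8.0)."""
    return sum(-p * math.log2(p) for p in byte_histogram(data) if p > 0)


def ascii_strings(data: bytes, min_len: int = 4) -> List[bytes]:
    """Runs of at least min_len printable ASCII characters."""
    return re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)


def suspicious_string_count(data: bytes) -> int:
    """Number of extracted strings containing any keyword (case-insensitive)."""
    return sum(
        any(keyword in s.lower() for keyword in SUSPICIOUS_KEYWORDS)
        for s in ascii_strings(data)
    )


print(shannon_entropy(bytes(range(256))))  # uniform bytes: maximum entropy
print(shannon_entropy(b"\x00" * 64))       # one repeated byte: minimum entropy
```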

📊 Dashboard

  • The auto-generated HTML dashboard includes:

Visualizations

  • Training History: Loss and accuracy curves
  • Confusion Matrix: Classification accuracy per family
  • Class Distribution: Dataset balance
  • Entropy Analysis: Distribution across families
  • ROC Curves: Per-class performance

Metrics

  • Overall accuracy, precision, recall, F1-score
  • Per-class performance breakdown
  • Recent analysis results
  • Training summary statistics

Features

  • Responsive design (mobile-friendly)
  • Interactive charts (Chart.js)
  • Real-time data updates
  • Gradient animations
  • Fully self-contained (offline-capable)

📦 Dataset

Mock Dataset

The included mock dataset generator creates synthetic samples with realistic characteristics:

  • Trojan: Suspicious API calls, remote thread creation, registry modifications

  • Ransomware: High entropy (simulated encryption), ransom notes, crypto keywords

  • Worm: Network strings, propagation indicators, scanning patterns

  • Spyware: Keylogger strings, screenshot APIs, monitoring functions

  • Benign: Low entropy, legitimate system strings
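
To make the idea concrete, here is a minimal sketch of how such a synthetic sample might be assembled. It is not the logic of mock_dataset_generator.py: the `MOCK_MARKERS` table, the `make_mock_sample` function, and the layout (an `MZ` stub, marker strings, then filler) are assumptions for illustration. Random filler stands in for encrypted content in the ransomware case, while repeated text keeps entropy low for the other classes.

```python
import os

# Hypothetical marker strings per class, loosely based on the descriptions above.
MOCK_MARKERS = {
    "trojan": [b"CreateRemoteThread", b"RegSetValueExA"],
    "ransomware": [b"YOUR FILES ARE ENCRYPTED", b"AES-256"],
    "worm": [b"192.168.1.", b"port scan"],
    "spyware": [b"GetAsyncKeyState", b"screenshot"],
    "benign": [b"Microsoft Corporation", b"kernel32.dll"],
}


def make_mock_sample(family: str, size: int = 4096) -> bytes:
    """Build harmless synthetic bytes that mimic one class's static traits."""
    header = b"MZ" + bytes(62)  # 'MZ' magic followed by zero padding
    body = b"\x00".join(MOCK_MARKERS[family])
    if family == "ransomware":
        filler = os.urandom(size)  # random bytes approximate encrypted data
    else:
        text = b"This program cannot be run in DOS mode. "
        filler = (text * (size // len(text) + 1))[:size]  # low-entropy padding
    return header + body + filler


sample = make_mock_sample("ransomware")
print(len(sample), len(set(sample)))  # total size and number of distinct byte values
```

The resulting bytes are inert data, not executable code, which is what makes a mock dataset safe to generate and share.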

Real Malware Samples

For production use, obtain real malware from:

⚠️ Warning: Handle real malware samples in isolated environments only!

📈 Results

Model Performance (Mock Dataset)

| Metric           | Value  |
|------------------|--------|
| Overall Accuracy | 93.00% |
| Precision        | 93.20% |
| Recall           | 93.00% |
| F1-Score         | 93.00% |

Per-Class Performance

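| Class      | Precision | Recall | F1-Score | Support |
|------------|-----------|--------|----------|---------|
| Benign     | 100%      | 100%   | 100%     | 20      |
| Ransomware | 100%      | 100%   | 100%     | 20      |
| Spyware    | 78%       | 90%    | 84%      | 20      |
| Trojan     | 100%      | 100%   | 100%     | 20      |
| Worm       | 88%       | 75%    | 81%      | 20      |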
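
For readers who want to see how these per-class figures relate to a confusion matrix, the sketch below works through the arithmetic. The matrix is an inferred example, not data from the project: it is one matrix that reproduces the rounded values in the table (2 Spyware samples predicted as Worm, 5 Worm samples predicted as Spyware), and the actual confusion matrix may differ.

```python
CLASSES = ["Benign", "Ransomware", "Spyware", "Trojan", "Worm"]

# Rows are true classes, columns are predicted classes.
# Inferred for illustration; chosen to match the rounded table values.
CONFUSION = [
    [20, 0, 0, 0, 0],
    [0, 20, 0, 0, 0],
    [0, 0, 18, 0, 2],
    [0, 0, 0, 20, 0],
    [0, 0, 5, 0, 15],
]


def per_class_metrics(matrix):
    """Return a (precision, recall, f1) tuple for every class."""
    n = len(matrix)
    results = []
    for k in range(n):
        true_pos = matrix[k][k]
        predicted = sum(matrix[row][k] for row in range(n))  # column total
        actual = sum(matrix[k])                              # row total (support)
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


metrics = per_class_metrics(CONFUSION)
accuracy = sum(CONFUSION[k][k] for k in range(5)) / sum(map(sum, CONFUSION))

for name, (p, r, f1) in zip(CLASSES, metrics):
    print(f"{name:<11} precision={p:.0%} recall={r:.0%} f1={f1:.0%}")
print(f"accuracy={accuracy:.2%}")
```

Under this reading, all of the errors come from confusion between Spyware and Worm, which matches the table: the other three classes score 100% on every metric.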

Training Details

  • Dataset Size: 500 samples (100 per class)
  • Training Split: 80% train, 20% validation
  • Epochs: 30
  • Batch Size: 32
  • Optimizer: Adam (lr=0.001)
  • Training Time: ~5-10 minutes (CPU)
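
The split and batch figures above can be checked with a small sketch. It assumes the 80/20 split is stratified per class, which the Support column of 20 per class suggests but the README does not state; `stratified_split` is a hypothetical helper, not a function from main.py.

```python
import math
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=42):
    """Split sample indices so each class keeps the same validation share."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx


families = ("benign", "ransomware", "spyware", "trojan", "worm")
labels = [family for family in families for _ in range(100)]

train_idx, val_idx = stratified_split(labels)
batches_per_epoch = math.ceil(len(train_idx) / 32)
print(len(train_idx), len(val_idx), batches_per_epoch)  # 400 100 13
```

With 400 training samples and a batch size of 32, each epoch runs 13 batches, the last of which holds only 16 samples.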

🔧 Advanced Usage

Custom Feature Extraction

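from main import PEAnalyzer  # assumes PEAnalyzer is defined in main.py

class CustomAnalyzer(PEAnalyzer):
    def extract_custom_features(self, file_path):
        features = super().extract_features(file_path)
        if not features:  # extraction failed, nothing to extend
            return features
        # Add your custom features; calculate_custom_metric is a placeholder you implement
        features['custom_metric'] = self.calculate_custom_metric(file_path)
        return features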

GPU Acceleration

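# Use the GPU when one is available, otherwise fall back to the CPU
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
classifier = MalwareClassifier(device=device)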

Model Export

# Save model
classifier.save_model("production_model")

# Load model
classifier.load_model("production_model")

🤝 Contributing

  • Contributions are welcome! Please feel free to submit a Pull Request.

Development Setup

git clone https://github.com/melisasvr/AI-Driven-Malware-Analysis-and-Classification.git
cd AI-Driven-Malware-Analysis-and-Classification
pip install -r requirements.txt

Areas for Contribution

  • Add dynamic analysis capabilities
  • Implement YARA rule integration
  • Add support for more file formats (ELF, Mach-O)
  • Improve feature extraction
  • Add explainability (LIME, SHAP)

📄 License

  • This project is licensed under the MIT License - see the LICENSE file for details.

⚠️ Disclaimer

  • This tool is for educational and research purposes only. The authors are not responsible for any misuse or damage caused by this program. Always handle malware samples in isolated, controlled environments.

πŸ™ Acknowledgments

  • PyTorch team for the deep learning framework
  • scikit-learn for machine learning utilities
  • Chart.js for dashboard visualizations
  • Capstone for disassembly engine

📧 Contact

  • For questions or suggestions, please open an issue on GitHub.

Built with ❤️ for cybersecurity research
