🛡️ AI-Driven Malware Analysis and Classification

A deep learning-based malware analysis tool that uses PyTorch to classify Windows PE files into malware families through static analysis: PE header features, entropy, suspicious strings, and byte-level patterns.

📋 Table of Contents

Features
Installation
Quick Start
Project Structure
Usage
Model Architecture
Dashboard
Dataset
Results
Contributing
License

✨ Features

Deep Learning Classification: Hybrid CNN architecture combining static features and byte sequences
Multi-Family Detection: Classifies samples into 5 classes: four malware families (Trojan, Ransomware, Worm, Spyware) plus Benign
PE File Analysis: Extracts features from Windows PE executables
Entropy Analysis: Detects packed/encrypted malware
String Analysis: Identifies suspicious keywords and patterns
Interactive Dashboard: Auto-generated HTML dashboard with interactive, real-time charts
High Accuracy: Reaches 93% validation accuracy on the synthetic mock dataset
No External APIs: Fully offline analysis

🚀 Installation

Prerequisites

Python 3.8 or higher
pip package manager

Required Dependencies

pip install torch numpy pandas scikit-learn

Optional Dependencies

# For disassembly analysis (optional)
pip install capstone

⚡ Quick Start

1. Generate Mock Dataset

Create a synthetic malware dataset for testing:

python mock_dataset_generator.py --samples 100

This creates:

dataset/
├── trojan/       (100 samples)
├── ransomware/   (100 samples)
├── worm/         (100 samples)
├── spyware/      (100 samples)
└── benign/       (100 samples)
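
Before training, it can help to confirm that the generator produced the layout above. The following is a minimal sketch that assumes the `dataset/<family>/` layout shown; `count_samples` is a hypothetical helper and not part of the project.

```python
from pathlib import Path


def count_samples(dataset_dir="dataset"):
    """Return {family_name: number_of_files} for each class subdirectory."""
    root = Path(dataset_dir)
    return {
        family.name: sum(1 for p in family.iterdir() if p.is_file())
        for family in sorted(root.iterdir())
        if family.is_dir()
    }


# Example call once the dataset exists: print(count_samples("dataset"))
```

With `--samples 100`, each of the five families should report 100 files.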

2. Train the Model

Train the classifier on the generated dataset:

python main.py

Expected output:

============================================================
AI-Driven Malware Analysis Tool (PyTorch Edition)
============================================================
Using device: cpu

📂 Loading dataset...
   ✓ Loaded 100 trojan samples
   ✓ Loaded 100 ransomware samples
   ...

Starting training...
Epoch [30/30] | Train Acc: 85.50% | Val Acc: 93.00%

✓ Training complete! Model saved as 'malware_classifier'
✓ Dashboard generated: index.html

3. View Results

Open index.html in your browser to see:
Performance metrics (Accuracy, Precision, Recall, F1-Score)
Training history charts
Confusion matrix
Per-class performance
Entropy analysis

4. Analyze New Files

from main import MalwareAnalyzer

# Load trained model
analyzer = MalwareAnalyzer()
analyzer.classifier.load_model("malware_classifier")

# Analyze a file
report = analyzer.analyze_file("suspicious_file.exe")

# Export report
analyzer.export_report(report, "analysis_report.json")

📁 Project Structure

AI-Driven-Malware-Analysis/
│
├── main.py                      # Main malware analyzer
├── mock_dataset_generator.py    # Generate synthetic malware samples
├── dashboard_generator.py       # Auto-generate HTML dashboard
├── index.html                   # Interactive dashboard (auto-generated)
├── README.md                    # This file
│
├── dataset/                     # Training dataset
│   ├── trojan/
│   ├── ransomware/
│   ├── worm/
│   ├── spyware/
│   └── benign/
│
├── test_samples/                # Test samples for analysis
│
├── malware_classifier_model.pth        # Trained model weights
└── malware_classifier_preprocessors.pkl # Scalers and encoders

💻 Usage

Training a Model

from main import MalwareAnalyzer
from pathlib import Path

analyzer = MalwareAnalyzer()

# Load dataset
features_list = []
labels = []

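for family_dir in sorted(Path("dataset").iterdir()):
    if not family_dir.is_dir():
        continue
    for file_path in family_dir.glob("*"):
        if not file_path.is_file():  # skip nested directories
            continue
        features = analyzer.pe_analyzer.extract_features(file_path)
        if features:  # skip files whose extraction returned nothing
            features_list.append(features)
            labels.append(family_dir.name)  # directory name is the class label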

# Train
analyzer.classifier.train(features_list, labels, epochs=30)

# Save
analyzer.classifier.save_model("my_model")

Analyzing Files

# Single file analysis
report = analyzer.analyze_file("sample.exe")

print(f"Predicted: {report['predicted_class']}")
print(f"Confidence: {report['confidence']:.2%}")
print(f"Entropy: {report['entropy']:.2f}")
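
The report fields printed above can drive a simple triage rule. This is a sketch only: `triage` is a hypothetical helper, and the 0.80 confidence cut-off and the 7.0 bits-per-byte entropy cut-off are illustrative assumptions (entropy close to the 8.0 maximum is a common hint of packing or encryption). Only the `predicted_class`, `confidence`, and `entropy` keys are taken from the example above.

```python
def triage(report, min_confidence=0.80, packed_entropy=7.0):
    """Map a report dict to a coarse action label (thresholds are assumptions)."""
    notes = []
    if report["entropy"] >= packed_entropy:
        notes.append("possibly packed or encrypted")
    if report["confidence"] < min_confidence:
        return "manual-review", notes
    if report["predicted_class"].lower() == "benign":
        return "allow", notes
    return "quarantine", notes


# Hypothetical report containing the three keys used in the example above
sample_report = {"predicted_class": "ransomware", "confidence": 0.97, "entropy": 7.8}
action, notes = triage(sample_report)
print(action, notes)
```

Low-confidence predictions are routed to manual review first, so an uncertain "benign" verdict is never silently allowed.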

Batch Analysis

# Analyze entire directory
results = analyzer.batch_analyze("malware_samples/", recursive=True)

# Export all results
for i, report in enumerate(results):
    analyzer.export_report(report, f"report_{i}.json")
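
Besides exporting one JSON file per sample, a batch run is often easier to read as a per-class tally. The sketch below assumes `batch_analyze` returns a list of report dicts with the same `predicted_class` and `confidence` keys as `analyze_file`; `summarize` is a hypothetical helper and the sample reports are made up.

```python
from collections import Counter


def summarize(reports):
    """Count predictions per class and average the confidence per class."""
    counts = Counter(r["predicted_class"] for r in reports)
    mean_conf = {
        cls: sum(r["confidence"] for r in reports if r["predicted_class"] == cls) / n
        for cls, n in counts.items()
    }
    return counts, mean_conf


# Hypothetical reports standing in for the output of batch_analyze()
reports = [
    {"predicted_class": "trojan", "confidence": 0.91},
    {"predicted_class": "trojan", "confidence": 0.85},
    {"predicted_class": "benign", "confidence": 0.99},
]
counts, mean_conf = summarize(reports)
for cls, n in counts.most_common():
    print(f"{cls:<12} {n:>3} files  mean confidence {mean_conf[cls]:.2%}")
```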

🧠 Model Architecture

Hybrid CNN Architecture

The classifier uses a hybrid deep learning architecture:

Static Features Branch:

Input: 7 numerical features (file size, entropy, string counts, etc.)
Architecture: Dense layers (64 → 32 neurons)
Activation: ReLU with Dropout (0.3)

Byte Sequence Branch:

Input: 256-bin byte histogram (frequency of each byte value 0-255)
Architecture: 1D CNN (64 → 32 filters)
Pooling: MaxPooling + AdaptiveMaxPooling

Merged Classification:

Combined features: 64 dimensions
Output: 5 classes (softmax)
Optimizer: Adam
Loss: Cross-Entropy
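
The layer sizes above fit together as follows: each branch ends in 32 features, and concatenating them gives the 64-dimensional merged vector. The module below is a minimal sketch of that layout, not the code in main.py. The class name `HybridMalwareNet`, the kernel size of 3, the padding, the pooling window of 2, and the placement of dropout are all assumptions.

```python
import torch
import torch.nn as nn


class HybridMalwareNet(nn.Module):
    """Two-branch sketch: dense static branch + 1D CNN over the byte histogram."""

    def __init__(self, n_static=7, n_classes=5, dropout=0.3):
        super().__init__()
        # Static branch: 7 numeric features -> 64 -> 32
        self.static_branch = nn.Sequential(
            nn.Linear(n_static, 64), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(64, 32), nn.ReLU(), nn.Dropout(dropout),
        )
        # Byte branch: histogram treated as a 1-channel sequence of length 256
        self.byte_branch = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),  # collapse the length axis to 1
        )
        # Merged head: 32 + 32 = 64 features -> 5 class logits
        self.head = nn.Linear(64, n_classes)

    def forward(self, static_x, hist_x):
        s = self.static_branch(static_x)                    # (batch, 32)
        b = self.byte_branch(hist_x.unsqueeze(1))           # (batch, 32, 1)
        merged = torch.cat([s, b.squeeze(-1)], dim=1)       # (batch, 64)
        return self.head(merged)                            # raw logits


model = HybridMalwareNet()
criterion = nn.CrossEntropyLoss()                           # applies log-softmax itself
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

model.eval()
with torch.no_grad():
    logits = model(torch.randn(4, 7), torch.rand(4, 256))
    probs = torch.softmax(logits, dim=1)                    # softmax only at inference
print(logits.shape)  # torch.Size([4, 5])
```

In PyTorch, `nn.CrossEntropyLoss` expects raw logits, so the softmax listed under "Output" is applied only when turning predictions into probabilities.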

Feature Extraction

PE Header Parsing: DOS/PE headers, sections, characteristics
Entropy Calculation: Shannon entropy for packing detection
String Analysis: ASCII/Unicode strings with suspicious keyword matching
Byte Histogram: Frequency distribution of byte values
N-gram Analysis: Bigram patterns
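
Most of these features need only the raw bytes of the file. The following standard-library sketch shows one way to compute the byte histogram, Shannon entropy, printable strings, and bigram counts. The function names and the keyword list are illustrative assumptions; the project's actual keyword list and PE header parsing are not shown in this README.

```python
import math
import re
from collections import Counter

# Illustrative keyword list (assumption, not the project's list)
SUSPICIOUS_KEYWORDS = ("createremotethread", "virtualalloc", "keylog", "encrypt", "ransom")


def byte_histogram(data: bytes):
    """Return 256 relative frequencies, one per byte value 0..255."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    total = len(data) or 1  # avoid division by zero for empty input
    return [c / total for c in counts]


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, between 0.0 and 8.0."""
    return -sum(p * math.log2(p) for p in byte_histogram(data) if p > 0)


def ascii_strings(data: bytes, min_len: int = 4):
    """Extract runs of printable ASCII characters of at least min_len."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]


def suspicious_hits(strings):
    """Count strings that contain at least one suspicious keyword."""
    return sum(any(k in s.lower() for k in SUSPICIOUS_KEYWORDS) for s in strings)


def top_bigrams(data: bytes, n: int = 5):
    """Return the n most common adjacent byte pairs with their counts."""
    return Counter(zip(data, data[1:])).most_common(n)


sample = b"MZ\x90\x00CreateRemoteThread\x00VirtualAlloc\x00" + bytes(range(256))
print(f"entropy: {shannon_entropy(sample):.2f} bits/byte")
print("suspicious strings:", suspicious_hits(ascii_strings(sample)))
```

Entropy near 8 bits per byte means the bytes are close to uniformly distributed, which is why it works as a packing and encryption indicator.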

📊 Dashboard

The auto-generated HTML dashboard includes:

Visualizations

Training History: Loss and accuracy curves
Confusion Matrix: Predicted vs. actual class counts for each family
Class Distribution: Dataset balance
Entropy Analysis: Distribution across families
ROC Curves: Per-class performance

Metrics

Overall accuracy, precision, recall, F1-score
Per-class performance breakdown
Recent analysis results
Training summary statistics

Features

Responsive design (mobile-friendly)
Interactive charts (Chart.js)
Real-time data updates
Gradient animations
Fully self-contained (offline-capable)

📦 Dataset

Mock Dataset

The included mock dataset generator creates synthetic samples with realistic characteristics:
Trojan: Suspicious API calls, remote thread creation, registry modifications
Ransomware: High entropy (simulated encryption), ransom notes, crypto keywords
Worm: Network strings, propagation indicators, scanning patterns
Spyware: Keylogger strings, screenshot APIs, monitoring functions
Benign: Low entropy, legitimate system strings
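
To illustrate how such characteristics can be simulated, the sketch below builds two toy samples and compares their entropy. It is not the code in mock_dataset_generator.py: `make_mock_sample`, the marker strings, and the 4096-byte size are assumptions chosen for the example.

```python
import math
import random
from collections import Counter


def entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte."""
    n = len(data)
    if n == 0:
        return 0.0
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())


def make_mock_sample(family: str, size: int = 4096, seed: int = 0) -> bytes:
    """Build a toy sample: 'MZ' stub, family marker strings, then filler bytes."""
    rng = random.Random(seed)
    markers = {
        "ransomware": b"YOUR FILES ARE ENCRYPTED\x00AES\x00",
        "benign": b"Copyright (C) Example Corp\x00",
    }
    header = b"MZ" + markers.get(family, b"")
    fill = size - len(header)
    if family == "ransomware":
        # Uniform random bytes imitate encrypted content (high entropy)
        body = bytes(rng.getrandbits(8) for _ in range(fill))
    else:
        # Repetitive text keeps the entropy low
        filler = b"This program cannot be run in DOS mode. "
        body = (filler * (fill // len(filler) + 1))[:fill]
    return header + body


for family in ("ransomware", "benign"):
    data = make_mock_sample(family)
    print(f"{family:<11} {len(data)} bytes  entropy {entropy(data):.2f}")
```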

Real Malware Samples

For production use, obtain real malware samples or labeled datasets from:

VirusShare: https://virusshare.com/
MalwareBazaar: https://bazaar.abuse.ch/
EMBER Dataset: Endgame Malware BEnchmark for Research (pre-extracted PE features)
SOREL-20M: Sophos-ReversingLabs dataset

⚠️ Warning: Handle real malware samples in isolated environments only!

📈 Results

Model Performance (Mock Dataset)

Metric             Value
Overall Accuracy   93.00%
Precision          93.20%
Recall             93.00%
F1-Score           93.00%

Per-Class Performance

Class        Precision   Recall   F1-Score   Support
Benign       100%        100%     100%       20
Ransomware   100%        100%     100%       20
Spyware      78%         90%      84%        20
Trojan       100%        100%     100%       20
Worm         88%         75%      81%        20
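
Per-class precision, recall, and F1 all derive from the confusion matrix. The sketch below uses a hypothetical matrix whose counts were chosen so that the rounded per-class values match the table above; it is not taken from the project's output, and `per_class_metrics` is an illustrative helper.

```python
CLASSES = ["Benign", "Ransomware", "Spyware", "Trojan", "Worm"]

# Hypothetical confusion matrix (rows = true class, columns = predicted class)
CONFUSION = [
    [20, 0, 0, 0, 0],
    [0, 20, 0, 0, 0],
    [0, 0, 18, 0, 2],
    [0, 0, 0, 20, 0],
    [0, 0, 5, 0, 15],
]


def per_class_metrics(cm):
    """Return {class_index: (precision, recall, f1, support)} for a square matrix."""
    metrics = {}
    for i in range(len(cm)):
        tp = cm[i][i]
        predicted = sum(row[i] for row in cm)  # column sum: TP + FP
        actual = sum(cm[i])                    # row sum: TP + FN
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[i] = (precision, recall, f1, actual)
    return metrics


correct = sum(CONFUSION[i][i] for i in range(len(CONFUSION)))
accuracy = correct / sum(map(sum, CONFUSION))
print(f"accuracy: {accuracy:.2%}")
for i, (p, r, f1, n) in per_class_metrics(CONFUSION).items():
    print(f"{CLASSES[i]:<11} {p:>5.0%} {r:>5.0%} {f1:>5.0%} {n:>4}")
```

In this illustration, all errors are confusions between Spyware and Worm, which is the pattern the table suggests since the other three classes score 100%.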

Training Details

Dataset Size: 500 samples (100 per class)
Training Split: 80% train, 20% validation
Epochs: 30
Batch Size: 32
Optimizer: Adam (lr=0.001)
Training Time: ~5-10 minutes (CPU)
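
The README does not say whether the 80/20 split is stratified, but a Support of 20 per class in the results is what a stratified split of 100 samples per class would give. A minimal standard-library sketch of such a split follows; `stratified_split` and the seed value are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=42):
    """Return (train_idx, val_idx), keeping each class's share in both sets."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx


families = ("benign", "ransomware", "spyware", "trojan", "worm")
labels = [fam for fam in families for _ in range(100)]
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))  # 400 100
```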

🔧 Advanced Usage

Custom Feature Extraction

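from main import PEAnalyzer  # assumption: PEAnalyzer is defined in main.py


class CustomAnalyzer(PEAnalyzer):
    def extract_custom_features(self, file_path):
        features = super().extract_features(file_path)
        if not features:  # extraction returned nothing; there is nothing to extend
            return features
        # Add your custom features (calculate_custom_metric is a placeholder you implement)
        features['custom_metric'] = self.calculate_custom_metric(file_path)
        return features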

GPU Acceleration

import torch

# Use the GPU when one is available, otherwise fall back to the CPU
classifier = MalwareClassifier(device='cuda' if torch.cuda.is_available() else 'cpu')

Model Export

# Save model
classifier.save_model("production_model")

# Load model
classifier.load_model("production_model")
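
The project structure lists `malware_classifier_model.pth` and `malware_classifier_preprocessors.pkl`, which suggests that `save_model(prefix)` writes two files. The sketch below shows that pattern under this assumption; `save_bundle` and `load_bundle` are hypothetical names, and the actual implementation in main.py may differ.

```python
import pickle

import torch
import torch.nn as nn


def save_bundle(model: nn.Module, preprocessors: dict, prefix: str) -> None:
    """Write weights to <prefix>_model.pth and preprocessors to <prefix>_preprocessors.pkl."""
    torch.save(model.state_dict(), f"{prefix}_model.pth")
    with open(f"{prefix}_preprocessors.pkl", "wb") as fh:
        pickle.dump(preprocessors, fh)


def load_bundle(model: nn.Module, prefix: str, device: str = "cpu") -> dict:
    """Load weights into an already-constructed model and return the preprocessors."""
    state = torch.load(f"{prefix}_model.pth", map_location=device)
    model.load_state_dict(state)
    model.eval()  # disable dropout for inference
    with open(f"{prefix}_preprocessors.pkl", "rb") as fh:
        return pickle.load(fh)  # only unpickle files you created yourself
```

Saving the `state_dict` rather than the whole model object means the model class must be constructed before loading, but the weights file stays independent of code layout.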

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Development Setup

git clone https://github.com/yourusername/malware-analysis.git
cd malware-analysis
pip install torch numpy pandas scikit-learn

Areas for Contribution

Add dynamic analysis capabilities
Implement YARA rule integration
Add support for more file formats (ELF, Mach-O)
Improve feature extraction
Add explainability (LIME, SHAP)

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

⚠️ Disclaimer

This tool is for educational and research purposes only. The authors are not responsible for any misuse or damage caused by this program. Always handle malware samples in isolated, controlled environments.

🙏 Acknowledgments

PyTorch team for the deep learning framework
scikit-learn for machine learning utilities
Chart.js for dashboard visualizations
Capstone for disassembly engine

📧 Contact

For questions or suggestions, please open an issue on GitHub.

Built with ❤️ for cybersecurity research
