A deep learning-based malware analysis tool that uses PyTorch to classify malware families through static analysis and behavioral pattern recognition.
- Features
- Installation
- Quick Start
- Project Structure
- Usage
- Model Architecture
- Dashboard
- Dataset
- Results
- Contributing
- License
- Deep Learning Classification: Hybrid CNN architecture combining static features and byte sequences
- Multi-Family Detection: Classifies samples into 5 classes: four malware families (Trojan, Ransomware, Worm, Spyware) plus Benign
- PE File Analysis: Extracts features from Windows PE executables
- Entropy Analysis: Detects packed/encrypted malware
- String Analysis: Identifies suspicious keywords and patterns
- Interactive Dashboard: Auto-generated HTML dashboard with interactive, real-time charts
- High Accuracy: Reaches 93% validation accuracy on the synthetic mock dataset
- No External APIs: Fully offline analysis
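As an illustration of what PE file analysis starts from, the sketch below reads the DOS and COFF headers of a PE image with the standard library only. It is not the project's `PEAnalyzer` code; `parse_pe_header` and `build_minimal_pe` are hypothetical helpers written for this example, and the synthetic header exists only so the snippet runs without a real executable.

```python
import struct


def parse_pe_header(data: bytes):
    """Return basic COFF header fields, or None if data is not a PE image."""
    # The DOS header starts with b"MZ"; the 4-byte field at 0x3C (e_lfanew)
    # holds the file offset of the PE signature.
    if len(data) < 0x40 or data[:2] != b"MZ":
        return None
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if pe_offset + 24 > len(data) or data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        return None
    # The 20-byte COFF file header follows the 4-byte signature.
    machine, num_sections, timestamp, _symtab, _nsyms, opt_size, characteristics = (
        struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    )
    return {
        "machine": machine,
        "num_sections": num_sections,
        "timestamp": timestamp,
        "optional_header_size": opt_size,
        "characteristics": characteristics,
    }


def build_minimal_pe(num_sections: int = 3) -> bytes:
    """Build a synthetic DOS + COFF header (not a loadable executable)."""
    dos = bytearray(0x40)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x40)  # PE signature right after the DOS header
    # 0x14C = i386 machine type; 0x0102 = executable image | 32-bit machine
    coff = struct.pack("<HHIIIHH", 0x14C, num_sections, 0, 0, 0, 0, 0x0102)
    return bytes(dos) + b"PE\x00\x00" + coff


info = parse_pe_header(build_minimal_pe(3))
print(info)
```

Returning `None` for non-PE input mirrors the `if features:` check used in the training example, but whether `extract_features` behaves this way is an assumption.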
- Python 3.8 or higher
- pip package manager
pip install torch numpy pandas scikit-learn
# For disassembly analysis (optional)
pip install capstone
- Create a synthetic malware dataset for testing:
python mock_dataset_generator.py --samples 100
This creates:
dataset/
├── trojan/ (100 samples)
├── ransomware/ (100 samples)
├── worm/ (100 samples)
├── spyware/ (100 samples)
└── benign/ (100 samples)
- Train the classifier on the generated dataset:
python main.py
Expected output:
============================================================
AI-Driven Malware Analysis Tool (PyTorch Edition)
============================================================
Using device: cpu
Loading dataset...
Loaded 100 trojan samples
Loaded 100 ransomware samples
...
Starting training...
Epoch [30/30] | Train Acc: 85.50% | Val Acc: 93.00%
Training complete! Model saved as 'malware_classifier'
Dashboard generated: index.html
- Open index.html in your browser to see:
- Performance metrics (Accuracy, Precision, Recall, F1-Score)
- Training history charts
- Confusion matrix
- Per-class performance
- Entropy analysis
from main import MalwareAnalyzer
# Load trained model
analyzer = MalwareAnalyzer()
analyzer.classifier.load_model("malware_classifier")
# Analyze a file
report = analyzer.analyze_file("suspicious_file.exe")
# Export report
analyzer.export_report(report, "analysis_report.json")
AI-Driven-Malware-Analysis/
│
├── main.py                               # Main malware analyzer
├── mock_dataset_generator.py             # Generate synthetic malware samples
├── dashboard_generator.py                # Auto-generate HTML dashboard
├── index.html                            # Interactive dashboard (auto-generated)
├── README.md                             # This file
│
├── dataset/                              # Training dataset
│   ├── trojan/
│   ├── ransomware/
│   ├── worm/
│   ├── spyware/
│   └── benign/
│
├── test_samples/                         # Test samples for analysis
│
├── malware_classifier_model.pth          # Trained model weights
└── malware_classifier_preprocessors.pkl  # Scalers and encoders
from main import MalwareAnalyzer
from pathlib import Path
analyzer = MalwareAnalyzer()
# Load dataset
features_list = []
labels = []
for family_dir in Path("dataset").iterdir():
    if family_dir.is_dir():
        for file_path in family_dir.glob("*"):
            features = analyzer.pe_analyzer.extract_features(file_path)
            if features:
                features_list.append(features)
                labels.append(family_dir.name)
# Train
analyzer.classifier.train(features_list, labels, epochs=30)
# Save
analyzer.classifier.save_model("my_model")
# Single file analysis
report = analyzer.analyze_file("sample.exe")
print(f"Predicted: {report['predicted_class']}")
print(f"Confidence: {report['confidence']:.2%}")
print(f"Entropy: {report['entropy']:.2f}")
# Analyze entire directory
results = analyzer.batch_analyze("malware_samples/", recursive=True)
# Export all results
for i, report in enumerate(results):
    analyzer.export_report(report, f"report_{i}.json")
- The classifier uses a hybrid deep learning architecture:
Static Features Branch:
- Input: 7 numerical features (file size, entropy, string counts, etc.)
- Architecture: Dense layers (64 → 32 neurons)
- Activation: ReLU with Dropout (0.3)
Byte Sequence Branch:
- Input: 256-bin byte histogram (one bin per byte value)
- Architecture: 1D CNN (64 → 32 filters)
- Pooling: MaxPooling + AdaptiveMaxPooling
Merged Classification:
- Combined features: 64 dimensions
- Output: 5 classes (softmax)
- Optimizer: Adam
- Loss: Cross-Entropy
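The two-branch layout described above could look roughly like the following PyTorch module. This is a minimal sketch, not the code in `main.py`: the class name `HybridMalwareNet`, the kernel sizes, the padding, and the exact placement of dropout are assumptions chosen only so that the layer widths (64 → 32 per branch, 64 merged, 5 outputs) match the description.

```python
import torch
import torch.nn as nn


class HybridMalwareNet(nn.Module):
    """Hypothetical sketch of a static-feature + byte-histogram classifier."""

    def __init__(self, n_static: int = 7, n_classes: int = 5):
        super().__init__()
        # Static features branch: dense 64 -> 32 with ReLU and dropout 0.3
        self.static_branch = nn.Sequential(
            nn.Linear(n_static, 64), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(64, 32), nn.ReLU(),
        )
        # Byte branch: 1D CNN with 64 -> 32 filters (kernel size 3 is an assumption)
        self.byte_branch = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        # Merged head: 32 + 32 = 64 features -> class logits
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, static_x: torch.Tensor, hist_x: torch.Tensor) -> torch.Tensor:
        s = self.static_branch(static_x)                        # (batch, 32)
        b = self.byte_branch(hist_x.unsqueeze(1)).squeeze(-1)   # (batch, 32)
        return self.classifier(torch.cat([s, b], dim=1))        # (batch, 5) logits


model = HybridMalwareNet()
model.eval()  # disable dropout for a deterministic forward pass
with torch.no_grad():
    logits = model(torch.randn(4, 7), torch.rand(4, 256))
    probs = torch.softmax(logits, dim=1)
print(logits.shape, probs.sum(dim=1))
```

The sketch returns raw logits because `nn.CrossEntropyLoss` applies log-softmax internally; softmax is applied only when class probabilities are needed, for example to report a confidence value.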
- PE Header Parsing: DOS/PE headers, sections, characteristics
- Entropy Calculation: Shannon entropy for packing detection
- String Analysis: ASCII/Unicode strings with suspicious keyword matching
- Byte Histogram: Frequency distribution of byte values
- N-gram Analysis: Bigram patterns
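Several of the features listed above can be computed with the standard library alone. The sketch below shows one way to do it; the function names and the `SUSPICIOUS_KEYWORDS` tuple are hypothetical and are not taken from the project's `PEAnalyzer`.

```python
import math
import re
from collections import Counter

# Hypothetical keyword list, for illustration only
SUSPICIOUS_KEYWORDS = (b"createremotethread", b"keylog", b"encrypt")
ASCII_STRING = re.compile(rb"[\x20-\x7e]{4,}")  # printable runs of 4+ bytes


def byte_histogram(data: bytes) -> list:
    """Relative frequency of each of the 256 possible byte values."""
    counts = [0] * 256
    for value in data:
        counts[value] += 1
    total = len(data) or 1
    return [c / total for c in counts]


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, 8.0 for uniform bytes."""
    return -sum(p * math.log2(p) for p in byte_histogram(data) if p > 0)


def bigram_counts(data: bytes) -> Counter:
    """Count adjacent byte pairs."""
    return Counter(zip(data, data[1:]))


def suspicious_strings(data: bytes) -> list:
    """Printable ASCII strings that contain one of the keywords (case-insensitive)."""
    return [
        s for s in ASCII_STRING.findall(data)
        if any(k in s.lower() for k in SUSPICIOUS_KEYWORDS)
    ]


sample = b"\x00\x01CreateRemoteThread\x00hello\x00"
print(round(shannon_entropy(bytes(range(256))), 2))  # uniform bytes give the maximum
print(suspicious_strings(sample))
```

Values close to 8 bits per byte are typical of compressed or encrypted content, which is why entropy is commonly used as a packing indicator.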
- The auto-generated HTML dashboard includes:
- Training History: Loss and accuracy curves
- Confusion Matrix: Classification accuracy per family
- Class Distribution: Dataset balance
- Entropy Analysis: Distribution across families
- ROC Curves: Per-class performance
- Overall accuracy, precision, recall, F1-score
- Per-class performance breakdown
- Recent analysis results
- Training summary statistics
- Responsive design (mobile-friendly)
- Interactive charts (Chart.js)
- Real-time data updates
- Gradient animations
- Fully self-contained (offline-capable)
- The included mock dataset generator creates synthetic samples with realistic characteristics:
- Trojan: Suspicious API calls, remote thread creation, registry modifications
- Ransomware: High entropy (simulated encryption), ransom notes, crypto keywords
- Worm: Network strings, propagation indicators, scanning patterns
- Spyware: Keylogger strings, screenshot APIs, monitoring functions
- Benign: Low entropy, legitimate system strings
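To make the entropy contrast between families concrete, here is a small sketch of how a synthetic high-entropy "ransomware" sample and a low-entropy "benign" sample might be produced. It is not the code of `mock_dataset_generator.py`; `make_sample`, the marker strings, and the 4096-byte size are assumptions for illustration.

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte."""
    n = len(data)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())


def make_sample(family: str, size: int = 4096) -> bytes:
    """Build a hypothetical synthetic sample for one of two families."""
    if family == "ransomware":
        # Random bytes stand in for encrypted content, followed by a ransom-note marker
        body = os.urandom(size) + b"YOUR FILES HAVE BEEN ENCRYPTED"
    elif family == "benign":
        # Repeated plain text keeps the byte distribution narrow
        text = b"Microsoft Windows system component. "
        body = (text * (size // len(text) + 1))[:size]
    else:
        raise ValueError(f"unsupported family: {family}")
    return b"MZ" + body  # DOS magic so the sample resembles a PE file


ransomware_entropy = shannon_entropy(make_sample("ransomware"))
benign_entropy = shannon_entropy(make_sample("benign"))
print(f"ransomware: {ransomware_entropy:.2f} bits/byte, benign: {benign_entropy:.2f} bits/byte")
```

Samples this cleanly separated are far easier to classify than real malware, so metrics obtained on them should not be read as production accuracy.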
For production use, obtain real malware from:
- VirusShare: https://virusshare.com/
- MalwareBazaar: https://bazaar.abuse.ch/
- EMBER Dataset: Endgame Malware Benchmark
- SOREL-20M: Sophos-ReversingLabs dataset
| Metric | Value |
|---|---|
| Overall Accuracy | 93.00% |
| Precision | 93.20% |
| Recall | 93.00% |
| F1-Score | 93.00% |
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Benign | 100% | 100% | 100% | 20 |
| Ransomware | 100% | 100% | 100% | 20 |
| Spyware | 78% | 90% | 84% | 20 |
| Trojan | 100% | 100% | 100% | 20 |
| Worm | 88% | 75% | 81% | 20 |
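The overall precision, recall, and F1 figures appear to be the unweighted (macro) mean of the per-class rows. Because every class has a support of 20, the support-weighted mean gives the same numbers. The short check below recomputes them from the table values.

```python
# Per-class (precision, recall, f1) copied from the table above
per_class = {
    "Benign":     (1.00, 1.00, 1.00),
    "Ransomware": (1.00, 1.00, 1.00),
    "Spyware":    (0.78, 0.90, 0.84),
    "Trojan":     (1.00, 1.00, 1.00),
    "Worm":       (0.88, 0.75, 0.81),
}


def macro_average(index: int) -> float:
    """Unweighted mean of one metric column across all classes."""
    return sum(row[index] for row in per_class.values()) / len(per_class)


precision, recall, f1 = (macro_average(i) for i in range(3))
print(f"precision={precision:.1%} recall={recall:.1%} f1={f1:.1%}")
# prints: precision=93.2% recall=93.0% f1=93.0%
```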
- Dataset Size: 500 samples (100 per class)
- Training Split: 80% train, 20% validation
- Epochs: 30
- Batch Size: 32
- Optimizer: Adam (lr=0.001)
- Training Time: ~5-10 minutes (CPU)
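The configuration above implies the following sample and step counts. This is plain arithmetic under two assumptions that the README does not state: the split is stratified by class (which matches the support of 20 per class in the results), and the last partial batch of each epoch is kept.

```python
import math

total_samples = 500   # 100 per class, 5 classes
num_classes = 5
val_percent = 20
batch_size = 32
epochs = 30

n_val = total_samples * val_percent // 100            # validation samples
n_train = total_samples - n_val                       # training samples
val_per_class = n_val // num_classes                  # assumes a stratified split
batches_per_epoch = math.ceil(n_train / batch_size)   # assumes the partial batch is kept
total_steps = batches_per_epoch * epochs              # optimizer updates over training

print(n_train, n_val, val_per_class, batches_per_epoch, total_steps)
# prints: 400 100 20 13 390
```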
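# Assumes PEAnalyzer and MalwareClassifier are importable from main.py
import torch
from main import MalwareClassifier, PEAnalyzer

class CustomAnalyzer(PEAnalyzer):
    def extract_custom_features(self, file_path):
        features = super().extract_features(file_path)
        if features:
            # Add your custom features; calculate_custom_metric is a method you implement
            features['custom_metric'] = self.calculate_custom_metric(file_path)
        return features

# Automatic GPU detection: use CUDA when available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
classifier = MalwareClassifier(device=device)

# Save model
classifier.save_model("production_model")
# Load model
classifier.load_model("production_model")
- Contributions are welcome! Please feel free to submit a Pull Request.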
git clone https://github.com/yourusername/malware-analysis.git
cd malware-analysis
pip install -r requirements.txt
- Add dynamic analysis capabilities
- Implement YARA rule integration
- Add support for more file formats (ELF, Mach-O)
- Improve feature extraction
- Add explainability (LIME, SHAP)
- This project is licensed under the MIT License - see the LICENSE file for details.
- This tool is for educational and research purposes only. The authors are not responsible for any misuse or damage caused by this program. Always handle malware samples in isolated, controlled environments.
- PyTorch team for the deep learning framework
- scikit-learn for machine learning utilities
- Chart.js for dashboard visualizations
- Capstone for disassembly engine
- For questions or suggestions, please open an issue on GitHub.
Built with ❤️ for cybersecurity research