MarkCyber/CVEReapeR-ThreatOpsAI

CVEReapeR: An H2Oai Threat Intel Pipeline

Python 3.9+ | License: Custom | Code style: black

🔥 An end-to-end machine learning pipeline for CVE risk analysis. This tool takes in vulnerability data (such as NVD CVEs, CISA KEV, and ExploitDB), parses real log data or simulates it if you have no logs to provide, and then uses H2O's AutoML feature to predict and prioritize the most dangerous vulnerabilities in your environment.

👉🏼 ELI5 version: Give your smart AI friend a bunch of hacker crime reports and scary notes, then it goes “pew pew” on the bad guys (aka CVEs, a publicly disclosed "oops") so your systems don’t get robbed.


Project structure:

CVEReapeR-ThreatOpsAI/
│
├── data/            # CVE data (NVD JSONs, CISA KEV CSV, logs)
├── exploitdb/       # ExploitDB exploit metadata (CSV)
├── models/          # Trained models
├── notebooks/       # Optional Jupyter notebooks (Jupyter Notebook or JupyterLab)
├── outputs/         # Generated charts and risk report
├── run_analysis.py  # Main pipeline entrypoint, run this after setting config
├── config.yaml      # Configuration for data paths and parameters as well as email
└── .gitattributes   # Git LFS tracking for large CVE JSON files

Overview

CVEReapeR is functional, but it is still a work in progress. The goal is to automate triage beyond just threat control: it contextualizes vulnerabilities by exposure, asset type, and exploit availability to produce a prioritized, explainable list of threats tailored to your environment.

Key Features:

  • End-to-end workflow: Vulnerability scanner output and log data go in; a full risk-ranked report with explainability and next steps comes out.
  • H2O AutoML: Trains, tests, and selects the best model for risk classification.
  • Simulation option: If real logs are unavailable, CVEReapeR can simulate them.
  • Exploit-aware enrichment (thanks, ExploitDB): Joins CVE data with real-world exploit metadata from ExploitDB and, in the future, CISA KEV.
  • Explainable results: Offers explainability based on feature importance scores and rule-based logic.
  • Simple-to-read output: Markdown results that are easy to interpret, along with an email feature to send results to others immediately when needed.
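The exploit-aware enrichment step can be pictured as a left join on CVE ID. The sketch below is a minimal illustration using only the standard library. The field names (`cve_id`, `cvss`, `exploit_id`, `exploit_count`, `has_public_exploit`) and the sample rows are assumptions made for this example, not the project's actual schema.

```python
from collections import defaultdict


def enrich_with_exploits(cves, exploits):
    """Attach exploit counts to CVE records by matching on CVE ID.

    Both inputs are lists of dicts; the keys used here are hypothetical.
    """
    # Count exploit entries per CVE ID (one CVE can have several exploits).
    counts = defaultdict(int)
    for row in exploits:
        counts[row["cve_id"].upper()] += 1

    enriched = []
    for cve in cves:
        n = counts.get(cve["cve_id"].upper(), 0)
        # Left join: CVEs without a known exploit are kept with a zero count.
        enriched.append({**cve, "exploit_count": n, "has_public_exploit": n > 0})
    return enriched


# Illustrative rows only; IDs other than CVE-2021-44228 are placeholders.
cves = [
    {"cve_id": "CVE-2021-44228", "cvss": 10.0},
    {"cve_id": "CVE-0000-0001", "cvss": 5.0},
]
exploits = [
    {"cve_id": "cve-2021-44228", "exploit_id": "hypothetical-1"},
    {"cve_id": "CVE-2021-44228", "exploit_id": "hypothetical-2"},
]

for row in enrich_with_exploits(cves, exploits):
    print(row["cve_id"], row["exploit_count"], row["has_public_exploit"])
```

Normalizing the ID case before matching guards against feeds that spell the same CVE differently.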

Technologies Used

  • Machine Learning: H2O.ai (AutoML), GBM, XGBoost
  • Data Handling: Pandas, NumPy, YAML, JSON
  • Visualization: Matplotlib, Seaborn
  • Explainability: Feature importance and rule-based attribution
  • Reporting: Markdown + optional direct email
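For readers new to H2O, a rough sketch of an AutoML training run may help. It is not taken from `run_analysis.py`: the CSV path, the target column `risk_level`, and the dropped ID columns `cve_id` and `host` are assumptions. The H2O calls themselves (`h2o.init`, `h2o.import_file`, `split_frame`, `H2OAutoML`) are part of the public H2O Python API and need a Java runtime.

```python
def feature_columns(all_columns, target, drop=()):
    """Return predictor columns: everything except the target and dropped IDs."""
    excluded = {target, *drop}
    return [c for c in all_columns if c not in excluded]


def train_risk_model(csv_path, target="risk_level", max_models=10, seed=1):
    """Train an AutoML classifier on an enriched CVE table (hypothetical CSV)."""
    # Imported lazily so the helper above works without H2O or Java installed.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()  # starts or connects to a local H2O cluster
    frame = h2o.import_file(csv_path)
    frame[target] = frame[target].asfactor()  # treat the target as categorical
    train, test = frame.split_frame(ratios=[0.8], seed=seed)

    # Hypothetical ID columns are excluded so the model cannot memorize them.
    x = feature_columns(frame.columns, target, drop=("cve_id", "host"))
    aml = H2OAutoML(max_models=max_models, seed=seed)
    aml.train(x=x, y=target, training_frame=train)

    # aml.leader is the top-ranked model on the AutoML leaderboard.
    return aml.leader, aml.leader.model_performance(test)
```

Capping `max_models` keeps a first run short; raise it once the data pipeline is stable.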

💧Blue Team Use Cases💧

CVEReapeR was built with defenders in mind: analysts, threat hunters, and vulnerability managers who need to understand their security posture fast.

Defensive Applications:

  • Triage automation: Prioritize vulnerabilities based on exploitability, asset exposure, and log evidence.
  • Risk reduction: Contextual recommendations to support patching and network segmentation decisions.
  • Reporting: Share clean markdown reports or trigger email alerts for stakeholders.
  • Threat Hunting: Use log parsing and asset simulation to enrich vulnerability findings.

🩸Red Team Use Cases🩸

While CVEReapeR was initially designed for blue teams, its output can still be valuable for offensive teams simulating real-world adversaries.

Offensive Applications:

  • Scenario planning: Identify critical CVEs to use in assumed-breach or post-exploitation scenarios.
  • Exploit path prioritization: Rank vulnerable hosts by exploitability and service context.
  • Target selection for emulation: Pinpoint high-value targets for red team scenarios.
  • Payload strategy: Leverage exploit metadata to focus efforts on high-impact vulnerabilities.

Example Output

The final report shows prioritized CVEs with model explanations and visuals:

Top 5 Riskiest Hosts

[Chart: top 5 riskiest hosts]

Example CVE Prediction

host015.mil - CVE-2021-44228
AI Risk Level: Critical
Explanation: Public exploit exists, and the system is internet-exposed with log activity.
Recommended Action: Patch immediately. If delayed, isolate from exposed networks.
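The explanation and recommended action in the example above read like the output of rule-based attribution. A minimal sketch of that idea follows. The flag names (`has_public_exploit`, `internet_exposed`, `log_activity`) and the non-critical recommendation text are hypothetical, and the real pipeline may also combine these rules with feature importance scores.

```python
def explain(finding):
    """Build a plain-language explanation from boolean context flags."""
    reasons = []
    if finding.get("has_public_exploit"):
        reasons.append("public exploit exists")
    if finding.get("internet_exposed"):
        reasons.append("the system is internet-exposed")
    if finding.get("log_activity"):
        reasons.append("related log activity was observed")
    if not reasons:
        return "No aggravating context found."
    text = "; ".join(reasons)
    return text[0].upper() + text[1:] + "."


def recommend(risk_level):
    """Map a predicted risk level to a next step (wording is illustrative)."""
    if risk_level == "Critical":
        return "Patch immediately. If delayed, isolate from exposed networks."
    return "Schedule patching in the next maintenance window."


finding = {"has_public_exploit": True, "internet_exposed": True, "log_activity": True}
print("Explanation:", explain(finding))
print("Recommended Action:", recommend("Critical"))
```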


System Dependencies

This project requires pandoc for report generation. On macOS or Linux, pandoc can be installed via Homebrew (the first command installs Homebrew itself; skip it if Homebrew is already installed):

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install pandoc
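Before running the pipeline, it can be worth confirming that pandoc is actually on your PATH. The helper below is a small standard-library sketch for that check; the function name is made up for this example and is not part of the project.

```python
import shutil
import subprocess


def pandoc_version():
    """Return the first line of `pandoc --version`, or None if pandoc is missing."""
    path = shutil.which("pandoc")
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=True
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


version = pandoc_version()
if version is None:
    print("pandoc not found; report generation will likely fail")
else:
    print("found", version)
```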

Getting Started

1. Clone the Repository (Git LFS is required to download the CVE data files)

git lfs install
git clone https://github.com/markcyber/cvereaper-threatopsai
cd cvereaper-threatopsai
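If the repository is cloned without Git LFS set up, the large JSON files arrive as small pointer stubs instead of real data, and parsing them fails later. The sketch below detects such stubs by checking for the header line that Git LFS pointer files start with. The default `data` directory comes from the project layout; the function names are hypothetical.

```python
from pathlib import Path

# Git LFS pointer files begin with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """True if the file is a Git LFS pointer stub rather than real content."""
    with open(path, "rb") as fh:
        return fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def unfetched_files(data_dir="data"):
    """List JSON files under data_dir that are still LFS pointer stubs."""
    return sorted(
        str(p) for p in Path(data_dir).rglob("*.json") if is_lfs_pointer(p)
    )
```

If `unfetched_files()` returns anything, running `git lfs pull` inside the repository should fetch the real files.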

2. Install Dependencies

Python 3.9+ is highly recommended.

pip install -r requirements.txt

You may also need to install H2O if you have not already:

pip install -f https://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o

3. Modify the config

nano config.yaml
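A quick sanity check after editing the config can catch missing settings before a long run. The sketch below loads the file with PyYAML's `yaml.safe_load` and reports absent keys. The key names in `REQUIRED` are purely hypothetical placeholders; substitute the keys that actually appear in your `config.yaml`.

```python
def missing_keys(config, required):
    """Return the dotted keys from `required` that are absent in a nested dict."""
    missing = []
    for dotted in required:
        node = config
        for part in dotted.split("."):
            if not isinstance(node, dict) or part not in node:
                missing.append(dotted)
                break
            node = node[part]
    return missing


def load_config(path="config.yaml"):
    """Load the YAML config into a dict (an empty file yields an empty dict)."""
    import yaml  # PyYAML; imported lazily so missing_keys has no dependencies

    with open(path, encoding="utf-8") as fh:
        return yaml.safe_load(fh) or {}


# Hypothetical key names, for illustration only.
REQUIRED = ["paths.nvd_dir", "paths.kev_csv", "email.enabled"]
```

For example, `missing_keys(load_config(), REQUIRED)` returns an empty list when every listed key is present.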

4. Run the Pipeline

python run_analysis.py

Outputs are saved in the outputs/ directory.


Data Sources

All data used is publicly available, and usage complies with public/open data standards.


Notes

  • Large JSON files (>100MB) are managed using Git LFS
  • Trained model files in models/ are optional; you can remove or regenerate them
  • You can simulate log data or plug in real enterprise logs (CSV format)
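To make the log options concrete, here is a minimal standard-library sketch of simulating log rows and round-tripping them through CSV. The column names (`host`, `event`, `count`) and event types are assumptions for illustration; the simulator built into CVEReapeR and the CSV layout it expects may differ.

```python
import csv
import io
import random

FIELDS = ["host", "event", "count"]  # hypothetical column names
EVENTS = ["login_failure", "port_scan", "outbound_connection", "service_restart"]


def simulate_logs(hosts, n_rows, seed=0):
    """Generate reproducible fake log rows for the given hosts."""
    rng = random.Random(seed)  # seeded so repeated runs give the same rows
    return [
        {"host": rng.choice(hosts), "event": rng.choice(EVENTS), "count": rng.randint(1, 50)}
        for _ in range(n_rows)
    ]


def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def read_logs(csv_text):
    """Parse CSV log text back into dicts, converting count to int."""
    return [{**r, "count": int(r["count"])} for r in csv.DictReader(io.StringIO(csv_text))]


rows = simulate_logs(["host015.mil", "host002.mil"], n_rows=5, seed=42)
print(to_csv(rows))
```

Real enterprise logs exported to CSV can be read with the same `read_logs` pattern once the column names are adjusted.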

License

This project is licensed under a custom non-commercial license.
See the LICENSE file for full details.


Author

Made with ❤️ by markcyber
Special focus on red teaming, cybersecurity threat intelligence, and ML-based exploit prediction.

This project was developed with assistance from Gemini.

About

AI/ML-powered CVE hunting so you don't have to get your hands dirty (or your system pwned)
