Based on Awesome-Cybersecurity-Datasets, we aim to prepare a database of security datasets of ALL kinds with PREPROCESSING available :)
Let's collectively make our lives easier when we search for data to showcase our cool methods!!
We will worry about the directory structure later, but the following must be included with each note:
- The name of the dataset
- Relevant tags for the dataset (see below)
- A brief description of it
- The location or instructions on how to gain access to the dataset
- BibTeX citation for said dataset
- Pre-processing instructions for those who do not know how to use the dataset
- (Optional but preferred) PyTorch Dataset class for how to load the dataset - with Train/Val/Test split options
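
As a rough starting point for the optional loader, here is a minimal sketch of a map-style dataset with Train/Val/Test split options. Everything specific in it is an assumption for illustration: the class name `CsvSecurityDataset`, the idea that the preprocessed data is a single CSV file with a `label` column, and the 80/10/10 split ratios. Adapt it to whatever the dataset actually looks like.

```python
import csv
import random

try:
    from torch.utils.data import Dataset
except ImportError:
    # Fallback so the sketch can still be run without PyTorch installed.
    Dataset = object


class CsvSecurityDataset(Dataset):
    """Hypothetical map-style dataset over one preprocessed CSV file."""

    def __init__(self, csv_path, split="train", label_column="label",
                 ratios=(0.8, 0.1, 0.1), seed=0):
        if split not in ("train", "val", "test"):
            raise ValueError(f"unknown split: {split!r}")
        with open(csv_path, newline="") as handle:
            rows = list(csv.DictReader(handle))

        # Shuffle indices with a fixed seed so the three splits are
        # reproducible and never overlap.
        indices = list(range(len(rows)))
        random.Random(seed).shuffle(indices)

        n_train = int(ratios[0] * len(rows))
        n_val = int(ratios[1] * len(rows))
        bounds = {
            "train": (0, n_train),
            "val": (n_train, n_train + n_val),
            "test": (n_train + n_val, len(rows)),  # test takes the remainder
        }
        start, stop = bounds[split]
        self.rows = [rows[i] for i in indices[start:stop]]
        self.label_column = label_column

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        # Return the raw feature columns (as strings) and the label;
        # conversion to tensors is left to the dataset-specific note.
        row = dict(self.rows[index])
        label = row.pop(self.label_column)
        return row, label
```

Usage would look like `train = CsvSecurityDataset("data.csv", split="train")`, with the same `seed` passed for `"val"` and `"test"` so the splits stay disjoint. The path `data.csv` is a placeholder.
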
IMPORTANT STANDARDIZATION: This repo will only be useful if we tag the datasets accurately for easy lookup. Below are the tags of interest; if any of them change, the entire repository must be updated for consistency.
The tags are formatted to work in Obsidian, an organisational tool that can link MD files based on these tags in a cool UI.
Tags | Purpose |
---|---|
#network_traffic, #host | |
#urls, #domain_names | |
#malware, #binaries | |
#webapps, #software | |
#email, #fraud, #phishing, #passwords | |
#simulated/environment, #simulated/users, #real/attackers, #real/users | |
Still a work in progress...
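
Since the tags only help if they stay consistent across every note, a small checker can flag tags that are not in the table. The following is a minimal sketch, not an agreed-upon tool: the names `ALLOWED_TAGS`, `unknown_tags`, and `check_repo` are hypothetical, the tag set is copied by hand from the table above, and the regular expression is only an approximation of how Obsidian recognises tags.

```python
import re
from pathlib import Path

# Copied by hand from the tag table; update this set whenever the table changes.
ALLOWED_TAGS = {
    "#network_traffic", "#host",
    "#urls", "#domain_names",
    "#malware", "#binaries",
    "#webapps", "#software",
    "#email", "#fraud", "#phishing", "#passwords",
    "#simulated/environment", "#simulated/users",
    "#real/attackers", "#real/users",
}

# A '#' at the start of a word, immediately followed by a letter or underscore.
# Markdown headings ("# Title", "## Title") are not matched.
TAG_PATTERN = re.compile(r"(?<!\S)(#[A-Za-z_][\w/-]*)")


def unknown_tags(note_text):
    """Return the sorted tags in note_text that are not in ALLOWED_TAGS."""
    found = set(TAG_PATTERN.findall(note_text))
    return sorted(found - ALLOWED_TAGS)


def check_repo(root):
    """Map each Markdown file under root to its unknown tags, if it has any."""
    problems = {}
    for path in sorted(Path(root).rglob("*.md")):
        bad = unknown_tags(path.read_text(encoding="utf-8"))
        if bad:
            problems[str(path)] = bad
    return problems
```

Running `check_repo(".")` from the repository root would then list every note that uses a tag missing from the table, which is a cheap way to catch typos before they break lookups.
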