Malware-Research

Malware research with machine learning, conducted under the guidance of Professor Mark Stamp at SJSU. Results will be published in a paper and in this book on deep learning: http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=95376&copyownerid=101185

Goal: Use ensemble learning and a variety of models to classify malware samples into their respective families.

Process:

- Extract the names of all files to classify and group them by malware family.
- Use Radare2 to disassemble each file and write its opcode sequence to a text file.
- Create a large .csv file containing all the opcode data:
  - use the first 1000 opcodes of each sample as features for training
  - remove any malware samples that have fewer than 1000 opcodes or are corrupted
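
The CSV-building step above can be sketched roughly as follows. This is a minimal sketch rather than the project's actual script: it assumes the opcodes of each sample have already been extracted by the Radare2 step, and the names `build_rows`, `write_csv`, the `family` column, and the `op_i` column headers are all hypothetical.

```python
import csv
import io

N_OPCODES = 1000  # number of leading opcodes kept per sample, as described above


def build_rows(samples, n_opcodes=N_OPCODES):
    """Turn (family, opcode_list) pairs into fixed-length feature rows.

    Samples with fewer than n_opcodes opcodes are dropped; an empty opcode
    list (for example from a corrupted file) is dropped the same way.
    """
    rows = []
    for family, opcodes in samples:
        # Ignore blank entries and surrounding whitespace.
        cleaned = [op.strip() for op in opcodes if op.strip()]
        if len(cleaned) < n_opcodes:
            continue
        rows.append([family] + cleaned[:n_opcodes])
    return rows


def write_csv(rows, handle, n_opcodes=N_OPCODES):
    """Write a header line followed by one line per sample."""
    writer = csv.writer(handle)
    writer.writerow(["family"] + [f"op_{i}" for i in range(n_opcodes)])
    writer.writerows(rows)


if __name__ == "__main__":
    # Hypothetical toy data with a tiny n_opcodes so the output stays readable.
    demo = [
        ("family_a", ["mov", "push", "pop"]),
        ("family_b", ["mov"]),  # too short, so it is skipped
    ]
    buffer = io.StringIO()
    write_csv(build_rows(demo, n_opcodes=2), buffer, n_opcodes=2)
    print(buffer.getvalue())
```

Keeping every row the same length means each sample contributes exactly the same number of features, which is what most of the models listed below expect as input.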
Models:

Classic:
- random forest
- adaboost
- xgboost
- svm
- bagged svm
- hmm
- bagged hmm
- boosted hmm
- knn
- mlp
- voting

Deep learning:
- cnn
- bagged cnn
- boosted cnn
- lstm
- bagged lstm
- boosted lstm

Voting:
- all bagged and boosted cnns
- all bagged and boosted lstms
- all bagged cnns and bagged lstms
- all boosted cnns and boosted lstms
- all bagged and boosted cnns and lstms
- all deep learning and classic models combined

Results:

- https://drive.google.com/drive/u/1/folders/1vliGOjaUDsqGVy_sq191jorfYquIj7JP
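
The voting ensembles listed under Models combine the predictions of several trained models. As an illustration only, here is a minimal sketch of hard (majority) voting, assuming each model outputs one predicted family label per sample. Whether the project uses hard or soft voting is not stated here, and the function name `majority_vote` and the labels are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label predictions by majority vote.

    predictions is a list with one entry per model; each entry is a list
    holding that model's predicted label for every sample. When several
    labels tie, the label predicted by the earliest model in the list wins.
    """
    if not predictions:
        return []
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("every model must predict the same number of samples")

    voted = []
    # zip(*predictions) yields, for each sample, the labels from all models.
    for votes in zip(*predictions):
        label, _count = Counter(votes).most_common(1)[0]
        voted.append(label)
    return voted


if __name__ == "__main__":
    # Hypothetical predictions from three models on three samples.
    cnn = ["family_a", "family_b", "family_c"]
    lstm = ["family_a", "family_c", "family_c"]
    forest = ["family_b", "family_b", "family_c"]
    print(majority_vote([cnn, lstm, forest]))
```

Soft voting, which averages predicted class probabilities instead of counting labels, would be a reasonable alternative when every model can output probabilities.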
