GitHub - dv-gorasiya/analysis-of-malware-infected-systems-with-hadoop: Analysis of Malware Infected Systems & Classification with Gradient-boosted Tree on Big Data Platform.

This section describes each data processing phase that follows data cleaning, as presented in Figure 1. The big data platform for the project is built in the cloud with 16 GB of RAM, 8 vCPUs and 160 GB of disk space.

Data Loading to Relational MySQL Server:

First, a new database is created with DDL that matches the structure of the incoming data. The cleaned data file is then loaded into MySQL with LOAD DATA INFILE, which reads rows from the file into the target table of the specified database. Once the data is in MySQL, the Hadoop services are started so that they can accept the Sqoop import.
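As a rough illustration of the load step, the sketch below assembles a LOAD DATA INFILE statement in Python. The file path, the table name `malware_data`, the comma delimiter and the header row are all assumptions made for the example; the project's actual DDL and load scripts live in the repository and may differ.

```python
def build_load_statement(csv_path, table, delimiter=",", skip_header=True):
    """Return a MySQL LOAD DATA INFILE statement for a delimited text file."""
    # Escape backslashes and single quotes so the path is a valid SQL string literal.
    safe_path = csv_path.replace("\\", "\\\\").replace("'", "\\'")
    parts = [
        f"LOAD DATA INFILE '{safe_path}'",
        f"INTO TABLE {table}",
        f"FIELDS TERMINATED BY '{delimiter}'",
        "LINES TERMINATED BY '\\n'",
    ]
    if skip_header:
        # Assumes the cleaned file starts with one header row.
        parts.append("IGNORE 1 LINES")
    return "\n".join(parts) + ";"


if __name__ == "__main__":
    # Hypothetical path and table name, for illustration only.
    print(build_load_statement("/tmp/cleaned_malware.csv", "malware_data"))
```

The generated statement can be saved to a .sql file and run through the mysql client, which is one way a master script could trigger this phase.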

Sqoop Import to Non-Relational HDFS:

Sqoop is built to transfer bulk data between relational databases and Hadoop. Using a JDBC connection to MySQL, Sqoop pulls data from the specified server and table and loads it into HDFS at the specified path as a set of files. These files in HDFS are then used for the data analysis and machine learning jobs.
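The import described above could be invoked roughly as follows. This is a minimal sketch that only builds the command line: the JDBC URL, user name, password file, table name and target directory are hypothetical placeholders, and the mapper count of 4 is an arbitrary choice. The flags themselves are standard `sqoop import` arguments.

```python
import shlex


def build_sqoop_import(jdbc_url, username, password_file, table, target_dir,
                       num_mappers=4):
    """Return the argument list for a Sqoop import from MySQL into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,             # JDBC connection string to MySQL
        "--username", username,
        "--password-file", password_file,  # keeps the password off the command line
        "--table", table,                  # source table in MySQL
        "--target-dir", target_dir,        # destination directory in HDFS
        "--num-mappers", str(num_mappers), # parallel map tasks, one output file each
        "--fields-terminated-by", ",",
    ]


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    cmd = build_sqoop_import(
        "jdbc:mysql://localhost:3306/malware_db",
        "hadoop_user",
        "/user/hadoop/.mysql_password",
        "malware_data",
        "/user/hadoop/malware_data",
    )
    print(shlex.join(cmd))
```

Returning a list rather than a single string lets the command be passed to `subprocess.run` without shell quoting problems.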

Data Analysis with SparkSQL, Hive, Pig Latin, MapReduce:

SparkSQL: Data is loaded directly from HDFS into a PySpark dataframe, and a temp view is created on top of the dataframe so that it can be queried. The pyspark.sql library is used to run the queries, and the results are written out with the Databricks spark-csv package.

Hive: HiveQL is used to answer two of the research questions. LOAD DATA INPATH pulls the data from HDFS, and the result is written to an HDFS directory with the INSERT OVERWRITE DIRECTORY command.

Pig Latin: A Pig Latin script filters and groups the data to answer two of the questions. The query results are stored in the specified target directory in HDFS with STORE INTO.

MapReduce: MapReduce jobs written in Java extract summarized information for two research questions using the MapReduce filtering design pattern. The Mapper filters the required records; the filtered output is sorted and passed as an iterator to the Reducer class, which summarizes it.
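The filtering pattern used in the MapReduce jobs can be illustrated with a small in-memory simulation. The project's jobs are written in Java; this sketch uses Python only to show the data flow, and the field names (`os`, `infected`, `country`) and the filter condition are hypothetical rather than taken from the project's research questions.

```python
from itertools import groupby
from operator import itemgetter


def mapper(record, wanted_os="Windows10"):
    """Emit (key, 1) only for records that pass the filter."""
    if record["os"] == wanted_os and record["infected"] == 1:
        yield record["country"], 1


def reducer(key, values):
    """Summarize all values that share a key."""
    return key, sum(values)


def run_job(records):
    """Simulate map, shuffle/sort and reduce on a list of dict records."""
    mapped = [pair for record in records for pair in mapper(record)]
    # The framework sorts mapper output by key before the reduce phase.
    mapped.sort(key=itemgetter(0))
    return [
        reducer(key, (value for _, value in group))
        for key, group in groupby(mapped, key=itemgetter(0))
    ]


if __name__ == "__main__":
    sample = [
        {"os": "Windows10", "infected": 1, "country": "IE"},
        {"os": "Windows7", "infected": 1, "country": "IN"},
        {"os": "Windows10", "infected": 1, "country": "IN"},
    ]
    print(run_job(sample))
```

Filtering in the mapper keeps unwanted records out of the shuffle, so the reducer only sees the data it needs to summarize.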

Loading Analysis Results in HBase & HDFS:

All query results are stored either in HBase, a non-relational database, or in HDFS. The MapReduce, Hive and SparkSQL results are stored in HDFS, whereas the Pig Latin results are pushed to HBase with the ImportTsv tool of HBase (org.apache.hadoop.hbase.mapreduce.ImportTsv).
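A bulk load of the Pig output into HBase might be launched along these lines. The sketch only assembles the command; the table name, column family `cf`, column names and HDFS path are assumptions for illustration. The tab separator matches the default output of Pig's STORE INTO and the default expected by ImportTsv.

```python
def build_importtsv(table, hdfs_input, columns, separator="\t"):
    """Return the argument list for HBase's ImportTsv bulk-load tool."""
    # The first field of each line becomes the row key; the remaining fields
    # are mapped to family:qualifier columns in the order given.
    column_spec = ",".join(["HBASE_ROW_KEY"] + list(columns))
    return [
        "hbase", "org.apache.hadoop.hbase.mapreduce.ImportTsv",
        f"-Dimporttsv.separator={separator}",
        f"-Dimporttsv.columns={column_spec}",
        table,
        hdfs_input,
    ]


if __name__ == "__main__":
    # Hypothetical table, columns and path.
    cmd = build_importtsv(
        "pig_results",
        "/user/hadoop/pig_output",
        ["cf:category", "cf:total"],
    )
    print(cmd)
```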

Classification with Gradient-boosted Tree:

To overcome the limitations of traditional approaches to malware classification, PySpark MLlib is used to run a tree classifier in the distributed Hadoop environment, using the ’pyspark.ml’ libraries on the data present in HDFS. Even with the most basic infrastructure, the gradient-boosted tree classifier in this project achieved 65% accuracy, which is close to the winning accuracy of 69% in the malware detection competition on the same dataset.
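A training run with `pyspark.ml` could look roughly like the sketch below. It is not the project's script: the HDFS path, the label column name `label`, the presence of a header row, the choice to use every numeric column as a feature, the 80/20 split and `maxIter=20` are all assumptions. PySpark is imported inside the training function so that the column-selection helper can be used without a Spark installation.

```python
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def pick_feature_columns(dtypes, label_col):
    """Keep numeric columns other than the label.

    `dtypes` has the shape of DataFrame.dtypes: a list of (name, type) pairs.
    """
    return [name for name, dtype in dtypes
            if dtype in NUMERIC_TYPES and name != label_col]


def train_gbt(hdfs_path, label_col="label", max_iter=20):
    """Train a gradient-boosted tree classifier and return test accuracy."""
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import GBTClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    spark = SparkSession.builder.appName("gbt-sketch").getOrCreate()
    # Assumes a CSV with a header row; rows with missing values are dropped.
    df = spark.read.csv(hdfs_path, header=True, inferSchema=True).na.drop()
    df = df.withColumn(label_col, df[label_col].cast("double"))

    features = pick_feature_columns(df.dtypes, label_col)
    assembled = VectorAssembler(inputCols=features,
                                outputCol="features").transform(df)
    train, test = assembled.randomSplit([0.8, 0.2], seed=42)

    model = GBTClassifier(labelCol=label_col, featuresCol="features",
                          maxIter=max_iter).fit(train)
    evaluator = MulticlassClassificationEvaluator(
        labelCol=label_col, predictionCol="prediction", metricName="accuracy")
    return evaluator.evaluate(model.transform(test))
```

GBTClassifier in `pyspark.ml` handles binary labels only, so the label column is assumed to hold 0 and 1.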

Automation:

One master shell script calls all the child automation scripts (.sql, .hql, .py, .pig, .sh), allowing the user to hold or skip phases of the process as required. The scripts include clean-up commands that remove existing files from HDFS and truncate tables before a load is restarted.
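The idea of a master script with skippable phases can be sketched as follows. The repository implements this as a shell script; the Python version here only illustrates the control flow, and the phase names and child script paths are hypothetical. Only skipping is shown; holding a phase for confirmation is left out.

```python
import subprocess

# Hypothetical phases and child scripts, listed in execution order.
PHASES = [
    ("cleanup", ["bash", "cleanup.sh"]),
    ("mysql_load", ["bash", "mysql_load.sh"]),
    ("sqoop_import", ["bash", "sqoop_import.sh"]),
    ("analysis", ["bash", "run_analysis.sh"]),
    ("classification", ["spark-submit", "gbt_classifier.py"]),
]


def plan(phases, skip=()):
    """Return the phases to run, in order, leaving out the skipped ones."""
    known = {name for name, _ in phases}
    unknown = set(skip) - known
    if unknown:
        raise ValueError(f"unknown phases: {sorted(unknown)}")
    return [(name, cmd) for name, cmd in phases if name not in skip]


def run(phases, skip=(), dry_run=True):
    """Run each planned phase; stop at the first failure."""
    executed = []
    for name, cmd in plan(phases, skip):
        if not dry_run:
            # check=True raises CalledProcessError if a child script fails.
            subprocess.run(cmd, check=True)
        executed.append(name)
    return executed


if __name__ == "__main__":
    # Dry run that skips the load phases, e.g. when data is already in HDFS.
    print(run(PHASES, skip=("mysql_load", "sqoop_import")))
```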

Folders and files:

hive
mapreduce_task1
mapreduce_task2
mysql
pig
py_clean_job
sparksql
.gitattributes
PDA_MASTR_AUTO.sh
README.md
pda_proj_ddls.txt
