dv-gorasiya/analysis-of-malware-infected-systems-with-hadoop

This section describes each data processing phase that follows cleaning, as presented in Figure 1. The big data platform for the project is built in the cloud with 16 GB of RAM, 8 vCPUs, and 160 GB of disk space.

  1. Data Loading to Relational MySQL Server:

First, DDL is written to create a new database and a table that match the structure of the incoming data. The cleaned data file is then pushed to MySQL with the LOAD DATA INFILE statement, which reads the file and loads its rows into the target table. Once the data is loaded in MySQL, the Hadoop services are started so that they can receive a connection from Sqoop.
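As a minimal sketch of the load step, the helper below only builds the text of a LOAD DATA INFILE statement. The file path, the table name `malware.systems`, and the comma-separated layout with a header row are assumptions made for illustration; they are not taken from the project.

```python
def build_load_statement(csv_path: str, table: str, skip_header: bool = True) -> str:
    """Return a MySQL LOAD DATA INFILE statement for a comma-separated file.

    No escaping is performed, so only trusted values should be passed in.
    """
    lines = [
        f"LOAD DATA INFILE '{csv_path}'",
        f"INTO TABLE {table}",
        "FIELDS TERMINATED BY ','",
        "LINES TERMINATED BY '\\n'",
    ]
    if skip_header:
        # Skip the column-name row at the top of the file.
        lines.append("IGNORE 1 LINES")
    return "\n".join(lines) + ";"


# Hypothetical path and table name, used only to show the generated SQL.
statement = build_load_statement("/tmp/cleaned_malware.csv", "malware.systems")
print(statement)
```

The generated text can be saved in a .sql file and run with the mysql client.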

  2. Sqoop Import to Non-Relational HDFS:

Sqoop is built to transfer data between relational databases and Hadoop. Using a JDBC connection to MySQL, Sqoop pulls data from the specified server and table and loads it into HDFS at the specified path as a set of block files. These files in HDFS are then used for the data analysis and machine learning jobs.
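The import described above could be launched roughly as follows. This sketch only assembles the `sqoop import` command line; the JDBC URL, user name, password file, table, and target directory are hypothetical placeholders, and the mapper count of 4 is an arbitrary choice.

```python
import shlex
from typing import List


def build_sqoop_import(jdbc_url: str, username: str, password_file: str,
                       table: str, target_dir: str, num_mappers: int = 4) -> List[str]:
    """Assemble the argument list for a `sqoop import` run."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,             # JDBC connection string for MySQL
        "--username", username,
        "--password-file", password_file,  # file holding the password
        "--table", table,                  # source table in MySQL
        "--target-dir", target_dir,        # destination directory in HDFS
        "--num-mappers", str(num_mappers), # parallel map tasks for the import
    ]


# All values below are placeholders for illustration.
argv = build_sqoop_import(
    jdbc_url="jdbc:mysql://localhost/malware",
    username="hadoop",
    password_file="/user/hadoop/.mysql_password",
    table="systems",
    target_dir="/user/hadoop/malware/systems",
)
print(shlex.join(argv))
```

Each mapper writes its own part file under the target directory, which is why the imported table appears in HDFS as a set of files rather than one.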

  3. Data Analysis with SparkSQL, Hive, Pig Latin, MapReduce:

SparkSQL: Data is loaded directly from HDFS into a PySpark dataframe, and a temporary view is created on top of the dataframe for querying. The pyspark.sql library is used to run the queries, and the results are saved with the Databricks spark-csv package.

Hive: HiveQL is used to answer two of the research questions. LOAD DATA INPATH pulls the data from HDFS, and the result is written to an HDFS directory with the INSERT OVERWRITE DIRECTORY command.

Pig Latin: A Pig Latin script filters and groups the data to answer two of the questions. The query results are stored back in the specified target directory in HDFS with the STORE INTO statement.

MapReduce: MapReduce is used with Java to extract summarized information for two research questions, following the MapReduce filtering design pattern. The Mapper filters the required records, and the sorted, filtered values are passed to the Reducer as an iterator, where they are summarized.
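To make the MapReduce filtering pattern concrete, here is a small in-memory simulation. The project implements this step in Java on Hadoop; Python is used here only as a sketch, and the three-column record layout (`machine_id,os_version,has_detections`) is an assumption, not the project's schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Filter step: emit (os_version, 1) only for records flagged as infected."""
    fields = line.split(",")
    if len(fields) != 3:
        return  # drop malformed records
    _machine_id, os_version, has_detections = fields
    if has_detections == "1":
        yield os_version, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> List[Tuple[str, List[int]]]:
    """Group values by key and sort the keys, as the framework does between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reducer(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Summarize step: count the filtered records for each key."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(mapped))


# Made-up records for illustration only.
records = ["m1,win10,1", "m2,win10,0", "m3,win7,1", "m4,win10,1", "bad line"]
infected_by_os = run_job(records)
print(infected_by_os)
```

Keeping the filter in the mapper means that only matching records are shuffled to the reducer, which is the main benefit of the pattern.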

  4. Loading Analysis Results in HBase & HDFS:

All query results are stored in HDFS or in HBase, a non-relational database. The MapReduce, Hive, and SparkSQL results are stored in HDFS, whereas the Pig Latin results are pushed to HBase with HBase's ImportTsv tool (org.apache.hadoop.hbase.mapreduce.ImportTsv).
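A sketch of how an ImportTsv load might be invoked is shown below. The HBase table name, the column family `cf`, and the HDFS input path are hypothetical, and the comma separator is an assumption about how the Pig output was written.

```python
import shlex
from typing import List, Sequence


def build_importtsv(table: str, columns: Sequence[str], hdfs_input: str,
                    separator: str = ",") -> List[str]:
    """Assemble the command line for HBase's ImportTsv MapReduce job."""
    if "HBASE_ROW_KEY" not in columns:
        # ImportTsv needs one input column mapped to the row key.
        raise ValueError("columns must include HBASE_ROW_KEY")
    return [
        "hbase", "org.apache.hadoop.hbase.mapreduce.ImportTsv",
        f"-Dimporttsv.separator={separator}",
        f"-Dimporttsv.columns={','.join(columns)}",
        table,
        hdfs_input,
    ]


# Placeholder table, column family, and path for illustration.
importtsv_argv = build_importtsv(
    table="pig_results",
    columns=["HBASE_ROW_KEY", "cf:count"],
    hdfs_input="/user/hadoop/malware/pig_output",
)
print(shlex.join(importtsv_argv))
```

The target table and its column family have to exist in HBase before the job runs.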

  5. Classification with Gradient-boosted Tree:

To overcome the limitations of traditional approaches to malware classification, PySpark MLlib is used to run a tree classifier in the distributed Hadoop environment, using the 'pyspark.ml' libraries on the data stored in HDFS. Even with the most basic infrastructure, the gradient-boosted tree classifier in this project achieved 65% accuracy, which is comparatively close to the 69% accuracy of the winning entry in the malware detection competition on the same dataset.
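For readers unfamiliar with gradient boosting, the toy below shows the core idea in plain Python: each round fits a one-split tree (a stump) to the current residuals and adds a scaled copy of it to the model. This is not the pyspark.ml classifier used in the project; the squared-error loss, the single made-up feature, and all parameter values are assumptions chosen to keep the sketch short.

```python
from typing import Dict, List, Tuple

Stump = Tuple[int, float, float, float]  # (feature index, threshold, left value, right value)


def fit_stump(X: List[List[float]], residuals: List[float]) -> Stump:
    """Find the single split that best predicts the residuals (squared error)."""
    best, best_err = (0, 0.0, 0.0, 0.0), float("inf")
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lv = sum(left) / len(left) if left else 0.0
            rv = sum(right) / len(right) if right else 0.0
            err = sum((r - (lv if row[j] <= t else rv)) ** 2
                      for row, r in zip(X, residuals))
            if err < best_err:
                best, best_err = (j, t, lv, rv), err
    return best


def stump_predict(stump: Stump, row: List[float]) -> float:
    j, t, lv, rv = stump
    return lv if row[j] <= t else rv


def fit_gbt(X: List[List[float]], y: List[int],
            n_rounds: int = 10, learning_rate: float = 0.5) -> Dict:
    """Boosting loop: start from the mean label, then fit stumps to residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps: List[Stump] = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, row)
                 for p, row in zip(preds, X)]
    return {"base": base, "stumps": stumps, "learning_rate": learning_rate}


def predict(model: Dict, row: List[float]) -> int:
    score = model["base"] + model["learning_rate"] * sum(
        stump_predict(s, row) for s in model["stumps"])
    return 1 if score >= 0.5 else 0


# Made-up data: one numeric feature, label 1 = malware detected.
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gbt(X, y)
accuracy = sum(predict(model, row) == label for row, label in zip(X, y)) / len(y)
print(accuracy)
```

On real data the accuracy would be measured on a held-out split rather than on the training rows, as is done here only for brevity.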

  6. Automation:

One master shell script calls all the child automation scripts (.sql, .hql, .py, .pig, .sh), allowing the user to hold or skip individual phases as required. The scripts include clean-up commands that remove existing files from HDFS and truncate the tables before a load is restarted.
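The master-script idea can be sketched as a small driver that runs the phases in order and lets the caller skip some of them. The project uses a shell script for this; the Python version below is only an illustration, and every phase name and script file name in it is hypothetical.

```python
import subprocess
from typing import Callable, Iterable, List, Optional, Sequence, Tuple

Phase = Tuple[str, List[str]]  # (phase name, command to run)

# Hypothetical phase names and script files, listed in execution order.
PHASES: List[Phase] = [
    ("mysql_load", ["bash", "load_mysql.sh"]),
    ("sqoop_import", ["bash", "sqoop_import.sh"]),
    ("hive", ["hive", "-f", "analysis.hql"]),
    ("pig", ["pig", "-f", "analysis.pig"]),
    ("spark", ["spark-submit", "analysis.py"]),
]


def run_pipeline(phases: Sequence[Phase], skip: Iterable[str] = (),
                 runner: Optional[Callable[[List[str]], object]] = None) -> List[str]:
    """Run each phase in order, skipping the named ones; return the phases that ran."""
    if runner is None:
        # Default: execute the command and stop the pipeline on a non-zero exit.
        def runner(cmd: List[str]) -> object:
            return subprocess.run(cmd, check=True)
    skipped = set(skip)
    executed: List[str] = []
    for name, command in phases:
        if name in skipped:
            continue
        runner(command)
        executed.append(name)
    return executed


# Dry run: record the commands instead of executing them.
dry_run_log: List[List[str]] = []
executed = run_pipeline(PHASES, skip={"pig"}, runner=dry_run_log.append)
print(executed)
```

Passing a recording function as `runner` gives a dry run, which is a convenient way to check the phase order before touching MySQL or HDFS.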

About

Analysis of Malware Infected Systems & Classification with Gradient-boosted Tree on Big Data Platform.
