Compile the project and generate the JAR file (See Cloning and Compiling for more information).
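If the project is built with sbt-assembly (an assumption based on the assembly JAR name; use your build tool's equivalent otherwise), the fat JAR can be produced with:
sbt clean assembly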
- Download and install the Hortonworks Sandbox from Hortonworks Downloads, and run it on VMware or VirtualBox.
- Once the Sandbox is installed and running, follow its instructions to connect to the machine using SSH.
- After connecting to the machine using SSH, create the required HDFS directories.
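For reference, the SSH connection typically looks like the following (port 2222 here matches the scp command used later; adjust the host/IP to your Sandbox setup):
ssh -p 2222 root@<hortonworks_ssh_ip>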
hadoop fs -mkdir logprocess
hadoop fs -mkdir logprocess/input
- Copy the files (input log files and the JAR) from the local machine to the VM by running the following command.
scp -P 2222 <path/input_file> root@<hortonworks_ssh_ip>:<Destination Path>
- Transfer the files to the HDFS file system using the following command.
hadoop fs -put <input_file> logprocess/input
- Execute the MapReduce jobs. See the documentation for Job1, Job2, Job3, and Job4 for more information on how to execute each task.
This is the general pattern for executing the different MapReduce jobs.
hadoop jar LogProcessing-MapReduce-assembly [input-path] [output-path] [job-key] [pattern-key]
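As a hedged example, a single run might look like the line below; the output directory name is arbitrary (it must not already exist in HDFS), and the actual job-key and pattern-key values are listed in each job's documentation.
hadoop jar LogProcessing-MapReduce-assembly logprocess/input logprocess/output_job1 [job-key] [pattern-key]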
- Read the output of the executed MapReduce jobs.
hadoop fs -cat logprocess/<output_dir>/*
- To export the output from the HDFS file system to the local file system, run:
hadoop fs -cat logprocess/<output_dir>/* > output.csv
- Transfer the output files from the remote machine to the local machine.
scp -P 2222 root@<hortonworks_ssh_ip>:output.csv <dest_path_on_local_machine>
- Create a cluster in AWS EMR. (Here's a guide: How to Create and Run EMR Cluster)
- Make sure the SSH port (22) is open in the security group settings of the master node.
- Connect to the cluster from the local machine using PuTTY and the Amazon key pair.
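If you prefer the command line over the AWS console, a comparable cluster can also be created with the AWS CLI; the cluster name, release label, instance type/count, and key name below are illustrative placeholders, not values prescribed by this project.
# Command to be run on Local Terminal
aws emr create-cluster --name LogProcessing --release-label emr-5.30.0 --applications Name=Hadoop --instance-type m5.xlarge --instance-count 3 --ec2-attributes KeyName=<key_pair_name> --use-default-roles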
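This is normally done from the EC2 console; as a sketch, the same rule can be added with the AWS CLI (the security group ID and source CIDR are placeholders for your own values).
# Command to be run on Local Terminal
aws ec2 authorize-security-group-ingress --group-id <master_security_group_id> --protocol tcp --port 22 --cidr <your_public_ip>/32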
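If you are not on Windows (or simply prefer OpenSSH), the same connection can be made without PuTTY, using the same placeholders as the SFTP command below:
# Command to be run on Local Terminal
ssh -i <aws_key_file.pem> hadoop@<aws_cluster_public_dns_name>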
- Create HDFS directories required to run this project.
# Commands to be run on PuTTY Terminal
hadoop fs -mkdir logprocess
hadoop fs -mkdir logprocess/input
- In another terminal, open an SFTP connection to the master node over SSH. (Note: You will need the .pem key to establish this OpenSSH connection.)
# Command to be run on Local Terminal
sftp -i <aws_key_file.pem> hadoop@<aws_cluster_public_dns_name>
- Once the SFTP connection is established, transfer the log files from your local machine to the master node using the following command.
# Command to be run on Local Terminal
put <src_path>/*.log <dest_path>
- Transfer the JAR file to the master node as well.
# Command to be run on Local Terminal
put <src_path>/*.jar <dest_path>
- Store the files in the HDFS directory. On the master node, navigate to the directory where the log files are located.
# PuTTY Terminal
hadoop fs -put *.log logprocess/input
- Execute the MapReduce program. See the documentation for Job1, Job2, Job3, and Job4 for more information on how to execute each task.
This is the pattern for running LogProcessing MapReduce programs.
# PuTTY Terminal
hadoop jar LogProcessing-MapReduce-assembly [input-path] [output-path] [job-key] [pattern-key]
- Once the output is generated, we can view the results of the jobs directly in the terminal using the following command.
# PuTTY Terminal
hadoop fs -cat logprocess/<output_dir>/*
- To export the output from the HDFS file system to the local file system, run:
# PuTTY Terminal
hadoop fs -cat logprocess/<output_dir>/* > output.csv
- To transfer the output files from the remote machine to the local machine, use the SFTP session.
# Command to be run on Local Terminal with SFTP connection established
get <output_file_name> <local_directory_path>
# To transfer directories
get -r <output_directory_name> <local_directory_path>