Compile the project and generate the JAR file (See Cloning and Compiling for more information).
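If the project is built with sbt-assembly (an assumption based on the assembly JAR name; use your build tool's equivalent otherwise), the fat JAR can be produced with:
sbt clean assembly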
- Download and install the Hortonworks Sandbox from Hortonworks Downloads, and run it on VMware or VirtualBox.
- Once the Sandbox is installed and running, follow its instructions to connect to the machine using SSH.
- After connecting to the machine using SSH, create the required HDFS directories.
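For reference, the SSH connection typically looks like the following (port 2222 here matches the scp command used later; adjust the host/IP to your Sandbox setup):
ssh -p 2222 root@<hortonworks_ssh_ip>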
hadoop fs -mkdir logprocess
hadoop fs -mkdir logprocess/input
- Copy the files (input log files and the JAR) from the local machine to the VM by running the following command.
scp -P 2222 <path/input_file> root@<hortonworks_ssh_ip>:<Destination Path>
- Transfer the files to the HDFS file system using the following command.
hadoop fs -put <input_file> logprocess/input
- Execute the MapReduce jobs. See the documentation for Job1, Job2, Job3, and Job4 for more information on how to execute each task.
This is the general pattern for executing the different MapReduce jobs.
hadoop jar LogProcessing-MapReduce-assembly [input-path] [output-path] [job-key] [pattern-key]
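As a hedged example, a single run might look like the line below; the output directory name is arbitrary (it must not already exist in HDFS), and the actual job-key and pattern-key values are listed in each job's documentation.
hadoop jar LogProcessing-MapReduce-assembly logprocess/input logprocess/output_job1 [job-key] [pattern-key]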
- Read the output of the executed MapReduce jobs.
hadoop fs -cat logprocess/<output_dir>/*
- To export the output from the HDFS file system to the local file system, run:
hadoop fs -cat logprocess/<output_dir>/* > output.csv
- Transfer the output files from the remote machine to the local machine.
scp -P 2222 root@<hortonworks_ssh_ip>:output.csv <dest_path_on_local_machine>
- Create a cluster in AWS EMR. (Here's a guide: How to Create and Run EMR Cluster)
- Make sure the SSH port (22) is open in the security group settings of the master node.
- Connect to the cluster from the local machine using PuTTY and the Amazon key pair.
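If you prefer the command line over the AWS console, a comparable cluster can also be created with the AWS CLI; the cluster name, release label, instance type/count, and key name below are illustrative placeholders, not values prescribed by this project.
# Command to be run on Local Terminal
aws emr create-cluster --name LogProcessing --release-label emr-5.30.0 --applications Name=Hadoop --instance-type m5.xlarge --instance-count 3 --ec2-attributes KeyName=<key_pair_name> --use-default-roles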
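This is normally done from the EC2 console; as a sketch, the same rule can be added with the AWS CLI (the security group ID and source CIDR are placeholders for your own values).
# Command to be run on Local Terminal
aws ec2 authorize-security-group-ingress --group-id <master_security_group_id> --protocol tcp --port 22 --cidr <your_public_ip>/32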
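If you are not on Windows (or simply prefer OpenSSH), the same connection can be made without PuTTY, using the same placeholders as the SFTP command below:
# Command to be run on Local Terminal
ssh -i <aws_key_file.pem> hadoop@<aws_cluster_public_dns_name>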
- Create HDFS directories required to run this project.
# Commands to be run on PuTTY Terminal
hadoop fs -mkdir logprocess
hadoop fs -mkdir logprocess/input
- In another terminal, open an SFTP connection to the master node over SSH. (Note: You will need the .pem key to establish this OpenSSH connection.)
# Command to be run on Local Terminal
sftp -i <aws_key_file.pem> hadoop@<aws_cluster_public_dns_name>
- Once the SFTP connection is established, transfer the log files from your local machine to the master node using the following command.
# Command to be run on Local Terminal
put <src_path>/*.log <dest_path>
- Transfer the JAR file to the master node as well.
# Command to be run on Local Terminal
put <src_path>/*.jar <dest_path>
- Store the files in the HDFS directory. On the master node, navigate to the directory where the log files are located.
# PuTTY Terminal
hadoop fs -put *.log logprocess/input
- Execute the MapReduce program. See the documentation for Job1, Job2, Job3, and Job4 for more information on how to execute each task.
This is the pattern for running LogProcessing MapReduce programs.
# PuTTY Terminal
hadoop jar LogProcessing-MapReduce-assembly [input-path] [output-path] [job-key] [pattern-key]
- Once the output is generated, we can view the results of the jobs directly in the terminal using the following command.
# PuTTY Terminal
hadoop fs -cat logprocess/<output_dir>/*
- To export the output from the HDFS file system to the local file system, run:
# PuTTY Terminal
hadoop fs -cat logprocess/<output_dir>/* > output.csv
- To transfer the output files from the remote machine to the local machine, use the SFTP session.
# Command to be run on Local Terminal with SFTP connection established
get <output_file_name> <local_directory_path>
# To transfer directories
get -r <output_directory_name> <local_directory_path>