1 | | -Michigan Hadoop CLI |
2 | | -================= |
3 | | - |
4 | | -Michigan Hadoop CLI (`madoop`) is a light weight, Python-based Hadoop command line interface. |
| 1 | +Madoop: Michigan Hadoop |
| 2 | +======================= |
5 | 3 |
| 4 | +Michigan Hadoop (`madoop`) is a lightweight MapReduce framework for education. Madoop implements the [Hadoop Streaming](https://hadoop.apache.org/docs/r1.2.1/streaming.html) interface. Madoop is implemented in Python and runs on a single machine.
6 | 5 |
7 | 6 | ## Quick start |
| 7 | +Install and run an example word count MapReduce program. |
| 8 | +```console |
| 9 | +$ pip install madoop |
| 10 | +$ madoop \ |
| 11 | + -input example/input \ |
| 12 | + -output output \ |
| 13 | + -mapper example/map.py \ |
| 14 | + -reducer example/reduce.py |
| 15 | +$ cat output/part-* |
| 16 | +autograder 2 |
| 17 | +world 1 |
| 18 | +eecs485 1 |
| 19 | +goodbye 1 |
| 20 | +hello 3 |
| 21 | +``` |
8 | 22 |
| 23 | + |
| 24 | +## Example |
| 25 | +This section walks through the Quick start example again in more detail. For an in-depth explanation of the map and reduce code, see the [Hadoop Streaming tutorial](https://eecs485staff.github.io/p5-search-engine/hadoop_streaming.html).
| 26 | + |
| 27 | +### Install
| 28 | +Install Madoop. Your version might be different. |
9 | 29 | ```console |
10 | 30 | $ pip install madoop |
11 | | -$ madoop |
| 31 | +$ madoop --version |
| 32 | +Madoop 0.1.0 |
12 | 33 | ``` |
13 | 34 |
| 35 | +### Input |
| 36 | +We've provided two small input files. |
| 37 | +```console |
| 38 | +$ cat example/input/input01.txt |
| 39 | +hello world |
| 40 | +hello eecs485 |
| 41 | +$ cat example/input/input02.txt |
| 42 | +goodbye autograder |
| 43 | +hello autograder |
| 44 | +``` |
| 45 | + |
| 46 | +### Run |
| 47 | +Run a MapReduce word count job. By default, there will be one mapper for each input file. Large input files may be segmented and processed by multiple mappers.
| 48 | +- `-input DIRECTORY` input directory |
| 49 | +- `-output DIRECTORY` output directory |
| 50 | +- `-mapper FILE` mapper executable |
| 51 | +- `-reducer FILE` reducer executable |
| 52 | +```console |
| 53 | +$ madoop \ |
| 54 | + -input example/input \ |
| 55 | + -output output \ |
| 56 | + -mapper example/map.py \ |
| 57 | + -reducer example/reduce.py |
| 58 | +``` |
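
The contents of `example/map.py` and `example/reduce.py` are not shown here. As a rough illustration of what a word count job does, the following is a minimal sketch, not the actual example code: the function names `map_lines`, `reduce_lines`, and `word_count` are hypothetical, and the tab-separated `word<TAB>count` line format is assumed from the Hadoop Streaming convention.

```python
"""Word count in the Hadoop Streaming style (illustrative sketch only)."""
import itertools


def map_lines(lines):
    """Emit one "word<TAB>1" line for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Sum the counts per word; input must be sorted so equal keys are adjacent."""
    def key_of(line):
        return line.partition("\t")[0]

    for word, group in itertools.groupby(sorted_lines, key=key_of):
        total = sum(int(line.partition("\t")[2]) for line in group)
        yield f"{word}\t{total}"


def word_count(lines):
    """Chain map, sort, and reduce in memory (hypothetical helper)."""
    return list(reduce_lines(sorted(map_lines(lines))))
```

In a real streaming executable, the mapper and the reducer would each read lines from `sys.stdin` and print their results to standard output instead of being called as functions.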
| 59 | + |
| 60 | +### Output |
| 61 | +Concatenate and print output. The concatenation of multiple output files may not be sorted. |
| 62 | +```console |
| 63 | +$ ls output |
| 64 | +part-00000 part-00001 part-00002 part-00003 |
| 65 | +$ cat output/part-* |
| 66 | +autograder 2 |
| 67 | +world 1 |
| 68 | +eecs485 1 |
| 69 | +goodbye 1 |
| 70 | +hello 3 |
| 71 | +``` |
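
One plausible reason the concatenated output is not globally sorted is that keys are split across several reducers, each writing its own `part-*` file. The sketch below illustrates that idea under stated assumptions: hashing keys with CRC32 is an illustrative choice, not necessarily how Madoop assigns keys, and `partition_of` and `partition_lines` are hypothetical names.

```python
import zlib
from collections import defaultdict


def partition_of(key, num_partitions):
    """Pick a partition for a key with a deterministic hash (CRC32, assumed)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_lines(lines, num_partitions):
    """Group "key<TAB>value" lines by partition and sort each partition."""
    partitions = defaultdict(list)
    for line in lines:
        key = line.partition("\t")[0]
        partitions[partition_of(key, num_partitions)].append(line)
    # Each partition is sorted on its own; their concatenation need not be.
    return [sorted(partitions[i]) for i in range(num_partitions)]
```

Every line with the same key lands in the same partition, so each reducer sees complete groups, but the order across partitions depends on the hash rather than on the keys.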
| 72 | + |
| 73 | +## Comparison with Apache Hadoop and CLI |
| 74 | +Madoop implements a subset of the [Hadoop Streaming](https://hadoop.apache.org/docs/r1.2.1/streaming.html) interface. You can simulate the Hadoop Streaming interface at the command line with `cat` and `sort`. |
| 75 | + |
| 76 | +Here's how to run our example MapReduce program on Apache Hadoop. |
| 77 | +```console |
| 78 | +$ hadoop \ |
| 79 | + jar path/to/hadoop-streaming-X.Y.Z.jar \
| 80 | + -input example/input \ |
| 81 | + -output output \ |
| 82 | + -mapper example/map.py \ |
| 83 | + -reducer example/reduce.py |
| 84 | +$ cat output/part-* |
| 85 | +``` |
| 86 | + |
| 87 | +Here's how to run our example MapReduce program at the command line using `cat` and `sort`. |
| 88 | +```console |
| 89 | +$ cat example/input/* | example/map.py | sort | example/reduce.py
| 90 | +``` |
| 91 | + |
| 92 | +| Madoop | Hadoop | `cat`/`sort` | |
| 93 | +|-|-|-| |
| 94 | +| Some Hadoop options | All Hadoop options | No Hadoop options |
| 95 | +| Multiple mappers and reducers | Multiple mappers and reducers | One mapper, one reducer | |
| 96 | +| Single machine | Many machines | Single machine |
| 97 | +| `jar hadoop-streaming-X.Y.Z.jar` argument ignored | `jar hadoop-streaming-X.Y.Z.jar` argument required | No arguments | |
| 98 | +| Lines within a group are sorted | Lines within a group are sorted | Lines within a group are sorted | |
| 99 | + |
| 100 | + |
14 | 101 | ## Contributing |
15 | 102 | Contributions from the community are welcome! Check out the [guide for contributing](CONTRIBUTING.md). |
16 | 103 |
| 104 | + |
17 | 105 | ## Acknowledgments |
18 | | -Michigan Hadoop CLI is written by Andrew DeOrio <[email protected]>. |
| 106 | +Michigan Hadoop is written by Andrew DeOrio <[email protected]>. |