Commit 7ddb070

Merge remote-tracking branch 'origin/develop'
2 parents 7f83fc7 + faaf46d commit 7ddb070

35 files changed: 545 additions and 376 deletions

.github/workflows/continuous_integration.yml

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -4,7 +4,7 @@ name: CI
44
# Define conditions for when to run this action
55
on:
66
pull_request: # Run on all pull requests
7-
push: # Run on all pushes to master
7+
push: # Run on all pushes to main
88
branches:
99
- main
1010
- develop

.gitignore

Lines changed: 1 addition & 2 deletions
@@ -19,7 +19,6 @@ build/
1919

2020
# Test tools
2121
/.tox/
22-
/tmp/
2322
/.coverage*
2423
*,cover
2524

@@ -35,4 +34,4 @@ build/
3534
*.DS_Store
3635

3736
# Output directories
38-
*output
37+
/output

CONTRIBUTING.md

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
1-
Contributing to michigan-hadoop
2-
========================
1+
Contributing to Madoop
2+
======================
33

44
## Install development environment
55
Set up a development virtual environment.

MANIFEST.in

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ include README.md
44
include CONTRIBUTING.md
55
include .pylintrc
66
graft tests
7-
include test
7+
graft example
88

99
# Avoid dev and binary files
1010
exclude tox.ini

README.md

Lines changed: 94 additions & 6 deletions
@@ -1,18 +1,106 @@
1-
Michigan Hadoop CLI
2-
=================
3-
4-
Michigan Hadoop CLI (`madoop`) is a light weight, Python-based Hadoop command line interface.
1+
Madoop: Michigan Hadoop
2+
=======================
53

4+
Michigan Hadoop (`madoop`) is a lightweight MapReduce framework for education. Madoop implements the [Hadoop Streaming](https://hadoop.apache.org/docs/r1.2.1/streaming.html) interface. Madoop is implemented in Python and runs on a single machine.
65

76
## Quick start
7+
Install and run an example word count MapReduce program.
8+
```console
9+
$ pip install madoop
10+
$ madoop \
11+
-input example/input \
12+
-output output \
13+
-mapper example/map.py \
14+
-reducer example/reduce.py
15+
$ cat output/part-*
16+
autograder 2
17+
world 1
18+
eecs485 1
19+
goodbye 1
20+
hello 3
21+
```
822

23+
24+
## Example
25+
We'll walk through the example in the Quick Start again, providing more detail. For an in-depth explanation of the map and reduce code, see the [Hadoop Streaming tutorial](https://eecs485staff.github.io/p5-search-engine/hadoop_streaming.html).
26+
27+
## Install
28+
Install Madoop. Your version might be different.
929
```console
1030
$ pip install madoop
11-
$ madoop
31+
$ madoop --version
32+
Madoop 0.1.0
1233
```
1334

35+
### Input
36+
We've provided two small input files.
37+
```console
38+
$ cat example/input/input01.txt
39+
hello world
40+
hello eecs485
41+
$ cat example/input/input02.txt
42+
goodbye autograder
43+
hello autograder
44+
```
45+
46+
### Run
47+
Run a MapReduce word count job. By default, there will be one mapper for each input file. Large input files may be segmented and processed by multiple mappers.
48+
- `-input DIRECTORY` input directory
49+
- `-output DIRECTORY` output directory
50+
- `-mapper FILE` mapper executable
51+
- `-reducer FILE` reducer executable
52+
```console
53+
$ madoop \
54+
-input example/input \
55+
-output output \
56+
-mapper example/map.py \
57+
-reducer example/reduce.py
58+
```
59+
60+
### Output
61+
Concatenate and print output. The concatenation of multiple output files may not be sorted.
62+
```console
63+
$ ls output
64+
part-00000 part-00001 part-00002 part-00003
65+
$ cat output/part-*
66+
autograder 2
67+
world 1
68+
eecs485 1
69+
goodbye 1
70+
hello 3
71+
```
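
Why the concatenated output may be unsorted can be sketched as follows. This is an illustration only, not Madoop's actual code: the `partition` and `run_reducers` functions and the MD5-based assignment of keys to reducers are assumptions made for the example. Each reducer sorts its own part file, but which part a key lands in depends on its hash, so reading the parts in order need not give a globally sorted result.

```python
import hashlib
from collections import defaultdict


def partition(key, num_reducers):
    """Assign a key to a reducer with a stable hash (hypothetical scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_reducers


def run_reducers(counts, num_reducers):
    """Return one sorted list of output lines per reducer."""
    parts = defaultdict(list)
    for key, value in counts.items():
        parts[partition(key, num_reducers)].append(f"{key}\t{value}")
    # Each part is sorted on its own, like a single part-0000N file.
    return [sorted(parts[i]) for i in range(num_reducers)]


# Word counts from the example above.
counts = {"autograder": 2, "world": 1, "eecs485": 1, "goodbye": 1, "hello": 3}
parts = run_reducers(counts, num_reducers=4)

# Equivalent of `cat output/part-*`: sorted within a part, not across parts.
concatenated = [line for part in parts for line in part]
```

If a sorted listing is needed, sort the concatenated lines afterwards, for example by piping `cat output/part-*` through `sort`.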
72+
73+
## Comparison with Apache Hadoop and CLI
74+
Madoop implements a subset of the [Hadoop Streaming](https://hadoop.apache.org/docs/r1.2.1/streaming.html) interface. You can simulate the Hadoop Streaming interface at the command line with `cat` and `sort`.
75+
76+
Here's how to run our example MapReduce program on Apache Hadoop.
77+
```console
78+
$ hadoop \
79+
jar path/to/hadoop-streaming-X.Y.Z.jar \
80+
-input example/input \
81+
-output output \
82+
-mapper example/map.py \
83+
-reducer example/reduce.py
84+
$ cat output/part-*
85+
```
86+
87+
Here's how to run our example MapReduce program at the command line using `cat` and `sort`.
88+
```console
89+
$ cat input/* | ./map.py | sort | ./reduce.py
90+
```
91+
92+
| Madoop | Hadoop | `cat`/`sort` |
93+
|-|-|-|
94+
| Implement some Hadoop options | All Hadoop options | No Hadoop options |
95+
| Multiple mappers and reducers | Multiple mappers and reducers | One mapper, one reducer |
96+
| Single machine | Many machines | Single machine |
97+
| `jar hadoop-streaming-X.Y.Z.jar` argument ignored | `jar hadoop-streaming-X.Y.Z.jar` argument required | No arguments |
98+
| Lines within a group are sorted | Lines within a group are sorted | Lines within a group are sorted |
99+
100+
14101
## Contributing
15102
Contributions from the community are welcome! Check out the [guide for contributing](CONTRIBUTING.md).
16103

104+
17105
## Acknowledgments
18-
Michigan Hadoop CLI is written by Andrew DeOrio <[email protected]>.
106+
Michigan Hadoop is written by Andrew DeOrio <[email protected]>.

example/input/input01.txt

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
1+
hello world
2+
hello eecs485

example/input/input02.txt

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
1+
goodbye autograder
2+
hello autograder

example/map.py

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
1+
#!/usr/bin/env python3
2+
"""Word count mapper."""
3+
import sys
4+
5+
6+
for line in sys.stdin:
7+
    words = line.split()
8+
    for word in words:
9+
        print(f"{word}\t1")

example/reduce.py

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
1+
#!/usr/bin/env python3
2+
"""Word count reducer."""
3+
import sys
4+
import itertools
5+
6+
7+
def main():
8+
    """Divide sorted lines into groups that share a key."""
9+
    for key, group in itertools.groupby(sys.stdin, keyfunc):
10+
        reduce_one_group(key, group)
11+
12+
13+
def keyfunc(line):
14+
    """Return the key from a TAB-delimited key-value pair."""
15+
    return line.partition("\t")[0]
16+
17+
18+
def reduce_one_group(key, group):
19+
    """Reduce one group."""
20+
    word_count = 0
21+
    for line in group:
22+
        count = line.partition("\t")[2]
23+
        word_count += int(count)
24+
    print(f"{key}\t{word_count}")
25+
26+
27+
if __name__ == "__main__":
28+
    main()
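
To see how the mapper and reducer above fit together, here is a minimal in-process sketch of the `cat input/* | ./map.py | sort | ./reduce.py` pipeline from the README. It is not part of this commit and does not use Madoop. The names `map_words`, `reduce_counts`, `word_count`, and `EXAMPLE_INPUT` are hypothetical, and Python's `sorted` stands in for the shell `sort`, whose ordering can differ depending on locale.

```python
import itertools


def map_words(lines):
    """Mirror example/map.py: emit one "word<TAB>1" line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_counts(sorted_lines):
    """Mirror example/reduce.py: sum the counts in each run of equal keys."""
    def keyfunc(line):
        return line.partition("\t")[0]

    # groupby only merges adjacent lines, which is why the input must be sorted.
    for key, group in itertools.groupby(sorted_lines, keyfunc):
        total = sum(int(line.partition("\t")[2]) for line in group)
        yield f"{key}\t{total}"


def word_count(lines):
    """Map, sort, then reduce, in the same order as the shell pipeline."""
    return list(reduce_counts(sorted(map_words(lines))))


# Contents of example/input/input01.txt and input02.txt from this commit.
EXAMPLE_INPUT = [
    "hello world",
    "hello eecs485",
    "goodbye autograder",
    "hello autograder",
]

if __name__ == "__main__":
    for output_line in word_count(EXAMPLE_INPUT):
        print(output_line)
```

The sort step matters: `itertools.groupby` starts a new group whenever the key changes, so unsorted input would give several partial counts for the same word.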

madoop/__init__.py

Lines changed: 6 additions & 2 deletions
@@ -1,3 +1,7 @@
1-
"""Michigan Hadoop CLI API."""
1+
"""Madoop API.
22
3-
from .__main__ import hadoop
3+
Andrew DeOrio <[email protected]>
4+
5+
"""
6+
from .mapreduce import mapreduce
7+
from .exceptions import MadoopError
