Commit 5067e94

Merge remote-tracking branch 'origin/develop'
2 parents 4c21100 + a65488c commit 5067e94

5 files changed: +27 -20 lines changed

CONTRIBUTING.md

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ Contributing to Madoop
 Set up a development virtual environment.
 ```console
 $ python3 -m venv .venv
-$ source env/bin/activate
+$ source .venv/bin/activate
 $ pip install --editable .[dev,test]
 ```

README.md

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ Madoop: Michigan Hadoop

 Michigan Hadoop (`madoop`) is a light weight MapReduce framework for education. Madoop implements the [Hadoop Streaming](https://hadoop.apache.org/docs/r1.2.1/streaming.html) interface. Madoop is implemented in Python and runs on a single machine.

-For an in-depth explanation of how to write MapReduce programs in Python for Hadoop Streaming, see our [Hadoop Streaming tutorial](README_hadoop_streaming.md).
+For an in-depth explanation of how to write MapReduce programs in Python for Hadoop Streaming, see our [Hadoop Streaming tutorial](README_Hadoop_Streaming.md).


 ## Quick start

README_Hadoop_Streaming.md

Lines changed: 23 additions & 17 deletions
@@ -22,12 +22,11 @@ example

 Execute the example MapReduce program using Madoop and show the output.
 ```console
-$ cd example
 $ madoop \
-    -input input \
-    -output output \
-    -mapper map.py \
-    -reducer reduce.py
+    -input example/input \
+    -output example/output \
+    -mapper example/map.py \
+    -reducer example/reduce.py

 $ cat output/part-*
 Goodbye 1
@@ -40,6 +39,9 @@ Hello 2
 ## Overview
 [Hadoop Streaming](https://hadoop.apache.org/docs/r1.2.1/streaming.html) is a MapReduce API that works with any programming language. The mapper and the reducer are executables that read input from stdin and write output to stdout.

+## Partition
+The MapReduce framework begins by partitioning (splitting) the input. If the input size is large, a real MapReduce framework will break it up into smaller chunks. Each Map execution will process one input partition. In this tutorial, we're faking MapReduce at the command line with a single mapper, so we'll skip the partition step.
+
 ## Map
 The mapper is an executable that reads input from stdin and writes output to stdout. Here's an example `map.py` which is part of a word count MapReduce program.
 ```python
@@ -109,7 +111,7 @@ def main():
         word, _, count = line.partition("\t")
         word_count[word] += int(count)
     for word, count in word_count.items():
-        print(f"{word}\t{count}")
+        print(f"{word} {count}")

 if __name__ == "__main__":
     main()
@@ -151,9 +153,9 @@ The reduce output format is up to the programmer. Here's how to run the whole w
 $ cat input/input* | python3 map.py | sort | python3 reduce.py
 Bye 1
 Goodbye 1
-Hadoop 2
-Hello 2
-World 2
+Hadoop 2
+Hello 2
+World 2
 ```

 ## `itertools.groupby()`
@@ -172,7 +174,7 @@ def main():
         word, _, count = line.partition("\t")
         word_count[word] += int(count)
     for word, count in word_count.items():
-        print(f"{word}\t{count}")
+        print(f"{word} {count}")
 ```

 If one reducer execution received one group, we could simplify the reducer and use only O(1) memory.
@@ -184,7 +186,7 @@ def reduce_one_group(key, group):
     for line in group:
         count = line.partition("\t")[2]
         word_count += int(count)
-    print(f"{key}\t{word_count}")
+    print(f"{key} {word_count}")
 ```

 If one reducer execution input contains multiple groups, how can we process one group at a time? We'll use `itertools.groupby()`.
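
The grouping step described in the hunk above can be sketched as follows. This is a minimal illustration, not code from this commit: the `keyfunc` and `reduce_groups` helpers and the sample lines are assumptions, and the input is assumed to be sorted by key already, as it would be after `sort`.

```python
import itertools


def keyfunc(line):
    """Return the key (text before the first tab) of a mapper output line."""
    return line.partition("\t")[0]


def reduce_groups(lines):
    """Sum the counts for each key, processing one group at a time.

    Assumes the lines are already sorted by key (hypothetical helper).
    """
    results = []
    for key, group in itertools.groupby(lines, keyfunc):
        # Each group is an iterator over consecutive lines sharing one key.
        total = sum(int(line.partition("\t")[2]) for line in group)
        results.append(f"{key} {total}")
    return results


# Hypothetical sorted mapper output, one "word<TAB>count" pair per line.
sample = ["Bye\t1\n", "Hello\t1\n", "Hello\t1\n", "World\t1\n"]
print("\n".join(reduce_groups(sample)))
```

Because `itertools.groupby()` only merges consecutive items with equal keys, the sketch relies on the sort step that precedes the reducer in the pipeline.
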
@@ -238,24 +240,28 @@ def reduce_one_group(key, group):
     for line in group:
         count = line.partition("\t")[2]
         word_count += int(count)
-    print(f"{key}\t{word_count}")
+    print(f"{key} {word_count}")
 ```

 Finally, we can run our entire MapReduce program.
 ```console
-$ cat input/* | ./map.py | sort| ./reduce.py
+$ cat input/* | ./map.py | sort | ./reduce.py
 Bye 1
 Goodbye 1
-Hadoop 2
-Hello 2
-World 2
+Hadoop 2
+Hello 2
+World 2
 ```

 ### Template reducer
 Here's a template you can copy-paste to get started on a reducer. The only part you need to edit is the `"IMPLEMENT ME"` line.
 ```python
 #!/usr/bin/env python3
-"""Word count reducer."""
+"""
+Template reducer.
+
+https://github.com/eecs485staff/madoop/blob/main/README_Hadoop_Streaming.md
+"""
 import sys
 import itertools

madoop/mapreduce.py

Lines changed: 1 addition & 0 deletions
@@ -156,6 +156,7 @@ def is_executable(exe):
     result in difficult-to-understand error messages.

     """
+    exe = pathlib.Path(exe).resolve()
     try:
         subprocess.run(
             str(exe),
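
For context on the added line: `pathlib.Path.resolve()` returns an absolute path, so the string passed on always contains a directory component. On POSIX systems, a program name that contains a slash is not searched for on `PATH`. The sketch below only illustrates that behavior; `to_absolute()` and the relative path `example/map.py` are hypothetical, and this is not the `is_executable()` implementation from the commit.

```python
import pathlib


def to_absolute(exe):
    """Return an absolute version of exe, resolved against the current directory.

    Hypothetical helper for illustration only.
    """
    return pathlib.Path(exe).resolve()


# Hypothetical relative path; resolve() does not require the file to exist.
exe = to_absolute("example/map.py")
print(exe.is_absolute())
print(exe.name)
```
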

setup.py

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@
     description="A light weight MapReduce framework for education.",
     long_description=LONG_DESCRIPTION,
     long_description_content_type="text/markdown",
-    version="1.0.1",
+    version="1.0.2",
     author="Andrew DeOrio",
     author_email="[email protected]",
     url="https://github.com/eecs485staff/madoop/",
