Merge remote-tracking branch 'origin/develop'

awdeorio · awdeorio · commit 5067e949421b · 2022-11-07T10:29:19.000-05:00
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -5,7 +5,7 @@ Contributing to Madoop
 Set up a development virtual environment.
 ```console
 $ python3 -m venv .venv
-$ source env/bin/activate
+$ source .venv/bin/activate
 $ pip install --editable .[dev,test]
 ```
 
diff --git a/README.md b/README.md
@@ -7,7 +7,7 @@ Madoop: Michigan Hadoop
 
 Michigan Hadoop (`madoop`) is a light weight MapReduce framework for education.  Madoop implements the [Hadoop Streaming](https://hadoop.apache.org/docs/r1.2.1/streaming.html) interface.  Madoop is implemented in Python and runs on a single machine.
 
-For an in-depth explanation of how to write MapReduce programs in Python for Hadoop Streaming, see our [Hadoop Streaming tutorial](README_hadoop_streaming.md).
+For an in-depth explanation of how to write MapReduce programs in Python for Hadoop Streaming, see our [Hadoop Streaming tutorial](README_Hadoop_Streaming.md).
 
 
 ## Quick start
diff --git a/README_Hadoop_Streaming.md b/README_Hadoop_Streaming.md
@@ -22,12 +22,11 @@ example
 
 Execute the example MapReduce program using Madoop and show the output.
 ```console
-$ cd example
 $ madoop \
-  -input input \
-  -output output \
-  -mapper map.py \
-  -reducer reduce.py
+  -input example/input \
+  -output example/output \
+  -mapper example/map.py \
+  -reducer example/reduce.py
   
 $ cat output/part-*
 Goodbye 1
@@ -40,6 +39,9 @@ Hello 2
 ## Overview
 [Hadoop Streaming](https://hadoop.apache.org/docs/r1.2.1/streaming.html) is a MapReduce API that works with any programming language.  The mapper and the reducer are executables that read input from stdin and write output to stdout.
 
+## Partition
+The MapReduce framework begins by partitioning (splitting) the input. If the input size is large, a real MapReduce framework will break it up into smaller chunks. Each Map execution will process one input partition. In this tutorial, we're faking MapReduce at the command line with a single mapper, so we'll skip the partition step.
+
 ## Map
 The mapper is an executable that reads input from stdin and writes output to stdout.  Here's an example `map.py` which is part of a word count MapReduce program.
 ```python
@@ -109,7 +111,7 @@ def main():
         word, _, count = line.partition("\t")
         word_count[word] += int(count)
     for word, count in word_count.items():
-        print(f"{word}\t{count}")
+        print(f"{word} {count}")
 
 if __name__ == "__main__":
     main()
@@ -151,9 +153,9 @@ The reduce output format is up to the programmer.  Here's how to run the whole w
 $ cat input/input* | python3 map.py | sort | python3 reduce.py
 Bye	1
 Goodbye	1
-Hadoop	2
-Hello	2
-World	2
+Hadoop 2
+Hello 2
+World 2
 ```
 
 ## `itertools.groupby()`
@@ -172,7 +174,7 @@ def main():
         word, _, count = line.partition("\t")
         word_count[word] += int(count)
     for word, count in word_count.items():
-        print(f"{word}\t{count}")
+        print(f"{word} {count}")
 ```
 
 If one reducer execution received one group, we could simplify the reducer and use only O(1) memory.
@@ -184,7 +186,7 @@ def reduce_one_group(key, group):
     for line in group:
         count = line.partition("\t")[2]
         word_count += int(count)
-    print(f"{key}\t{word_count}")
+    print(f"{key} {word_count}")
 ```
 
 If one reducer execution input contains multiple groups, how can we process one group at a time?  We'll use `itertools.groupby()`.
@@ -238,24 +240,28 @@ def reduce_one_group(key, group):
     for line in group:
         count = line.partition("\t")[2]
         word_count += int(count)
-    print(f"{key}\t{word_count}")
+    print(f"{key} {word_count}")
 ```
 
 Finally, we can run our entire MapReduce program.
 ```console
-$ cat input/* | ./map.py | sort| ./reduce.py
+$ cat input/* | ./map.py | sort | ./reduce.py
 Bye	1
 Goodbye	1
-Hadoop	2
-Hello	2
-World	2
+Hadoop 2
+Hello 2
+World 2
 ```
 
 ### Template reducer
 Here's a template you can copy-paste to get started on a reducer.  The only part you need to edit is the `"IMPLEMENT ME"` line.
 ```python
 #!/usr/bin/env python3
-"""Word count reducer."""
+"""
+Template reducer.
+
+https://github.com/eecs485staff/madoop/blob/main/README_Hadoop_Streaming.md
+"""
 import sys
 import itertools
 
diff --git a/madoop/mapreduce.py b/madoop/mapreduce.py
@@ -156,6 +156,7 @@ def is_executable(exe):
     result in difficult-to-understand error messages.
 
     """
+    exe = pathlib.Path(exe).resolve()
     try:
         subprocess.run(
             str(exe),
diff --git a/setup.py b/setup.py
@@ -14,7 +14,7 @@
     description="A light weight MapReduce framework for education.",
     long_description=LONG_DESCRIPTION,
     long_description_content_type="text/markdown",
-    version="1.0.1",
+    version="1.0.2",
     author="Andrew DeOrio",
     author_email="awdeorio@umich.edu",
     url="https://github.com/eecs485staff/madoop/",
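
Note on the `madoop/mapreduce.py` hunk: the commit does not say why `is_executable()` now calls `pathlib.Path(exe).resolve()`. One plausible motivation is that, on POSIX systems, a command with no directory component is looked up on `PATH` rather than in the current directory, so a bare name such as `map.py` may not be found. The sketch below illustrates that behavior under this assumption. The helper names `runs_ok` and `demo` and the script name `hypothetical_map_demo.sh` are made up for illustration and are not part of Madoop.

```python
import os
import pathlib
import subprocess
import tempfile


def runs_ok(exe):
    """Return True if `exe` can be launched as a subprocess and exits 0."""
    try:
        subprocess.run(
            str(exe),
            check=True,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except (OSError, subprocess.CalledProcessError):
        return False
    return True


def demo():
    """Launch a script by bare name, then by resolved absolute path."""
    original_cwd = pathlib.Path.cwd()
    with tempfile.TemporaryDirectory() as tmpdir:
        # Hypothetical stand-in for a mapper executable (POSIX shell script).
        script = pathlib.Path(tmpdir) / "hypothetical_map_demo.sh"
        script.write_text("#!/bin/sh\nexit 0\n")
        script.chmod(0o755)
        os.chdir(tmpdir)
        try:
            bare = pathlib.Path(script.name)  # no directory component
            # A bare name is searched for on PATH, not in the current directory.
            bare_ok = runs_ok(bare)
            # resolve() yields an absolute path, so no PATH lookup is needed.
            resolved_ok = runs_ok(bare.resolve())
        finally:
            # Leave the temporary directory before it is deleted.
            os.chdir(original_cwd)
    return bare_ok, resolved_ok


if __name__ == "__main__":
    print(demo())
```

On a POSIX system where the current directory is not on `PATH`, the bare name is expected to fail to launch while the resolved path succeeds. A path that already contains a slash, such as `example/map.py`, is not subject to the `PATH` lookup, so resolving it mainly guards against later changes of working directory.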