eecs485staff
diff --git a/.github/workflows/continuous_integration.yml b/.github/workflows/continuous_integration.yml
Lines changed: 1 addition & 1 deletion
diff --git a/.gitignore b/.gitignore
Lines changed: 1 addition & 2 deletions
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
Lines changed: 2 additions & 2 deletions
diff --git a/MANIFEST.in b/MANIFEST.in
Lines changed: 1 addition & 1 deletion
diff --git a/README.md b/README.md
Lines changed: 94 additions & 6 deletions
diff --git a/example/input/input01.txt b/example/input/input01.txt
Lines changed: 2 additions & 0 deletions
diff --git a/example/input/input02.txt b/example/input/input02.txt
Lines changed: 2 additions & 0 deletions
diff --git a/example/map.py b/example/map.py
Lines changed: 9 additions & 0 deletions
diff --git a/example/reduce.py b/example/reduce.py
Lines changed: 28 additions & 0 deletions
diff --git a/madoop/__init__.py b/madoop/__init__.py
Lines changed: 6 additions & 2 deletions
@@ -4,7 +4,7 @@ name: CI
 # Define conditions for when to run this action
 on:
   pull_request: # Run on all pull requests
-  push: # Run on all pushes to master
+  push: # Run on all pushes to main
     branches:
       - main
       - develop
 
@@ -19,7 +19,6 @@ build/
 
 # Test tools
 /.tox/
-/tmp/
 /.coverage*
 *,cover
 
@@ -35,4 +34,4 @@ build/
 *.DS_Store
 
 # Output directories
-*output
+/output
@@ -1,5 +1,5 @@
-Contributing to michigan-hadoop
-========================
+Contributing to Madoop
+======================
 
 ## Install development environment
 Set up a development virtual environment.
 
@@ -4,7 +4,7 @@ include README.md
 include CONTRIBUTING.md
 include .pylintrc
 graft tests
-include test
+graft example
 
 # Avoid dev and binary files
 exclude tox.ini
 
@@ -1,18 +1,106 @@
-Michigan Hadoop CLI
-=================
-
-Michigan Hadoop CLI (`madoop`) is a light weight, Python-based Hadoop command line interface.
+Madoop: Michigan Hadoop
+=======================
 
+Michigan Hadoop (`madoop`) is a lightweight MapReduce framework for education.  Madoop implements the [Hadoop Streaming](https://hadoop.apache.org/docs/r1.2.1/streaming.html) interface.  Madoop is implemented in Python and runs on a single machine.
 
 ## Quick start
+Install and run an example word count MapReduce program.
+```console
+$ pip install madoop
+$ madoop \
+  -input example/input \
+  -output output \
+  -mapper example/map.py \
+  -reducer example/reduce.py
+$ cat output/part-*
+autograder	2
+world	1
+eecs485	1
+goodbye	1
+hello	3
+```
 
+
+## Example
+We'll walk through the example in the Quick Start again, providing more detail.  For an in-depth explanation of the map and reduce code, see the [Hadoop Streaming tutorial](https://eecs485staff.github.io/p5-search-engine/hadoop_streaming.html).
+
+### Install
+Install Madoop.  Your version might be different.
 ```console
 $ pip install madoop
-$ madoop
+$ madoop --version
+Madoop 0.1.0
 ```
 
+### Input
+We've provided two small input files.
+```console
+$ cat example/input/input01.txt
+hello world
+hello eecs485
+$ cat example/input/input02.txt
+goodbye autograder
+hello autograder
+```
+
+### Run
+Run a MapReduce word count job.  By default, there will be one mapper for each input file.  Large input files may be segmented and processed by multiple mappers.
+- `-input DIRECTORY` input directory
+- `-output DIRECTORY` output directory
+- `-mapper FILE` mapper executable
+- `-reducer FILE` reducer executable
+```console
+$ madoop \
+    -input example/input \
+    -output output \
+    -mapper example/map.py \
+    -reducer example/reduce.py
+```
+
+### Output
+Concatenate and print output.  The concatenation of multiple output files may not be sorted.
+```console
+$ ls output
+part-00000  part-00001  part-00002  part-00003
+$ cat output/part-*
+autograder	2
+world	1
+eecs485	1
+goodbye	1
+hello	3
+```
+
+## Comparison with Apache Hadoop and CLI
+Madoop implements a subset of the [Hadoop Streaming](https://hadoop.apache.org/docs/r1.2.1/streaming.html) interface.  You can simulate the Hadoop Streaming interface at the command line with `cat` and `sort`.
+
+Here's how to run our example MapReduce program on Apache Hadoop.
+```console
+$ hadoop \
+    jar path/to/hadoop-streaming-X.Y.Z.jar \
+    -input example/input \
+    -output output \
+    -mapper example/map.py \
+    -reducer example/reduce.py
+$ cat output/part-*
+```
+
+Here's how to run our example MapReduce program at the command line using `cat` and `sort`.
+```console
+$ cat example/input/* | example/map.py | sort | example/reduce.py
+```
+
+| Madoop | Hadoop | `cat`/`sort` |
+|-|-|-|
+| Some Hadoop options | All Hadoop options | No Hadoop options |
+| Multiple mappers and reducers | Multiple mappers and reducers | One mapper, one reducer |
+| Single machine | Many machines | Single machine |
+| `jar hadoop-streaming-X.Y.Z.jar` argument ignored | `jar hadoop-streaming-X.Y.Z.jar` argument required | No arguments |
+| Lines within a group are sorted | Lines within a group are sorted | Lines within a group are sorted |
+
+
 ## Contributing
 Contributions from the community are welcome! Check out the [guide for contributing](CONTRIBUTING.md).
 
+
 ## Acknowledgments
-Michigan Hadoop CLI is written by Andrew DeOrio <[email protected]>.
+Michigan Hadoop is written by Andrew DeOrio <[email protected]>.
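
The README hunk above notes that concatenating several `part-*` files may not yield sorted output. One way to see why is the sketch below, which assumes keys are assigned to reducers by a stable hash modulo the reducer count. That scheme, the `partition` helper, and `NUM_REDUCERS` are illustrative assumptions; this diff does not show how Madoop actually partitions keys.

```python
import zlib

# Assumption: four reducers, matching the four part-* files in the README.
NUM_REDUCERS = 4


def partition(keys, num_reducers=NUM_REDUCERS):
    """Assign each key to one reducer with a stable hash (hypothetical scheme)."""
    parts = [[] for _ in range(num_reducers)]
    for key in keys:
        index = zlib.crc32(key.encode("utf-8")) % num_reducers
        parts[index].append(key)
    # Each reducer receives its own keys in sorted order.
    return [sorted(part) for part in parts]


KEYS = ["hello", "world", "eecs485", "goodbye", "autograder"]
PARTS = partition(KEYS)
# Equivalent of `cat output/part-*`: parts are joined in file order.
CONCATENATED = [key for part in PARTS for key in part]

if __name__ == "__main__":
    for index, part in enumerate(PARTS):
        print(f"part-{index:05d}: {part}")
    print("globally sorted:", CONCATENATED == sorted(CONCATENATED))
```

Each part is sorted on its own, but the concatenation follows part order rather than key order, so it is sorted only by coincidence.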
@@ -0,0 +1,2 @@
+hello world
+hello eecs485
@@ -0,0 +1,2 @@
+goodbye autograder
+hello autograder
@@ -0,0 +1,9 @@
+#!/usr/bin/env python3
+"""Word count mapper."""
+import sys
+
+
+for line in sys.stdin:
+    words = line.split()
+    for word in words:
+        print(f"{word}\t1")
@@ -0,0 +1,28 @@
+#!/usr/bin/env python3
+"""Word count reducer."""
+import sys
+import itertools
+
+
+def main():
+    """Divide sorted lines into groups that share a key."""
+    for key, group in itertools.groupby(sys.stdin, keyfunc):
+        reduce_one_group(key, group)
+
+
+def keyfunc(line):
+    """Return the key from a TAB-delimited key-value pair."""
+    return line.partition("\t")[0]
+
+
+def reduce_one_group(key, group):
+    """Reduce one group."""
+    word_count = 0
+    for line in group:
+        count = line.partition("\t")[2]
+        word_count += int(count)
+    print(f"{key}\t{word_count}")
+
+
+if __name__ == "__main__":
+    main()
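
The mapper and reducer above can also be exercised without a shell. The sketch below mirrors the `cat | map | sort | reduce` pipeline from the README in a single process, with the two example input files inlined as a list. `map_lines`, `reduce_lines`, and `word_count` are hypothetical names used for illustration, and Python's `sorted` stands in for `sort`, so ordering may differ from a locale-aware shell sort.

```python
import itertools

# Contents of example/input/input01.txt and input02.txt from this diff.
INPUT_LINES = [
    "hello world\n",
    "hello eecs485\n",
    "goodbye autograder\n",
    "hello autograder\n",
]


def map_lines(lines):
    """Yield one TAB-delimited "word 1" line per word, like example/map.py."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1\n"


def keyfunc(line):
    """Return the key from a TAB-delimited key-value pair."""
    return line.partition("\t")[0]


def reduce_lines(sorted_lines):
    """Sum the counts in each group of adjacent lines that share a key."""
    for key, group in itertools.groupby(sorted_lines, keyfunc):
        total = sum(int(line.partition("\t")[2]) for line in group)
        yield f"{key}\t{total}"


def word_count(lines):
    """Run map, sort, and reduce in sequence, like the shell pipeline."""
    return list(reduce_lines(sorted(map_lines(lines))))


if __name__ == "__main__":
    for output_line in word_count(INPUT_LINES):
        print(output_line)
```

With the example inputs, this should give the same five counts as the README output, in globally sorted order because only one reducer is simulated.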
@@ -1,3 +1,7 @@
-"""Michigan Hadoop CLI API."""
+"""Madoop API.
 
-from .__main__ import hadoop
+Andrew DeOrio <[email protected]>
+
+"""
+from .mapreduce import mapreduce
+from .exceptions import MadoopError