This software depends on:
- [pakozm/luamongo](https://github.com/pakozm/luamongo/), a fork of
  [moai/luamongo](https://github.com/moai/luamongo) for Lua 5.2, with minor
  improvements.

Installation
------------

Copy the `mapreduce` directory to a place visible from your `LUA_PATH`
environment variable. Likewise, in order to run the examples, the `examples`
directory needs to be visible through your `LUA_PATH`. You can add the current
directory by writing in the terminal:

```
$ export LUA_PATH='?.lua;?/init.lua'
```
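
Once `LUA_PATH` is set this way, `require` resolves a module name against those
patterns by substituting the name (with dots turned into directory separators)
for each `?`. A quick sanity check using plain Lua's `package.searchpath`
(assuming Lua 5.2+ on a Unix-like system):

```Lua
-- How require resolves a name against LUA_PATH: package.searchpath
-- substitutes "mapreduce/utils" for each '?' in the patterns below.
local path = "?.lua;?/init.lua"
local file, err = package.searchpath("mapreduce.utils", path)
if file then
  print("found at " .. file)        -- mapreduce/utils.lua, if it exists
else
  print("candidates tried:" .. err) -- lists mapreduce/utils.lua and
                                    -- mapreduce/utils/init.lua
end
```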

Usage
-----

Two Lua scripts have been prepared for quickly running the software:

- `execute_server.lua` runs the master server for your map-reduce operation.
  Only **one instance** of this script is needed. Note that this software
  receives the **map-reduce task** split into several Lua modules. These
  modules must be visible in the `LUA_PATH` of the server and of all the
  workers that you execute. This script receives 7 mandatory arguments:

  1. The connection string, normally `localhost` or `localhost:21707`.
  2. The name of the database where the work will be done.
  3. A Lua module which contains the **task** function data.
  4. A Lua module which contains the **map** function data.
  5. A Lua module which contains the **partition** function data.
  6. A Lua module which contains the **reduce** function data.
  7. A Lua module which contains the **final** function data.

- `execute_worker.lua` runs the worker, which by default is configured to
  execute one map-reduce task and then finish. One task doesn't mean one job:
  a **map-reduce task** is performed as several individual **map/reduce
  jobs**. A worker waits until all the possible map or reduce jobs are
  completed before considering a task finished. This script receives two
  arguments:

  1. The connection string, as above.
  2. The name of the database where the work will be done, as above.

A simple word-count example is available in the repository. Two shell scripts,
`execute_server_example.sh` and `execute_worker_example.sh`, are ready to run
the word-count example on a single machine, with one or more worker instances.
The execution of the example looks like this:

**SERVER**
```
$ ./execute_example_server.sh > output
# Preparing MAP
# MAP execution
  100.0 %
# Preparing REDUCE
# MERGE AND PARTITIONING
  100.0 %
# CREATING JOBS
# STARTING REDUCE
# REDUCE execution
  100.0 %
# FINAL execution
```

**WORKER**
```
$ ./execute_example_worker.sh
# NEW TASK READY
# EXECUTING MAP JOB _id: "1"
# FINISHED
# EXECUTING MAP JOB _id: "2"
# FINISHED
# EXECUTING MAP JOB _id: "3"
# FINISHED
# EXECUTING MAP JOB _id: "4"
# FINISHED
# EXECUTING REDUCE JOB _id: "121"
# FINISHED
# EXECUTING REDUCE JOB _id: "37"
# FINISHED
...
```

Map-reduce task example: word-count
-----------------------------------

The example is composed of one Lua module for each of the map-reduce
functions, available in the directory `examples/WordCount/`. All the modules
have the same structure: they return a Lua table with two fields:

- an **init** function, which receives a table of arguments and allows you to
  configure your module options, in case you need any;

- a **func** function, which implements the necessary Lua code.

A map-reduce task is divided, at least, into the following modules:

- **taskfn.lua** is the script which defines how the data is divided in order
  to create **map jobs**. The **func** field is executed as a Lua *coroutine*,
  so every map job will be created by calling `coroutine.yield(key,value)`.

  ```Lua
  -- arg is for configuration purposes; it is allowed in any of the scripts
  local init = function(arg)
    -- do whatever you need for initialization, parametrized by the arg table
  end
  return {
    init = init,
    func = function()
      coroutine.yield(1,"mapreduce/server.lua")
      coroutine.yield(2,"mapreduce/worker.lua")
      coroutine.yield(3,"mapreduce/test.lua")
      coroutine.yield(4,"mapreduce/utils.lua")
    end
  }
  ```
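
The coroutine protocol means the server pulls one `(key,value)` pair at a time
from `func`; stripped of the framework, the mechanism is just `coroutine.wrap`
used as an iterator (a plain-Lua sketch, not the server's actual code):

```Lua
-- The server-side view of taskfn: each yield becomes one map job.
local gen = coroutine.wrap(function()
  coroutine.yield(1, "mapreduce/server.lua")
  coroutine.yield(2, "mapreduce/worker.lua")
end)
local jobs = {}
for key, value in gen do jobs[key] = value end
-- jobs now maps every job key to the value the map function will receive
```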

- **mapfn.lua** is the script where the map function is implemented. The
  **func** field is executed as a standard Lua function, and receives two
  arguments `(key,value)` generated by one of the yields in your `taskfn`
  script. Map results are produced by calling the global function
  `emit(key,value)`.

  ```Lua
  return {
    init = function() end,
    func = function(key,value)
      for line in io.lines(value) do
        for w in line:gmatch("[^%s]+") do
          emit(w,1)
        end
      end
    end
  }
  ```
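
The pattern `[^%s]+` matches maximal runs of non-whitespace characters, so
each line is tokenized into words; for example:

```Lua
-- Tokenizing one line the way the map function above does.
local words = {}
for w in ("to be or not to be"):gmatch("[^%s]+") do
  words[#words + 1] = w
end
-- words = { "to", "be", "or", "not", "to", "be" }
```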

- **partitionfn.lua** is the script which describes how the map results are
  grouped and partitioned in order to create **reduce jobs**. The **func**
  field is a hash function which receives an emitted key and returns an
  integer. Depending on your hash function, more or fewer reducers will be
  needed.

  ```Lua
  return {
    init = function() end,
    func = function(key)
      return key:byte(#key) -- last character (numeric byte)
    end
  }
  ```
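
With this particular hash, any two keys that end in the same character are
assigned to the same reduce partition (a quick check outside the framework):

```Lua
local hash = function(key) return key:byte(#key) end
-- "cat" and "hot" both end in 't', so they share a reducer...
print(hash("cat") == hash("hot")) -- true
-- ...while "dog" ends in 'g' and lands in a different one
print(hash("cat") == hash("dog")) -- false
```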

- **reducefn.lua** is the script which implements the reduce function. The
  **func** field is a function which receives a pair `(key,values)` where
  `key` is one of the emitted keys and `values` is a Lua array (a table with
  integer, sequential keys starting at 1) with all the available map values
  for the given key. The system could reuse the reduce function several
  times, so it must be idempotent. The reduce results will be grouped
  following the partition function. For each possible partition, a GridFS
  file will be created in a collection called `dbname_fs`, where `dbname` is
  the database name defined above.

  ```Lua
  return {
    init = function() end,
    func = function(key,values)
      local count = 0
      for _,v in ipairs(values) do count = count + v end
      return count
    end
  }
  ```
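
Because the system may feed the output of a previous reduction back in as a
value, a reducer should give the same result whether it sees all the values at
once or partial results of earlier reductions. The word-count reducer has this
property, which is easy to check in isolation:

```Lua
local reduce = function(key, values)
  local count = 0
  for _, v in ipairs(values) do count = count + v end
  return count
end

-- One pass over all the values...
local whole  = reduce("the", {1, 1, 1, 1})
-- ...matches two partial reductions whose outputs are reduced again.
local staged = reduce("the", { reduce("the", {1, 1}), reduce("the", {1, 1}) })
print(whole, staged) -- 4   4
```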

- **finalfn.lua** is the script which implements how to take the results
  produced by the system. The **func** field is a function which receives a
  Lua pairs iterator, and returns a boolean indicating whether or not to
  destroy the GridFS collection data. If the returned value is `true`, the
  results will be removed. If the returned value is `false` or `nil`, the
  results will be available after the execution of your map-reduce task.

  ```Lua
  return {
    init = function() end,
    func = function(it)
      for key,value in it do
        print(value,key)
      end
      return true -- indicates to remove the mongo GridFS result files
    end
  }
  ```
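
To see how the five modules fit together, here is a minimal in-process
simulation of the whole task/map/partition/reduce/final flow. It word-counts
two inline strings instead of files and is an illustration only; the real
system distributes these stages through MongoDB:

```Lua
-- task: yield one (key,value) pair per map job
local task = coroutine.wrap(function()
  coroutine.yield(1, "the quick brown fox")
  coroutine.yield(2, "the lazy dog")
end)

-- emit collects map results, grouping all values under their key
local groups = {}
local function emit(k, v)
  groups[k] = groups[k] or {}
  groups[k][#groups[k] + 1] = v
end

-- map: split each value into words and emit (word, 1)
for _, value in task do
  for w in value:gmatch("[^%s]+") do emit(w, 1) end
end

-- partition: assign each key to a bucket (last byte, as in partitionfn.lua)
local partitions = {}
for k in pairs(groups) do
  local p = k:byte(#k)
  partitions[p] = partitions[p] or {}
  partitions[p][#partitions[p] + 1] = k
end

-- reduce: sum the grouped values of every key
local result = {}
for k, values in pairs(groups) do
  local count = 0
  for _, v in ipairs(values) do count = count + v end
  result[k] = count
end

-- final: consume the (key,count) results
for k, v in pairs(result) do print(v, k) end -- "the" counts 2, the rest 1
```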

Last notes
----------

This software is under development. More documentation will be added to the
wiki pages as we find time to write it. Collaboration is open, and all your
contributions are welcome.