This software depends on:
- [pakozm/luamongo](https://github.com/pakozm/luamongo/), a fork of
  [moai/luamongo](https://github.com/moai/luamongo) for Lua 5.2, with minor
  improvements.

Installation
------------

Copy the `mapreduce` directory to a place visible from your `LUA_PATH`
environment variable. Likewise, in order to run the examples, the `examples`
directory needs to be visible through your `LUA_PATH`. You can add the current
directory by writing in the terminal:

```
$ export LUA_PATH='?.lua;?/init.lua'
```
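
Once `LUA_PATH` is set this way, `require` resolves a module name against those
patterns by substituting the name (with dots turned into directory separators)
for each `?`. A quick sanity check using plain Lua's `package.searchpath`
(assuming Lua 5.2+ on a Unix-like system):

```Lua
-- How require resolves a name against LUA_PATH: package.searchpath
-- substitutes "mapreduce/utils" for each '?' in the patterns below.
local path = "?.lua;?/init.lua"
local file, err = package.searchpath("mapreduce.utils", path)
if file then
  print("found at " .. file)        -- mapreduce/utils.lua, if it exists
else
  print("candidates tried:" .. err) -- lists mapreduce/utils.lua and
                                    -- mapreduce/utils/init.lua
end
```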

Usage
-----

Two Lua scripts have been prepared for quickly running the software:

- `execute_server.lua` runs the master server for your map-reduce operation.
  Only **one instance** of this script is needed. Note that this software
  receives the **map-reduce task** split into several Lua modules. These
  modules must be visible in the `LUA_PATH` of the server and of all the
  workers that you execute. This script receives 7 mandatory arguments:

  1. The connection string, normally `localhost` or `localhost:21707`.
  2. The name of the database where the work will be done.
  3. A Lua module which contains the **task** function data.
  4. A Lua module which contains the **map** function data.
  5. A Lua module which contains the **partition** function data.
  6. A Lua module which contains the **reduce** function data.
  7. A Lua module which contains the **final** function data.

- `execute_worker.lua` runs the worker, which by default is configured to
  execute one map-reduce task and then finish. One task doesn't mean one job:
  a **map-reduce task** is performed as several individual **map/reduce
  jobs**. A worker waits until all the possible map or reduce jobs are
  completed before considering a task finished. This script receives two
  arguments:

  1. The connection string, as above.
  2. The name of the database where the work will be done, as above.

A simple word-count example is available in the repository. Two shell scripts,
`execute_server_example.sh` and `execute_worker_example.sh`, are ready to run
the word-count example on a single machine, with one or more worker instances.
The execution of the example looks like this:

**SERVER**
```
$ ./execute_example_server.sh > output
# Preparing MAP
# MAP execution
  100.0 %
# Preparing REDUCE
# MERGE AND PARTITIONING
  100.0 %
# CREATING JOBS
# STARTING REDUCE
# REDUCE execution
  100.0 %
# FINAL execution
```

**WORKER**
```
$ ./execute_example_worker.sh
# NEW TASK READY
# EXECUTING MAP JOB _id: "1"
# FINISHED
# EXECUTING MAP JOB _id: "2"
# FINISHED
# EXECUTING MAP JOB _id: "3"
# FINISHED
# EXECUTING MAP JOB _id: "4"
# FINISHED
# EXECUTING REDUCE JOB _id: "121"
# FINISHED
# EXECUTING REDUCE JOB _id: "37"
# FINISHED
...
```

Map-reduce task example: word-count
-----------------------------------

The example is composed of one Lua module for each of the map-reduce
functions, available in the directory `examples/WordCount/`. All the modules
have the same structure: they return a Lua table with two fields:

- an **init** function, which receives a table of arguments and allows you to
  configure your module options, in case you need any;

- a **func** function, which implements the necessary Lua code.

A map-reduce task is divided, at least, into the following modules:

- **taskfn.lua** is the script which defines how the data is divided in order
  to create **map jobs**. The **func** field is executed as a Lua *coroutine*,
  so every map job will be created by calling `coroutine.yield(key,value)`.

  ```Lua
  -- arg is for configuration purposes; it is allowed in any of the scripts
  local init = function(arg)
    -- do whatever you need for initialization, parametrized by the arg table
  end
  return {
    init = init,
    func = function()
      coroutine.yield(1,"mapreduce/server.lua")
      coroutine.yield(2,"mapreduce/worker.lua")
      coroutine.yield(3,"mapreduce/test.lua")
      coroutine.yield(4,"mapreduce/utils.lua")
    end
  }
  ```
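
The coroutine protocol means the server pulls one `(key,value)` pair at a time
from `func`; stripped of the framework, the mechanism is just `coroutine.wrap`
used as an iterator (a plain-Lua sketch, not the server's actual code):

```Lua
-- The server-side view of taskfn: each yield becomes one map job.
local gen = coroutine.wrap(function()
  coroutine.yield(1, "mapreduce/server.lua")
  coroutine.yield(2, "mapreduce/worker.lua")
end)
local jobs = {}
for key, value in gen do jobs[key] = value end
-- jobs now maps every job key to the value the map function will receive
```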

- **mapfn.lua** is the script where the map function is implemented. The
  **func** field is executed as a standard Lua function, and receives two
  arguments `(key,value)` generated by one of the yields in your `taskfn`
  script. Map results are produced by calling the global function
  `emit(key,value)`.

  ```Lua
  return {
    init = function() end,
    func = function(key,value)
      for line in io.lines(value) do
        for w in line:gmatch("[^%s]+") do
          emit(w,1)
        end
      end
    end
  }
  ```
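
The pattern `[^%s]+` matches maximal runs of non-whitespace characters, so
each line is tokenized into words; for example:

```Lua
-- Tokenizing one line the way the map function above does.
local words = {}
for w in ("to be or not to be"):gmatch("[^%s]+") do
  words[#words + 1] = w
end
-- words = { "to", "be", "or", "not", "to", "be" }
```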

- **partitionfn.lua** is the script which describes how the map results are
  grouped and partitioned in order to create **reduce jobs**. The **func**
  field is a hash function which receives an emitted key and returns an
  integer. Depending on your hash function, more or fewer reducers will be
  needed.

  ```Lua
  return {
    init = function() end,
    func = function(key)
      return key:byte(#key) -- last character (numeric byte)
    end
  }
  ```
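
With this particular hash, any two keys that end in the same character are
assigned to the same reduce partition (a quick check outside the framework):

```Lua
local hash = function(key) return key:byte(#key) end
-- "cat" and "hot" both end in 't', so they share a reducer...
print(hash("cat") == hash("hot")) -- true
-- ...while "dog" ends in 'g' and lands in a different one
print(hash("cat") == hash("dog")) -- false
```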

- **reducefn.lua** is the script which implements the reduce function. The
  **func** field is a function which receives a pair `(key,values)` where
  `key` is one of the emitted keys and `values` is a Lua array (a table with
  integer, sequential keys starting at 1) with all the available map values
  for the given key. The system could reuse the reduce function several
  times, so it must be idempotent. The reduce results will be grouped
  following the partition function. For each possible partition, a GridFS
  file will be created in a collection called `dbname_fs`, where `dbname` is
  the database name defined above.

  ```Lua
  return {
    init = function() end,
    func = function(key,values)
      local count = 0
      for _,v in ipairs(values) do count = count + v end
      return count
    end
  }
  ```
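
Because the system may feed the output of a previous reduction back in as a
value, a reducer should give the same result whether it sees all the values at
once or partial results of earlier reductions. The word-count reducer has this
property, which is easy to check in isolation:

```Lua
local reduce = function(key, values)
  local count = 0
  for _, v in ipairs(values) do count = count + v end
  return count
end

-- One pass over all the values...
local whole  = reduce("the", {1, 1, 1, 1})
-- ...matches two partial reductions whose outputs are reduced again.
local staged = reduce("the", { reduce("the", {1, 1}), reduce("the", {1, 1}) })
print(whole, staged) -- 4   4
```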

- **finalfn.lua** is the script which implements how to take the results
  produced by the system. The **func** field is a function which receives a
  Lua pairs iterator, and returns a boolean indicating whether or not to
  destroy the GridFS collection data. If the returned value is `true`, the
  results will be removed. If the returned value is `false` or `nil`, the
  results will be available after the execution of your map-reduce task.

  ```Lua
  return {
    init = function() end,
    func = function(it)
      for key,value in it do
        print(value,key)
      end
      return true -- indicates to remove the mongo GridFS result files
    end
  }
  ```
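
To see how the five modules fit together, here is a minimal in-process
simulation of the whole task/map/partition/reduce/final flow. It word-counts
two inline strings instead of files and is an illustration only; the real
system distributes these stages through MongoDB:

```Lua
-- task: yield one (key,value) pair per map job
local task = coroutine.wrap(function()
  coroutine.yield(1, "the quick brown fox")
  coroutine.yield(2, "the lazy dog")
end)

-- emit collects map results, grouping all values under their key
local groups = {}
local function emit(k, v)
  groups[k] = groups[k] or {}
  groups[k][#groups[k] + 1] = v
end

-- map: split each value into words and emit (word, 1)
for _, value in task do
  for w in value:gmatch("[^%s]+") do emit(w, 1) end
end

-- partition: assign each key to a bucket (last byte, as in partitionfn.lua)
local partitions = {}
for k in pairs(groups) do
  local p = k:byte(#k)
  partitions[p] = partitions[p] or {}
  partitions[p][#partitions[p] + 1] = k
end

-- reduce: sum the grouped values of every key
local result = {}
for k, values in pairs(groups) do
  local count = 0
  for _, v in ipairs(values) do count = count + v end
  result[k] = count
end

-- final: consume the (key,count) results
for k, v in pairs(result) do print(v, k) end -- "the" counts 2, the rest 1
```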

Last notes
----------

This software is under development. More documentation will be added to the
wiki pages as we find time to write it. Collaboration is open, and all your
contributions are welcome.