| 1 | +# ML4all: scalable ML system for everyone |
| 2 | + |
| 3 | + |
| 4 | +ML4all is a system that frees users from the burden of machine learning algorithm selection and low-level implementation details. |
| 5 | +It introduces an abstraction that can express most ML tasks, together with a cost-based optimizer built on top of it that chooses the best gradient descent algorithm for a given setting. |
| 6 | +Our results show that ML4all is more than two orders of magnitude faster than state-of-the-art systems and can process large datasets that previously could not be handled. |
| 7 | + |
| 8 | +More details can be found in our dedicated [SIGMOD publication](https://dl.acm.org/citation.cfm?id=3064042) and |
| 9 | +in Wayang's core [system paper](https://sigmodrecord.org/publications/sigmodRecord/2309/pdfs/05_Systems_Beedkar.pdf). |
| 10 | + |
| 11 | +## Abstraction |
| 12 | +ML4all abstracts most ML algorithms with seven operators: |
| 13 | + |
| 14 | +- (1) `Transform` receives a data point to transform |
| 15 | +(e.g., normalize it) and outputs a new data point. |
| 16 | + |
| 17 | +- (2) `Stage` initializes all the required global parameters |
| 18 | +(e.g., centroids for the k-means algorithm). |
| 19 | + |
| 20 | +- (3) `Compute` performs user-defined computations |
| 21 | +on the input data point and returns a new data |
| 22 | +point. For example, it can compute the nearest |
| 23 | +centroid for each input data point. |
| 24 | + |
| 25 | +- (4) `Update` updates the global parameters based on |
| 26 | +a user-defined formula. For example, it can recompute |
| 27 | +the centroids based on the output of |
| 28 | +the `Compute` operator. |
| 29 | + |
| 30 | +- (5) `Sample` takes as input the size of the desired |
| 31 | +sample and the data points to sample from and |
| 32 | +returns a reduced set of sampled data points. |
| 33 | + |
| 34 | +- (6) `Converge` specifies a function that outputs |
| 35 | +a convergence dataset required for determining |
| 36 | +whether the iterations should continue or stop. |
| 37 | + |
| 38 | +- (7) `Loop` specifies the stopping condition on the |
| 39 | +convergence dataset. |
| 40 | + |
| 41 | +Similar to MapReduce, where users implement a map and a reduce function, users of ML4all who want to develop their own algorithm should implement the operators above. |
| 42 | +The interfaces can be found in `org.apache.wayang.ml4all.abstraction.api`. |
| 43 | + |
| 44 | +Examples for KMeans clustering and stochastic gradient descent can be found in `org.apache.wayang.ml4all.algorithms`. |
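
The operator roles above can be sketched as a small, self-contained k-means loop. This is an illustration only: the class `Ml4allSketch`, its method names and signatures, and the choice to sample all points in every iteration are assumptions made for this sketch and do not reflect the actual interfaces in `org.apache.wayang.ml4all.abstraction.api`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of the seven operator roles for k-means (not the Wayang API). */
final class Ml4allSketch {

    // (1) Transform: parse a comma-separated line into a numeric data point.
    static double[] transform(String line) {
        return Arrays.stream(line.split(",")).mapToDouble(Double::parseDouble).toArray();
    }

    // (2) Stage: initialize the global parameters; here the first k points become centroids.
    static double[][] stage(List<double[]> points, int k) {
        double[][] centroids = new double[k][];
        for (int i = 0; i < k; i++) centroids[i] = points.get(i).clone();
        return centroids;
    }

    // (3) Compute: return the index of the nearest centroid (squared Euclidean distance).
    static int compute(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0;
            for (int j = 0; j < point.length; j++) {
                double diff = point[j] - centroids[c][j];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        return best;
    }

    // (4) Update: recompute each centroid as the mean of its assigned points.
    static double[][] update(List<double[]> points, int[] assignment, double[][] old) {
        int dim = old[0].length;
        double[][] next = new double[old.length][dim];
        int[] counts = new int[old.length];
        for (int i = 0; i < points.size(); i++) {
            counts[assignment[i]]++;
            for (int j = 0; j < dim; j++) next[assignment[i]][j] += points.get(i)[j];
        }
        for (int c = 0; c < old.length; c++)
            for (int j = 0; j < dim; j++)
                next[c][j] = counts[c] == 0 ? old[c][j] : next[c][j] / counts[c];
        return next;
    }

    // (5) Sample: return at most `size` randomly chosen points.
    static List<double[]> sample(List<double[]> points, int size, Random rnd) {
        List<double[]> copy = new ArrayList<>(points);
        Collections.shuffle(copy, rnd);
        return copy.subList(0, Math.min(size, copy.size()));
    }

    // (6) Converge: largest coordinate change between old and new centroids.
    static double converge(double[][] before, double[][] after) {
        double delta = 0;
        for (int c = 0; c < before.length; c++)
            for (int j = 0; j < before[c].length; j++)
                delta = Math.max(delta, Math.abs(before[c][j] - after[c][j]));
        return delta;
    }

    // (7) Loop: continue while the centroids still move and the iteration budget is not spent.
    static boolean loop(double delta, int iteration, double tolerance, int maxIterations) {
        return delta > tolerance && iteration < maxIterations;
    }

    // Wires the seven roles together in the order described in the list above.
    static double[][] run(List<String> rawLines, int k, double tolerance, int maxIterations) {
        List<double[]> points = new ArrayList<>();
        for (String line : rawLines) points.add(transform(line));
        double[][] centroids = stage(points, k);
        Random rnd = new Random(42); // fixed seed so the sketch is reproducible
        double delta = Double.MAX_VALUE;
        for (int it = 0; loop(delta, it, tolerance, maxIterations); it++) {
            List<double[]> batch = sample(points, points.size(), rnd); // k-means uses all points
            int[] assignment = new int[batch.size()];
            for (int i = 0; i < batch.size(); i++) assignment[i] = compute(batch.get(i), centroids);
            double[][] next = update(batch, assignment, centroids);
            delta = converge(centroids, next);
            centroids = next;
        }
        return centroids;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("0,0", "10,10", "0,1", "10,11", "1,0", "11,10");
        System.out.println(Arrays.deepToString(run(lines, 2, 1e-9, 100)));
    }
}
```

In this sketch a gradient descent variant would differ mainly in `sample` (a small batch instead of all points), `compute` (a per-point gradient), and `update` (a step on the weights), while the surrounding loop stays the same.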
| 45 | + |
| 46 | +## Example runs |
| 47 | +- KMeans: |
| 48 | + |
| 49 | +```shell |
| 50 | +./bin/wayang-submit org.apache.wayang.ml4all.examples.RunKMeans java,spark file:///Users/zoi/Work/WAYANG/forked/incubator-wayang/wayang-ml4all/src/main/resources/input/USCensus1990-sample.input 3 68 0 1 |
| 51 | +``` |
| 52 | + |
| 53 | +- SGD: |
| 54 | +```shell |
| 55 | +./bin/wayang-submit org.apache.wayang.ml4all.examples.RunSGD spark file:///Users/zoi/Work/WAYANG/forked/incubator-wayang/wayang-ml4all/src/main/resources/input/adult.zeros.input 100827 123 10 0.001 |
| 56 | +``` |