mleap/README.Rmd at master · r-spark/mleap · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
---
title: "R interface for MLeap"
output:
  github_document:
    fig_width: 9
    fig_height: 5
---

```{r setup, include = FALSE}
library(mleap)
```

[![Travis build status](https://travis-ci.org/rstudio/mleap.svg?branch=master)](https://travis-ci.org/rstudio/mleap) [![Coverage status](https://codecov.io/gh/rstudio/mleap/branch/master/graph/badge.svg)](https://codecov.io/github/rstudio/mleap?branch=master) [![CRAN status](https://www.r-pkg.org/badges/version/mleap)](https://cran.r-project.org/package=mleap)

**mleap** is a [sparklyr](http://spark.rstudio.com/) extension that provides an interface to [MLeap](https://github.com/combust/mleap), which allows us to take Spark pipelines to production.

## Getting started

**mleap** can be installed from CRAN via

```{r eval = FALSE}
install.packages("mleap")
```

or, for the latest development version from GitHub, using

```{r eval = FALSE}
devtools::install_github("rstudio/mleap")
```

Once mleap has been installed, we can install the external dependencies using

```{r eval = FALSE}
library(mleap)
install_maven()
# Alternatively, if you already have Maven installed, you can
#  set options(maven.home = "path/to/maven")
install_mleap()
```

We can now export Spark ML pipelines from sparklyr.

```{r, message = FALSE}
library(sparklyr)
sc <- spark_connect(master = "local", version = "2.3.0")
mtcars_tbl <- sdf_copy_to(sc, mtcars, overwrite = TRUE)

# Create a pipeline and fit it
pipeline <- ml_pipeline(sc) %>%
  ft_binarizer("hp", "big_hp", threshold = 100) %>%
  ft_vector_assembler(c("big_hp", "wt", "qsec"), "features") %>%
  ml_gbt_regressor(label_col = "mpg")
pipeline_model <- ml_fit(pipeline, mtcars_tbl)

# A transformed data frame with the appropriate schema is required
#   for exporting the pipeline model
transformed_tbl <- ml_transform(pipeline_model, mtcars_tbl)

# Export model
model_path <- file.path(tempdir(), "mtcars_model.zip")
ml_write_bundle(pipeline_model, transformed_tbl, model_path)

# Disconnect from Spark
spark_disconnect(sc)
```

At this point, we can share `mtcars_model.zip` with our deployment/implementation engineers, and they would be able to embed the model in another application. See the [MLeap docs](http://mleap-docs.combust.ml/) for details.

We also provide R functions for testing that the saved models behave as expected. Here we load the previously saved model:

```{r}
model <- mleap_load_bundle(model_path)
model
```

We can retrieve the schema associated with the model:

```{r}
mleap_model_schema(model)
```

Then, we create a new data frame to be scored, and make predictions using our model:

```{r}
newdata <- tibble::tribble(
  ~qsec, ~hp, ~wt,
  16.2,  101, 2.68,
  18.1,  99,  3.08
)

# Transform the data frame
transformed_df <- mleap_transform(model, newdata)
dplyr::glimpse(transformed_df)
```