
Commit c5744a4

Remove MLDataset document in README
1 parent 2df6b77 commit c5744a4

1 file changed: +3 -23 lines changed


README.md

+3 -23
@@ -1,6 +1,6 @@
 # RayDP
 
-RayDP is a distributed data processing library that provides simple APIs for running Spark on [Ray](https://github.com/ray-project/ray) and integrating Spark with distributed deep learning and machine learning frameworks. RayDP makes it simple to build distributed end-to-end data analytics and AI pipeline. Instead of using lots of glue code or an orchestration framework to stitch multiple distributed programs, RayDP allows you to write Spark, PyTorch, Tensorflow, XGBoost code in a single python program with increased productivity and performance. You can build an end-to-end pipeline on a single Ray cluster by using Spark for data preprocessing, RaySGD or Horovod for distributed deep learning, RayTune for hyperparameter tuning and RayServe for model serving.
+RayDP is a distributed data processing library that provides simple APIs for running Spark on [Ray](https://github.com/ray-project/ray) and integrating Spark with distributed deep learning and machine learning frameworks. RayDP makes it simple to build distributed end-to-end data analytics and AI pipeline. Instead of using lots of glue code or an orchestration framework to stitch multiple distributed programs, RayDP allows you to write Spark, PyTorch, Tensorflow, XGBoost code in a single python program with increased productivity and performance. You can build an end-to-end pipeline on a single Ray cluster by using Spark for data preprocessing, Ray Train or Horovod for distributed deep learning, RayTune for hyperparameter tuning and RayServe for model serving.
 
 ## Installation
 

@@ -85,7 +85,7 @@ raydp.stop_spark()
 
 ## Machine Learning and Deep Learning With a Spark DataFrame
 
-RayDP provides APIs for converting a Spark DataFrame to a Ray Dataset or Ray MLDataset which can be consumed by XGBoost, RaySGD or Horovod on Ray. RayDP also provides high level scikit-learn style Estimator APIs for distributed training with PyTorch or Tensorflow.
+RayDP provides APIs for converting a Spark DataFrame to a Ray Dataset which can be consumed by XGBoost, Ray Train or Horovod on Ray. RayDP also provides high level scikit-learn style Estimator APIs for distributed training with PyTorch or Tensorflow.
 
 
 ***Spark DataFrame <=> Ray Dataset***
@@ -109,29 +109,9 @@ df2 = ds2.to_spark(spark)
 ```
 Please refer to [Spark+XGBoost on Ray](./examples/xgboost_ray_nyctaxi.py) for a full example.
 
-***Spark DataFrame => Ray MLDataset***
-
-RayDP provides an API for creating a Ray MLDataset from a Spark dataframe. MLDataset can be converted to a PyTorch or Tensorflow dataset for distributed training with Horovod on Ray or RaySGD. MLDataset is also supported by XGBoost on Ray as a data source.
-
-```python
-import ray
-import raydp
-from raydp.spark import RayMLDataset
-
-ray.init()
-spark = raydp.init_spark(app_name="RayDP Example",
-                         num_executors=2,
-                         executor_cores=2,
-                         executor_memory="4GB")
-
-df = spark.range(0, 1000)
-ds = RayMLDataset.from_spark(df, num_shards=10)
-```
-Please refer to [Spark+Horovod on Ray](./examples/horovod_nyctaxi.py) for a full example.
-
 ***Estimator API***
 
-The Estimator APIs allow you to train a deep neural network directly on a Spark DataFrame, leveraging Ray’s ability to scale out across the cluster. The Estimator APIs are wrappers of RaySGD and hide the complexity of converting a Spark DataFrame to a PyTorch/Tensorflow dataset and distributing the training. RayDP provides `raydp.torch.TorchEstimator` for PyTorch and `raydp.tf.TFEstimator` for Tensorflow. The following is an example of using TorchEstimator.
+The Estimator APIs allow you to train a deep neural network directly on a Spark DataFrame, leveraging Ray’s ability to scale out across the cluster. The Estimator APIs are wrappers of Ray Train and hide the complexity of converting a Spark DataFrame to a PyTorch/Tensorflow dataset and distributing the training. RayDP provides `raydp.torch.TorchEstimator` for PyTorch and `raydp.tf.TFEstimator` for Tensorflow. The following is an example of using TorchEstimator.
 
 ```python
 import ray