remove usage of ray.put(data, _owner) and private ray object ownership manipulation API (#454)
* remove usage of ray.put(data, owner)
* use mixin base class
* fix test
* eliminate usage of deserialize_and_register_object_ref
* distributed owner actors
* fix test
* reimplement recoverable conversion
* make data fetch task resource configurable
* only support ray 2.37.0 and beyond
* fix pyspark internal cache
* add test against ray 2.50.0
* remove usage of dashboard_grpc_port
* fix read_parquet on ray 2.5x, make tf test work on Apple silicon
* implement single owner
* rename RayDPBlockStoreActorRegistry to RayDPDataOwner
* add test to gate data locality
* align everything to use from_spark_recoverable
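For context on the private API this PR removes: `ray.put` accepts a private `_owner` argument that transfers ownership of the created object to a long-lived actor, so the object outlives the worker that created it. Below is a minimal sketch of that old pattern (the actor name is hypothetical; `_owner` is a private, unstable parameter, which is why this PR replaces it with dedicated owner actors and lineage-based recovery):

```python
import ray

ray.init()


@ray.remote
class ObjectOwner:
    """Long-lived actor that holds ownership of objects (hypothetical name)."""

    def ready(self):
        return True


owner = ObjectOwner.remote()
ray.get(owner.ready.remote())  # make sure the actor is alive

# The pattern being removed: the private _owner argument assigns ownership
# of the object to the actor, so the object survives even if the worker
# that called ray.put exits.
ref = ray.put(b"arrow-ipc-bytes", _owner=owner)

print(ray.get(ref))
```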
README.md: 8 additions & 11 deletions
@@ -153,28 +153,25 @@ Please refer to [NYC Taxi PyTorch Estimator](./examples/pytorch_nyctaxi.py) and
 
 ***Fault Tolerance***
 
-The ray dataset converted from spark dataframe like above is not fault-tolerant. This is because we implement it using `Ray.put` combined with spark `mapPartitions`. Objects created by `Ray.put` is not recoverable in Ray.
+RayDP now converts Spark DataFrames to Ray Datasets using a recoverable pipeline by default. This makes the resulting Ray Dataset resilient to Spark executor loss (the Arrow IPC bytes are cached in Spark and fetched via Ray tasks with lineage).
+
+The recoverable conversion is also available directly via `raydp.spark.from_spark_recoverable`, and it persists (caches) the Spark DataFrame. You can provide the storage level through the `storage_level` keyword parameter.
 
-RayDP now supports converting data in a way such that the resulting ray dataset is fault-tolerant. This feature is currently *experimental*. Here is how to use it:
 
 ```python
 import ray
 import raydp
 
 ray.init(address="auto")
-# set fault_tolerance_mode to True to enable the feature
...
-Notice that `from_spark_recoverable` will persist the converted dataframe. You can provide the storage level through keyword parameter `storage_level`. In addition, this feature is not available in ray client mode. If you need to use ray client, please wrap your application in a ray actor, as described in the ray client chapter.
+
+Note: recoverable conversion is not available in Ray client mode. If you need to use Ray client, wrap your application in a Ray actor as described in the Ray client docs.
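To illustrate the documented API, here is a minimal end-to-end sketch of the recoverable conversion. The `init_spark` resource values are illustrative, and passing a PySpark `StorageLevel` to the `storage_level` keyword is an assumption based on the README wording:

```python
import ray
import raydp
from pyspark import StorageLevel

ray.init(address="auto")

# Resource values below are illustrative, not prescriptive.
spark = raydp.init_spark(app_name="recoverable_conversion_demo",
                         num_executors=2,
                         executor_cores=2,
                         executor_memory="2GB")

df = spark.range(0, 100000)

# from_spark_recoverable persists (caches) the Spark DataFrame, so the
# resulting Ray Dataset can be re-fetched with lineage if a Spark executor
# is lost. Using a PySpark StorageLevel here is an assumption based on the
# "storage_level" keyword described in the README diff.
ds = raydp.spark.from_spark_recoverable(
    df, storage_level=StorageLevel.MEMORY_AND_DISK)

print(ds.count())

raydp.stop_spark()
ray.shutdown()
```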