Commit b87fc1b

devin-petersohn authored and Alex committed
lint/docs
1 parent c465171 commit b87fc1b

3 files changed: +101 -2 lines changed

doc/source/index.rst

+1
@@ -312,6 +312,7 @@ Papers
 joblib.rst
 iter.rst
 xgboost-ray.rst
+modin/index.rst
 dask-on-ray.rst
 mars-on-ray.rst
 raydp.rst

doc/source/modin/index.rst

+97
@@ -0,0 +1,97 @@
Modin (Pandas on Ray)
=====================

Modin_, previously Pandas on Ray, is a dataframe manipulation library that
allows users to speed up their pandas workloads by acting as a drop-in
replacement. Modin also provides support for other APIs (e.g. spreadsheet)
and libraries, such as XGBoost.

.. code-block:: python

    import modin.pandas as pd
    import ray

    ray.init()
    df = pd.read_parquet("s3://my-bucket/big.parquet")

You can use Modin on Ray on your laptop or on a cluster. This document
shows how to set up a Modin-compatible Ray cluster and how to connect
Modin to Ray.

.. note:: In previous versions of Modin, you had to initialize Ray before importing Modin. As of Modin 0.9.0, this is no longer the case.

Using Modin with Ray's autoscaler
---------------------------------

To use Modin with :ref:`Ray's autoscaler <cluster-index>`, you need to ensure that the
correct dependencies are installed at startup. Modin's repository has an
example `yaml file and set of tutorial notebooks`_ that set up a Ray
cluster with the correct dependencies. Once the cluster is up, connect Modin
by importing it and initializing Ray with the cluster address.

.. code-block:: python

    import modin.pandas as pd
    import ray

    ray.init(address="auto")
    df = pd.read_parquet("s3://my-bucket/big.parquet")

As long as Ray is initialized before any dataframes are created, Modin
will be able to connect to and use the Ray cluster.

Modin with the Ray Client
-------------------------

When using Modin with the :ref:`Ray Client <ray-client>`, it is important to ensure that the
cluster has all dependencies installed.

.. code-block:: python

    import modin.pandas as pd
    import ray
    import ray.util

    # Placeholder address: replace with the host and port of your Ray Client server.
    ray.util.connect("<head_node_host>:<port>")
    df = pd.read_parquet("s3://my-bucket/big.parquet")

Modin will automatically use the Ray Client for computation when the file
is read.

How Modin uses Ray
------------------

Modin has a layered architecture. Its core abstraction for data manipulation
is the Modin Dataframe, which implements a novel algebra that enables Modin to
handle all of pandas (see Modin's documentation_ for more on the architecture).
Modin's internal dataframe object has a scheduling layer that partitions the
data and operates on the partitions with Ray.
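
The partition-then-operate idea can be illustrated with a minimal, self-contained
sketch. This is not Modin's implementation: a standard-library thread pool stands
in for Ray tasks, and the helper names ``partition_rows``, ``map_partitions`` and
``double_block`` are hypothetical.

.. code-block:: python

    from concurrent.futures import ThreadPoolExecutor


    def partition_rows(rows, num_partitions):
        """Split a list of rows into contiguous, roughly equal row blocks."""
        size, extra = divmod(len(rows), num_partitions)
        blocks, start = [], 0
        for i in range(num_partitions):
            stop = start + size + (1 if i < extra else 0)
            # Tuples keep each block immutable, mirroring an immutable object store.
            blocks.append(tuple(rows[start:stop]))
            start = stop
        return blocks


    def map_partitions(func, blocks, max_workers=4):
        """Apply func to every block in parallel; results keep the block order."""
        # Assumption: the thread pool is only a stand-in for scheduling Ray tasks.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(func, blocks))


    def double_block(block):
        """Example per-partition operation that returns a new block."""
        return tuple(value * 2 for value in block)


    blocks = partition_rows(list(range(10)), num_partitions=3)
    doubled = map_partitions(double_block, blocks)

Each operation produces new blocks rather than modifying the inputs, so the
original partitions remain available to any other work that still refers to them.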

Dataframe operations
''''''''''''''''''''

The Modin Dataframe uses Ray tasks to perform data manipulations. Ray tasks have
a number of benefits over the actor model for data manipulation:

- Multiple tasks may manipulate the same objects simultaneously.
- Objects in Ray's object store are immutable, making provenance and lineage easier
  to track.
- As new workers come online, data is shuffled to them as tasks are
  scheduled on the new node.
- Identical partitions need not be replicated, which is especially beneficial for
  operations that selectively mutate the data (e.g. ``fillna``).
- Finer-grained parallelism with finer-grained placement control.
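
The point about identical partitions can be sketched in plain Python. This is an
illustration under stated assumptions rather than Modin's code: partitions are
modeled as immutable tuples, ``None`` marks a missing value, and ``fillna_block``
is a hypothetical helper.

.. code-block:: python

    def fillna_block(block, fill_value):
        """Return a block with None replaced, reusing the input if nothing changes."""
        if None not in block:
            # No missing values: the same immutable object can be shared as-is.
            return block
        return tuple(fill_value if value is None else value for value in block)


    original = [(1, 2, 3), (4, None, 6), (7, 8, 9)]
    filled = [fillna_block(block, 0) for block in original]

    # True where the output partition is the very same object as the input one.
    shared = [new is old for new, old in zip(filled, original)]

Only the partition that actually contained a missing value is rebuilt; the other
two are shared between the original and the filled result.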

Machine Learning
''''''''''''''''

Modin uses Ray actors for the machine learning support it currently provides.
Modin's implementation of XGBoost spins up one actor for each node
and aggregates all of the partitions on that node to the XGBoost actor. Modin
is able to specify precisely the node IP for each actor on creation, giving
fine-grained control over placement, which is essential for distributed training
performance.
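
The one-actor-per-node aggregation can be sketched as follows. The sketch only
models the bookkeeping in plain Python: ``TrainingActorStub`` and
``group_partitions_by_node`` are hypothetical names, the IP addresses are made
up, and no Ray or XGBoost API is used.

.. code-block:: python

    from collections import defaultdict


    def group_partitions_by_node(partition_locations):
        """Map each node IP to the partition ids stored on that node."""
        by_node = defaultdict(list)
        for partition_id, node_ip in partition_locations:
            by_node[node_ip].append(partition_id)
        return dict(by_node)


    class TrainingActorStub:
        """Hypothetical stand-in for a training actor pinned to one node."""

        def __init__(self, node_ip):
            self.node_ip = node_ip
            self.partitions = []

        def add_partitions(self, partition_ids):
            self.partitions.extend(partition_ids)


    # Made-up (partition id, node IP) pairs for illustration.
    locations = [
        ("p0", "10.0.0.1"),
        ("p1", "10.0.0.2"),
        ("p2", "10.0.0.1"),
        ("p3", "10.0.0.2"),
    ]
    grouped = group_partitions_by_node(locations)

    # One actor per node, each receiving only the partitions local to its node.
    actors = {node_ip: TrainingActorStub(node_ip) for node_ip in grouped}
    for node_ip, partition_ids in grouped.items():
        actors[node_ip].add_partitions(partition_ids)

Keeping each actor's partitions local to its node avoids moving data across the
network before training starts.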

.. _Modin: https://github.com/modin-project/modin
.. _documentation: https://modin.readthedocs.io/en/latest/developer/architecture.html
.. _yaml file and set of tutorial notebooks: https://github.com/modin-project/modin/tree/master/examples/tutorial/tutorial_notebooks/cluster

doc/source/ray-client.rst

+3-2
@@ -1,3 +1,5 @@
+.. _ray-client:
+
 **********
 Ray Client
 **********
@@ -33,8 +35,7 @@ From here, another Ray script can access that server from a networked machine wi
 do_work.remote(2)
 #....

-When the client disconnects, any object or actor references held by the server on behalf of the client are dropped, as if directly disconnecting from the cluster
-
+When the client disconnects, any object or actor references held by the server on behalf of the client are dropped, as if directly disconnecting from the cluster.

 ===================
 ``RAY_CLIENT_MODE``
