| 1 | +Modin (Pandas on Ray) |
| 2 | +===================== |
| 3 | + |
| 4 | +Modin_, previously Pandas on Ray, is a dataframe manipulation library that |
| 5 | +speeds up pandas workloads by acting as a drop-in replacement for pandas. |
| 6 | +Modin also supports other APIs (e.g. spreadsheet) and libraries, such as |
| 7 | +XGBoost. |
| 8 | + |
| 9 | +.. code-block:: python |
| 10 | + |
| 11 | +    import modin.pandas as pd |
| 12 | +    import ray |
| 13 | + |
| 14 | +    ray.init() |
| 15 | +    df = pd.read_parquet("s3://my-bucket/big.parquet") |
| 16 | + |
| 17 | +You can use Modin on Ray on your laptop or on a cluster. This document |
| 18 | +shows how to set up a Modin-compatible Ray cluster and connect Modin |
| 19 | +to Ray. |
| 20 | + |
| 21 | +.. note:: In previous versions of Modin, you had to initialize Ray before importing Modin. As of Modin 0.9.0, this is no longer the case. |
| 22 | + |
| 23 | +Using Modin with Ray's autoscaler |
| 24 | +--------------------------------- |
| 25 | + |
| 26 | +To use Modin with :ref:`Ray's autoscaler <cluster-index>`, you need to ensure that the |
| 27 | +correct dependencies are installed at startup. Modin's repository has an |
| 28 | +example `yaml file and set of tutorial notebooks`_ to ensure that the Ray |
| 29 | +cluster has the correct dependencies. Once the cluster is up, connect Modin |
| 30 | +by importing it and initializing Ray with the cluster address. |
| 31 | + |
| 32 | +.. code-block:: python |
| 33 | + |
| 34 | +    import modin.pandas as pd |
| 35 | +    import ray |
| 36 | + |
| 37 | +    ray.init(address="auto") |
| 38 | +    df = pd.read_parquet("s3://my-bucket/big.parquet") |
| 39 | + |
| 40 | +As long as Ray is initialized before any dataframes are created, Modin |
| 41 | +will be able to connect to and use the Ray cluster. |
| 42 | + |
| 43 | +Modin with the Ray Client |
| 44 | +------------------------- |
| 45 | + |
| 46 | +When using Modin with the :ref:`Ray Client <ray-client>`, it is important to ensure that the |
| 47 | +cluster has all dependencies installed. |
| 48 | + |
| 49 | +.. code-block:: python |
| 50 | + |
| 51 | +    import modin.pandas as pd |
| 52 | +    import ray |
| 53 | +    import ray.util |
| 54 | + |
| 55 | +    ray.util.connect("<head-node-host>:10001")  # placeholder: replace with your Ray Client server address |
| 56 | +    df = pd.read_parquet("s3://my-bucket/big.parquet") |
| 57 | + |
| 58 | +Modin will automatically use the Ray Client for computation when the file |
| 59 | +is read. |
| 60 | + |
| 61 | +How Modin uses Ray |
| 62 | +------------------ |
| 63 | + |
| 64 | +Modin has a layered architecture, and the core abstraction for data manipulation |
| 65 | +is the Modin Dataframe, which implements a novel algebra that enables Modin to |
| 66 | +handle all of pandas (see Modin's documentation_ for more on the architecture). |
| 67 | +Modin's internal dataframe object has a scheduling layer that is able to partition |
| 68 | +and operate on data with Ray. |
| 69 | + |
| 70 | +Dataframe operations |
| 71 | +'''''''''''''''''''' |
| 72 | + |
| 73 | +The Modin Dataframe uses Ray tasks to perform data manipulations. Ray tasks have |
| 74 | +a number of benefits over the actor model for data manipulation: |
| 75 | + |
| 76 | +- Multiple tasks may manipulate the same objects simultaneously. |
| 77 | +- Objects in Ray's object store are immutable, making provenance and lineage easier |
| 78 | +  to track. |
| 79 | +- As new workers come online, data is shuffled to them as tasks are |
| 80 | +  scheduled on the new nodes. |
| 81 | +- Identical partitions need not be replicated, which is especially beneficial for operations |
| 82 | +  that selectively mutate the data (e.g. ``fillna``). |
| 83 | +- Tasks allow finer-grained parallelism with finer-grained placement control. |
| 84 | + |
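The point about identical partitions can be illustrated with a minimal sketch in plain Python. It is not Modin's implementation: the standard library's thread pool stands in for Ray tasks, tuples stand in for immutable objects in the object store, and the names ``fill_missing`` and ``fillna_partitions`` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def fill_missing(partition, value):
    """Return a new partition with None replaced, or the same object if unchanged."""
    if None not in partition:
        # Nothing to fill: hand back the existing immutable partition
        # instead of building a copy.
        return partition
    return tuple(value if x is None else x for x in partition)


def fillna_partitions(partitions, value, max_workers=4):
    """Run fill_missing on every partition as an independent task."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input partitions.
        return list(pool.map(lambda p: fill_missing(p, value), partitions))


# Three partitions; only the middle one contains a missing value.
partitions = [(1, 2, 3), (4, None, 6), (7, 8, 9)]
filled = fillna_partitions(partitions, 0)
```

In this sketch only the partition that contains a missing value is rebuilt. The other two entries of ``filled`` are the very same objects as in ``partitions``, and the original partitions are never modified.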
| 85 | +Machine Learning |
| 86 | +'''''''''''''''' |
| 87 | + |
| 88 | +Modin uses Ray actors for the machine learning support it currently provides. |
| 89 | +Modin's implementation of XGBoost is able to spin up one actor for each node |
| 90 | +and aggregate all of the partitions on that node to the XGBoost actor. Modin |
| 91 | +is able to specify precisely the node IP for each actor on creation, giving |
| 92 | +fine-grained control over placement, which is essential for distributed |
| 93 | +training performance. |
| 94 | + |
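The one-actor-per-node pattern described above can be sketched in plain Python. This is an illustration under stated assumptions, not Modin's XGBoost code: ``NodeActor`` and ``assign_partitions`` are hypothetical names, an ordinary class stands in for a Ray actor, and the node IPs are made up.

```python
class NodeActor:
    """Stand-in for a per-node training actor (hypothetical, not a Modin class)."""

    def __init__(self, node_ip):
        self.node_ip = node_ip
        self.rows = []

    def add_partition(self, partition):
        # Aggregate a partition that lives on this node into the actor's data.
        self.rows.extend(partition)

    def num_rows(self):
        return len(self.rows)


def assign_partitions(located_partitions):
    """Create one actor per node and give each partition to its local actor.

    located_partitions is an iterable of (node_ip, partition) pairs.
    """
    actors = {}
    for node_ip, partition in located_partitions:
        if node_ip not in actors:
            actors[node_ip] = NodeActor(node_ip)
        actors[node_ip].add_partition(partition)
    return actors


# Made-up placement: two partitions on one node, one on another.
located = [
    ("10.0.0.1", [1, 2]),
    ("10.0.0.2", [3]),
    ("10.0.0.1", [4, 5, 6]),
]
actors = assign_partitions(located)
```

Keying the actors by node IP means each partition is handed to an actor on the node where it already resides, so in a real cluster the data would not need to cross the network before training.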
| 95 | +.. _Modin: https://github.com/modin-project/modin |
| 96 | +.. _documentation: https://modin.readthedocs.io/en/latest/developer/architecture.html |
| 97 | +.. _yaml file and set of tutorial notebooks: https://github.com/modin-project/modin/tree/master/examples/tutorial/tutorial_notebooks/cluster |