DataFusion Ray is a distributed execution framework that enables DataFusion DataFrame and SQL queries to run on a Ray cluster. This integration allows users to leverage Ray's dynamic scheduling capabilities while executing queries in a distributed fashion.
DataFusion Ray supports two execution modes:
This mode mimics the default execution strategy of DataFusion. Each operator in the query plan starts executing as soon as its inputs are available, leading to a more pipelined execution model.
Note: Batch Execution is not implemented yet. Tracking issue: #69
In this mode, execution follows a staged model similar to Apache Spark. Each query stage runs to completion, producing intermediate shuffle files that are persisted and used as input for the next stage.
See the contributor guide for instructions on building DataFusion Ray.
Once installed, you can run queries using DataFusion's familiar API while leveraging the distributed execution capabilities of Ray.
# from example in ./examples/http_csv.py
import ray
from datafusion_ray import DFRayContext, df_ray_runtime_env
ray.init(runtime_env=df_ray_runtime_env)
ctx = DFRayContext()
ctx.register_csv(
"aggregate_test_100",
"https://github.com/apache/arrow-testing/raw/master/data/csv/aggregate_test_100.csv",
)
df = ctx.sql("SELECT c1,c2,c3 FROM aggregate_test_100 LIMIT 5")
df.show()
Contributions are welcome! Please open an issue or submit a pull request if you would like to contribute. See the contributor guide for more information.
DataFusion Ray is licensed under Apache 2.0.