Skip to content

apache/datafusion-ray

DataFusion Ray

Apache licensed Python Tests Discord chat

Overview

DataFusion Ray is a distributed execution framework that enables DataFusion DataFrame and SQL queries to run on a Ray cluster. This integration allows users to leverage Ray's dynamic scheduling capabilities while executing queries in a distributed fashion.

Execution Modes

DataFusion Ray supports two execution modes:

Streaming Execution

This mode mimics the default execution strategy of DataFusion. Each operator in the query plan starts executing as soon as its inputs are available, leading to a more pipelined execution model.

Batch Execution

Note: Batch Execution is not implemented yet. Tracking issue: #69

In this mode, execution follows a staged model similar to Apache Spark. Each query stage runs to completion, producing intermediate shuffle files that are persisted and used as input for the next stage.

Getting Started

See the contributor guide for instructions on building DataFusion Ray.

Once installed, you can run queries using DataFusion's familiar API while leveraging the distributed execution capabilities of Ray.

# from example in ./examples/http_csv.py
import ray
from datafusion_ray import DFRayContext, df_ray_runtime_env

ray.init(runtime_env=df_ray_runtime_env)

ctx = DFRayContext()
ctx.register_csv(
    "aggregate_test_100",
    "https://github.com/apache/arrow-testing/raw/master/data/csv/aggregate_test_100.csv",
)

df = ctx.sql("SELECT c1,c2,c3 FROM aggregate_test_100 LIMIT 5")

df.show()

Contributing

Contributions are welcome! Please open an issue or submit a pull request if you would like to contribute. See the contributor guide for more information.

License

DataFusion Ray is licensed under Apache 2.0.