Proposal to donate Ray SQL to the DataFusion Project (not into the Python subproject) #872

Closed
@austin362667

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

In light of the complexities associated with maintaining and debugging Ballista, the community would like to propose exploring the adoption of Ray SQL within the DataFusion Python project.

Ray SQL, at only 1.7k lines of Rust, is significantly simpler than Ballista's 27k lines. Despite its smaller codebase, Ray SQL has demonstrated the ability to run all TPC-H queries, which shows that the simpler design is still robust. This reduction in complexity could ease maintenance and lower the learning curve for contributors while still providing a functional distributed SQL solution. Ray SQL can be much simpler than Ballista because Ray provides so much: there is no need for us to build scheduler and executor processes, since we can simply execute Python tasks in the Ray cluster.
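To illustrate the "just execute Python tasks" idea, here is a minimal sketch of a two-stage aggregation where each partition is handled by an independent task and the partial results are merged afterwards. It is not Ray SQL code: the data, `partial_sum`, `merge_sums`, and `run_query` are hypothetical names made up for this example, and a standard-library `ThreadPoolExecutor` stands in for the Ray cluster so the snippet runs without Ray installed. With Ray, the stage-one function would presumably be decorated with `@ray.remote` and its results collected with `ray.get`.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical input: each partition is a list of (key, value) rows.
PARTITIONS = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("c", 6)],
]


def partial_sum(rows):
    """Stage 1 task: SUM(value) GROUP BY key over a single partition."""
    totals = Counter()
    for key, value in rows:
        totals[key] += value
    return dict(totals)


def merge_sums(partials):
    """Stage 2 task: combine the per-partition partial aggregates."""
    totals = Counter()
    for partial in partials:
        totals.update(partial)  # Counter.update adds values per key
    return dict(sorted(totals.items()))


def run_query(partitions, max_workers=3):
    # The thread pool is only a stand-in for the Ray cluster (assumption):
    # the point is that no custom scheduler or executor process is written,
    # the tasks are simply handed to an existing task runner.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(partial_sum, partitions))
    return merge_sums(partials)


if __name__ == "__main__":
    print(run_query(PARTITIONS))
```

The design point is that the query engine only has to describe stages as plain functions; placement, retries, and worker lifecycle are left to the task runner.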

If there is enough interest and support, we could start working on bringing the Ray SQL code into the DataFusion Python project.

Describe the solution you'd like

The following options were suggested by @andygrove; all feedback is welcome!

Bringing the Ray SQL prototype into the DataFusion Python project

Describe alternatives you've considered

Building a new version inspired by the Ray SQL code

Additional context

We might need to go through the Apache IP clearance process to import the external Ray SQL codebase, which currently lives in datafusion-contrib and is not part of Apache.

Labels: enhancement
