Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In light of the complexities associated with maintaining and debugging Ballista, the community would like to propose exploring the adoption of Ray SQL within the DataFusion Python project.
Ray SQL, at only 1.7k lines of Rust, is significantly simpler than Ballista's 27k lines. Despite its smaller codebase, Ray SQL has demonstrated the ability to run all TPC-H queries, which suggests that the simpler design is still functional. This reduction in complexity could ease maintenance and lower the learning curve for contributors while still providing a working distributed SQL solution. It can be much simpler than Ballista because Ray provides so much: there is no need for us to build scheduler and executor processes, since we can simply execute Python tasks in the Ray cluster.
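To illustrate what "simply execute Python tasks in the Ray cluster" could look like, here is a minimal sketch. It is not taken from the Ray SQL code: `run_stage_partition` and `run_stage` are hypothetical names, and the task body is a placeholder for executing a query plan fragment. Only `ray.init`, `ray.remote`, and `ray.get` are real Ray calls. The sketch falls back to a local thread pool when Ray is not installed so that it stays runnable.

```python
from concurrent.futures import ThreadPoolExecutor

try:
    import ray  # optional; the sketch falls back to threads without it
except ImportError:
    ray = None


def run_stage_partition(stage_id: int, partition: int) -> str:
    # Hypothetical placeholder: a real task would execute one partition
    # of a query stage (for example via DataFusion) and return or write
    # its results.
    return f"stage {stage_id}, partition {partition}"


def run_stage(stage_id: int, num_partitions: int) -> list[str]:
    """Run every partition of one query stage as an independent task."""
    if ray is not None:
        # Ray handles scheduling and execution of the tasks, so no
        # custom scheduler or executor process is needed.
        ray.init(ignore_reinit_error=True)
        task = ray.remote(run_stage_partition)
        refs = [task.remote(stage_id, p) for p in range(num_partitions)]
        return ray.get(refs)

    # Local stand-in for the cluster when Ray is unavailable.
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(run_stage_partition, stage_id, p)
            for p in range(num_partitions)
        ]
        return [f.result() for f in futures]


if __name__ == "__main__":
    print(run_stage(0, 3))
```

The point of the sketch is that each partition of a stage becomes a plain Python task, and Ray takes over the scheduling work that Ballista implements itself.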
If there is enough interest and support, we could start working on bringing the Ray SQL code into the DataFusion Python project.
Describe the solution you'd like
The following options were suggested by @andygrove; all feedback is welcome!
Bringing the Ray SQL prototype into the DataFusion Python project
Describe alternatives you've considered
Building a new version inspired by the Ray SQL code
Additional context
We might need to go through the Apache IP clearance process to import the Ray SQL codebase, because it currently lives in datafusion-contrib, which is not part of Apache.