Skip to content

Commit d5c276a

Browse files
committed
[Docs] Update ReadmMe
Signed-off-by: Austin Liu <[email protected]>
1 parent 78defa4 commit d5c276a

File tree

1 file changed

+15
-11
lines changed

1 file changed

+15
-11
lines changed

README.md

+15-11
Original file line numberDiff line numberDiff line change
@@ -17,11 +17,13 @@
1717
under the License.
1818
-->
1919

20-
# datafusion-ray: DataFusion on Ray
20+
# DataFusion on Ray
2121

22-
This is a research project to evaluate performing distributed SQL queries from Python, using
22+
> This is a originally a research project donated from [ray-sql](https://github.com/datafusion-contrib/ray-sql) to evaluate performing distributed SQL queries from Python, using
2323
[Ray](https://www.ray.io/) and [DataFusion](https://github.com/apache/arrow-datafusion).
2424

25+
DataFusion Ray is a distributed SQL query engine powered by the Rust implementation of [Apache Arrow](https://arrow.apache.org/), [Apache DataFusion](https://datafusion.apache.org/) and [Ray](https://www.ray.io/).
26+
2527
## Goals
2628

2729
- Demonstrate how easily new systems can be built on top of DataFusion. See the [design documentation](./docs/README.md)
@@ -31,7 +33,9 @@ This is a research project to evaluate performing distributed SQL queries from P
3133

3234
## Non Goals
3335

34-
- Build and support a production system.
36+
- Re-build the cluster scheduling systems like what [Ballista](https://datafusion.apache.org/ballista/) did.
37+
- Ballista is extremely complex and utilizing Ray feels like it abstracts some of that complexity away.
38+
- Datafusion Ray is delegating cluster management to Ray.
3539

3640
## Example
3741

@@ -42,7 +46,7 @@ import os
4246
import pandas as pd
4347
import ray
4448

45-
from raysql import RaySqlContext
49+
from datafusion_ray import RaySqlContext
4650

4751
SCRIPT_DIR = os.path.dirname(os.path.realpath(__file__))
4852

@@ -64,7 +68,7 @@ for record_batch in result_set:
6468

6569
## Status
6670

67-
- RaySQL can run all queries in the TPC-H benchmark
71+
- DataFusion Ray can run all queries in the TPC-H benchmark
6872

6973
## Features
7074

@@ -73,29 +77,29 @@ for record_batch in result_set:
7377

7478
## Limitations
7579

76-
- Requires a shared file system currently
80+
- Requires a shared file system currently. Check details [here](./docs/README.md#distributed-shuffle).
7781

7882
## Performance
7983

80-
This chart shows the performance of RaySQL compared to Apache Spark for
84+
This chart shows the performance of DataFusion Ray compared to Apache Spark for
8185
[SQLBench-H](https://sqlbenchmarks.io/sqlbench-h/) at a very small data set (10GB), running on a desktop (Threadripper
82-
with 24 physical cores). Both RaySQL and Spark are configured with 24 executors.
86+
with 24 physical cores). Both DataFusion Ray and Spark are configured with 24 executors.
8387

8488
### Overall Time
8589

86-
RaySQL is ~1.9x faster overall for this scale factor and environment with disk-based shuffle.
90+
DataFusion Ray is ~1.9x faster overall for this scale factor and environment with disk-based shuffle.
8791

8892
![SQLBench-H Total](./docs/sqlbench-h-total.png)
8993

9094
### Per Query Time
9195

92-
Spark is much faster on some queries, likely due to broadcast exchanges, which RaySQL hasn't implemented yet.
96+
Spark is much faster on some queries, likely due to broadcast exchanges, which DataFusion Ray hasn't implemented yet.
9397

9498
![SQLBench-H Per Query](./docs/sqlbench-h-per-query.png)
9599

96100
### Performance Plan
97101

98-
I'm planning on experimenting with the following changes to improve performance:
102+
Plans on experimenting with the following changes to improve performance:
99103

100104
- Make better use of Ray futures to run more tasks in parallel
101105
- Use Ray object store for shuffle data transfer to reduce disk I/O cost

0 commit comments

Comments
 (0)