Commit f263522

Update Datafusion Ray architecture docs (#27)
* Update Datafusion Ray architecture docs

  Signed-off-by: Austin Liu <[email protected]>

* Focus on current architecture

  Signed-off-by: Austin Liu <[email protected]>

---------

Signed-off-by: Austin Liu <[email protected]>

1 parent 9ed55ca commit f263522

1 file changed

+9
-13
lines changed

docs/README.md

@@ -17,12 +17,12 @@
1717
under the License.
1818
-->
1919

20-
# RaySQL Design Documentation
20+
# DataFusion Ray Design Documentation
2121

22-
RaySQL is a distributed SQL query engine that is powered by DataFusion.
22+
DataFusion Ray is a distributed SQL query engine that is powered by DataFusion and Ray.
2323

2424
DataFusion provides a high-performance query engine that is already partition-aware, with partitions being executed
25-
in parallel in separate threads. RaySQL provides a distributed query planner that translates a DataFusion physical
25+
in parallel in separate threads. DataFusion Ray provides a distributed query planner that translates a DataFusion physical
2626
plan into a distributed plan.
2727

2828
Let's walk through an example to see how that works. We'll use [SQLBench-H](https://github.com/sql-benchmarks/sqlbench-h)
@@ -83,9 +83,6 @@ DataFusion's physical plan lists all the files to be queried, and they are organ
8383
parallel execution within a single process. In this example, the level of concurrency was configured to be four, so
8484
we see `partitions={4 groups: [[ ... ]]` in the leaf `ParquetExec` nodes, with the filenames listed in four groups.
8585

86-
_DataFusion will soon support parallel execution for single Parquet files but for now the parallelism is based on
87-
splitting the available files into separate groups, so RaySQL will not yet scale well for single-file inputs._
88-
8986
Here is the full physical plan for query 3.
9087

9188
```text
@@ -123,7 +120,7 @@ GlobalLimitExec: skip=0, fetch=10
123120
## Partitioning & Distribution
124121

125122
The partitioning scheme changes throughout the plan and this is the most important concept to
126-
understand in order to understand RaySQL's design. Changes in partitioning are implemented by the `RepartitionExec`
123+
understand in order to understand DataFusion Ray's design. Changes in partitioning are implemented by the `RepartitionExec`
127124
operator in DataFusion and are happen in the following scenarios.
128125

129126
### Joins
@@ -155,7 +152,7 @@ Sort also has multiple approaches.
155152
- The input partitions can be collapsed down to a single partition and then sorted
156153
- Partitions can be sorted in parallel and then merged using a sort-preserving merge
157154

158-
DataFusion and RaySQL currently the first approach, but there is a DataFusion PR open for implementing the second.
155+
DataFusion and DataFusion Ray currently choose the first approach, but there is a DataFusion PR open for implementing the second.
159156

160157
### Limit
161158

@@ -260,13 +257,12 @@ child plans, building up a DAG of futures.
260257

261258
## Distributed Shuffle
262259

263-
The output of each query stage needs to be persisted somewhere so that the next query stage can read it. Currently,
264-
RaySQL is just writing the output to disk in Arrow IPC format, and this means that RaySQL is not truly distributed
265-
yet because it requires a shared file system. It would be better to use the Ray object store instead, as
266-
proposed [here](https://github.com/datafusion-contrib/ray-sql/issues/22).
260+
The output of each query stage needs to be persisted somewhere so that the next query stage can read it.
261+
262+
DataFusion Ray uses the Ray object store as a shared file system, which was proposed [here](https://github.com/datafusion-contrib/ray-sql/issues/22) and implemented [here](https://github.com/datafusion-contrib/ray-sql/pull/33).
267263

268264
DataFusion's `RepartitionExec` uses threads and channels within a single process and is not suitable for a
269-
distributed query engine, so RaySQL rewrites the physical plan and replaces the `RepartionExec` with a pair of
265+
distributed query engine, so DataFusion Ray rewrites the physical plan and replaces the `RepartionExec` with a pair of
270266
operators to perform a "shuffle". These are the `ShuffleWriterExec` and `ShuffleReaderExec`.
271267

272268
### Shuffle Writes
