Commit 2c8b8b8

Using tpch script from datafusion-benchmarks (#12)
* Using tpch script from datafusion-benchmarks
* Using tpch script from datafusion-benchmarks
* Reverting to single partition
* Removing plans, reverting to single partition
* Trying one partition only
* Fixing tests
* One partition only
* Using TPCH Dbgen from Databricks
* Restored partition count
* Will tests eventually pass?
* Introducing regexp for determinism
* Ignored additional tests
* Ignored additional tests
* Update README.md
1 parent 2523e9f commit 2c8b8b8

31 files changed: +1485 −2311 lines

.github/workflows/rust.yml

+23-11
```diff
@@ -2,24 +2,36 @@ name: Rust
 
 on:
   push:
-    branches: [ "main" ]
   pull_request:
-    branches: [ "main" ]
 
 env:
   CARGO_TERM_COLOR: always
+  PYTHON_VERSION: 3.9
+  TPCH_SCALING_FACTOR: "1"
+  TPCH_TEST_PARTITIONS: "1"
+  TPCH_DATA_PATH: "data"
 
 jobs:
   build:
-
     runs-on: ubuntu-latest
 
     steps:
-      - uses: actions/checkout@v3
-      - name: Install protobuf compiler
-        shell: bash
-        run: sudo apt-get install protobuf-compiler
-      - name: Build Rust code
-        run: cargo build --verbose
-      - name: Run tests
-        run: cargo test --verbose
+      - uses: actions/checkout@v3
+      - name: Install protobuf compiler
+        shell: bash
+        run: sudo apt-get install protobuf-compiler
+      - name: Build Rust code
+        run: cargo build --verbose
+      - name: Set up Python
+        uses: actions/setup-python@v2
+        with:
+          python-version: ${{ env.PYTHON_VERSION }}
+      - name: Install test dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -r tpch/requirements.txt
+      - name: Generate test data
+        run: |
+          ./scripts/gen-test-data.sh
+      - name: Run tests
+        run: cargo test --verbose
```
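The workflow drives both data generation and the Rust tests through three TPC-H environment variables. A minimal sketch (assuming a POSIX shell) of mirroring that environment locally before invoking the generation script; the defaults below are the same values the CI `env:` block uses:

```shell
#!/bin/sh
# Mirror the env block from .github/workflows/rust.yml so a local run
# sees the same configuration CI does. Existing values win; otherwise
# fall back to the CI defaults.
export TPCH_SCALING_FACTOR="${TPCH_SCALING_FACTOR:-1}"
export TPCH_TEST_PARTITIONS="${TPCH_TEST_PARTITIONS:-1}"
export TPCH_DATA_PATH="${TPCH_DATA_PATH:-data}"

# gen-test-data.sh aborts when either variable is unset, so fail fast
# here with the same checks before invoking it.
if [ -z "$TPCH_TEST_PARTITIONS" ] || [ -z "$TPCH_SCALING_FACTOR" ]; then
    echo "Error: TPCH environment is incomplete." >&2
    exit 1
fi
echo "env ok: SF=$TPCH_SCALING_FACTOR partitions=$TPCH_TEST_PARTITIONS"
```

With the environment in place, `./scripts/gen-test-data.sh` can be run exactly as the "Generate test data" step does.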

.gitignore

+2
Original file line numberDiff line numberDiff line change
@@ -5,3 +5,5 @@ venv
55
*.so
66
*.log
77
results-sf*
8+
data
9+
tpch/tpch-dbgen

Cargo.lock

+31-6
Some generated files are not rendered by default.

Cargo.toml

+6-1
```diff
@@ -45,6 +45,11 @@ uuid = "1.2"
 rustc_version = "0.4.0"
 tonic-build = { version = "0.8", default-features = false, features = ["transport", "prost"] }
 
+[dev-dependencies]
+anyhow = "1.0.89"
+pretty_assertions = "1.4.0"
+regex = "1.11.0"
+
 [lib]
 name = "datafusion_ray"
 crate-type = ["cdylib", "rlib"]
@@ -54,4 +59,4 @@ name = "datafusion_ray._datafusion_ray_internal"
 
 [profile.release]
 codegen-units = 1
-lto = true
+lto = true
```

README.md

+34-6
````diff
@@ -19,8 +19,8 @@
 
 # DataFusion on Ray
 
-> This was originally a research project donated from [ray-sql](https://github.com/datafusion-contrib/ray-sql) to evaluate performing distributed SQL queries from Python, using
-[Ray](https://www.ray.io/) and [DataFusion](https://github.com/apache/arrow-datafusion).
+> This was originally a research project donated from [ray-sql](https://github.com/datafusion-contrib/ray-sql) to evaluate performing distributed SQL queries from Python, using
+> [Ray](https://www.ray.io/) and [DataFusion](https://github.com/apache/arrow-datafusion).
 
 DataFusion Ray is a distributed SQL query engine powered by the Rust implementation of [Apache Arrow](https://arrow.apache.org/), [Apache DataFusion](https://datafusion.apache.org/) and [Ray](https://www.ray.io/).
 
@@ -33,7 +33,7 @@ DataFusion Ray is a distributed SQL query engine powered by the Rust implementat
 
 ## Non Goals
 
-- Re-build the cluster scheduling systems like what [Ballista](https://datafusion.apache.org/ballista/) did.
+- Re-build the cluster scheduling systems like what [Ballista](https://datafusion.apache.org/ballista/) did.
   - Ballista is extremely complex and utilizing Ray feels like it abstracts some of that complexity away.
   - Datafusion Ray is delegating cluster management to Ray.
 
@@ -120,10 +120,38 @@ python -m pip install -r requirements-in.txt
 
 Whenever rust code changes (your changes or via `git pull`):
 
 ```bash
 # make sure you activate the venv using "source venv/bin/activate" first
 maturin develop
 python -m pytest
 ```
+
+## Testing
+
+Running the local Rust tests requires generating the TPC-H data first.
+This can be done by running the following command:
+
+```bash
+./scripts/gen-test-data.sh
+```
+
+Tests compare plans with expected plans, which unfortunately contain the
+path to the parquet tables. The path committed under version control is
+the one of a GitHub runner, and won't work locally. You can fix it by
+running the following command:
+
+```bash
+./scripts/replace-expected-plan-paths.sh local-dev
+```
+
+When you instead need to regenerate the plans, remove all the content of
+`testdata/expected-plans` and re-run the planner tests; the regenerated
+plans will contain your local paths. Fix them before committing by running:
+
+```bash
+./scripts/replace-expected-plan-paths.sh pre-ci
+```
 
 ## Benchmarking
````

scripts/gen-test-data.sh

+60
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,60 @@
1+
#!/bin/bash
2+
3+
set -e
4+
5+
create_directories() {
6+
mkdir -p data
7+
}
8+
9+
clone_and_build_tpch_dbgen() {
10+
if [ -z "$(ls -A tpch/tpch-dbgen)" ]; then
11+
echo "tpch/tpch-dbgen folder is empty. Cloning repository..."
12+
git clone https://github.com/databricks/tpch-dbgen.git tpch/tpch-dbgen
13+
cd tpch/tpch-dbgen
14+
make
15+
cd ../../
16+
else
17+
echo "tpch/tpch-dbgen folder is not empty. Skipping cloning of TPCH dbgen."
18+
fi
19+
}
20+
21+
generate_data() {
22+
cd tpch/tpch-dbgen
23+
if [ "$TPCH_TEST_PARTITIONS" -gt 1 ]; then
24+
for i in $(seq 1 "$TPCH_TEST_PARTITIONS"); do
25+
./dbgen -f -s "$TPCH_SCALING_FACTOR" -C "$TPCH_TEST_PARTITIONS" -S "$i"
26+
done
27+
else
28+
./dbgen -f -s "$TPCH_SCALING_FACTOR"
29+
fi
30+
mv ./*.tbl* ../../data
31+
}
32+
33+
convert_data() {
34+
cd ../../
35+
python -m tpch.tpchgen convert --partitions "$TPCH_TEST_PARTITIONS"
36+
}
37+
38+
main() {
39+
if [ -z "$TPCH_TEST_PARTITIONS" ]; then
40+
echo "Error: TPCH_TEST_PARTITIONS is not set."
41+
exit 1
42+
fi
43+
44+
if [ -z "$TPCH_SCALING_FACTOR" ]; then
45+
echo "Error: TPCH_SCALING_FACTOR is not set."
46+
exit 1
47+
fi
48+
49+
create_directories
50+
51+
if [ -z "$(ls -A data)" ]; then
52+
clone_and_build_tpch_dbgen
53+
generate_data
54+
convert_data
55+
else
56+
echo "Data folder is not empty. Skipping cloning and data generation."
57+
fi
58+
}
59+
60+
main
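The partitioning branch in `generate_data` is the subtle part: with more than one partition, dbgen is invoked once per chunk, passing the chunk count via `-C` and the chunk index via `-S`. A dry-run sketch of that loop, with `echo` standing in for the real `./dbgen` binary so the commands are printed rather than executed:

```shell
#!/bin/sh
# Dry-run of the generate_data partition loop: print each dbgen
# invocation instead of running it. Values mirror a 3-partition run
# at scale factor 1 (hypothetical; CI defaults to 1 partition).
TPCH_SCALING_FACTOR=1
TPCH_TEST_PARTITIONS=3

emit_dbgen_cmds() {
    if [ "$TPCH_TEST_PARTITIONS" -gt 1 ]; then
        # One invocation per chunk: -C total chunks, -S this chunk's index.
        for i in $(seq 1 "$TPCH_TEST_PARTITIONS"); do
            echo "./dbgen -f -s $TPCH_SCALING_FACTOR -C $TPCH_TEST_PARTITIONS -S $i"
        done
    else
        # Single-partition case: no chunking flags.
        echo "./dbgen -f -s $TPCH_SCALING_FACTOR"
    fi
}

emit_dbgen_cmds
```

This prints three `./dbgen` command lines, one per chunk, which is exactly the shape of the loop the script runs inside `tpch/tpch-dbgen`.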
scripts/replace-expected-plan-paths.sh

+44

```bash
#!/bin/bash

# This script helps change the path to parquet files in expected plans for
# local development and CI

set -e

if [ "$#" -ne 1 ]; then
    echo "Usage: $0 <mode>"
    echo "Modes: pre-ci, local-dev"
    exit 1
fi

# Assign the parameter to the mode variable
mode=$1

ci_dir="home/runner/work/datafusion-ray/datafusion-ray"
current_dir=$(pwd)
current_dir_no_leading_slash="${current_dir#/}"
expected_plans_dir="./testdata/expected-plans"

# Function to replace paths in files
replace_paths() {
    local search=$1
    local replace=$2
    find "$expected_plans_dir" -type f -exec sed -i "s|$search|$replace|g" {} +
    echo "Replaced all occurrences of '$search' with '$replace' in files within '$expected_plans_dir'."
}

# Handle the modes
case $mode in
    pre-ci)
        replace_paths "$current_dir_no_leading_slash" "$ci_dir"
        ;;
    local-dev)
        replace_paths "$ci_dir" "$current_dir_no_leading_slash"
        ;;
    *)
        echo "Invalid mode: $mode"
        echo "Usage: $0 <mode>"
        echo "Modes: pre-ci, local-dev"
        exit 1
        ;;
esac
```
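The core of the script is a single `sed` substitution using `|` as the delimiter, so the slashes in the paths need no escaping; both paths are handled without their leading slash so the same pattern matches either direction. A self-contained sketch of that swap on a throwaway file (the local path below is made up for illustration):

```shell
#!/bin/sh
# Demonstrate the path swap on a temporary expected-plan file.
tmpdir=$(mktemp -d)
ci_dir="home/runner/work/datafusion-ray/datafusion-ray"
local_dir="home/alice/src/datafusion-ray"   # hypothetical local checkout

# A plan line of the kind the expected-plan files contain, with the
# CI runner's absolute path baked in.
printf 'ParquetExec: file_groups={/%s/data/part-0.parquet}\n' \
    "$ci_dir" > "$tmpdir/plan.txt"

# Same sed form the script uses: s|search|replace|g, in-place.
sed -i "s|$ci_dir|$local_dir|g" "$tmpdir/plan.txt"

result=$(cat "$tmpdir/plan.txt")
echo "$result"
rm -rf "$tmpdir"
```

Running `local-dev` then `pre-ci` round-trips the files back to the committed state, which is why the two modes simply call `replace_paths` with the arguments swapped.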
