Rust bindings for libcudf, the GPU-accelerated DataFrame library from RAPIDS.
This project provides safe, idiomatic Rust bindings to cuDF using the cxx library for seamless C++/Rust interoperability. cuDF enables GPU-accelerated operations on DataFrames, offering significant performance improvements for data processing tasks.
For SQL execution, this project uses Apache DataFusion with a physical optimizer rule that replaces vanilla DataFusion nodes with GPU variants.
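The node replacement can be pictured with a toy model. The sketch below is not DataFusion's `ExecutionPlan` or `PhysicalOptimizerRule` API, and it is not this project's implementation: the `Plan` enum and the `gpu_variant`, `rewrite`, `optimize`, and `render` functions are hypothetical names used only for illustration. It assumes a single-input plan, swaps each operator that has a GPU variant, and inserts load/unload nodes wherever data crosses between host and device.

```rust
// Toy plan tree. Hypothetical types, NOT DataFusion's real plan nodes.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    DataSource,                   // leaf that reads data on the host
    Cpu(&'static str, Box<Plan>), // CPU operator with a single input
    Gpu(&'static str, Box<Plan>), // GPU operator with a single input
    Load(Box<Plan>),              // host -> device transfer
    Unload(Box<Plan>),            // device -> host transfer
}

/// GPU variant name for the operators this toy rule knows how to move (assumed mapping).
fn gpu_variant(name: &str) -> Option<&'static str> {
    match name {
        "FilterExec" => Some("CuDFFilterExec"),
        "ProjectionExec" => Some("CuDFProjectionExec"),
        "AggregateExec" => Some("CuDFAggregateExec"),
        "SortExec" => Some("CuDFSortExec"),
        _ => None,
    }
}

/// True when the node's output lives on the device.
fn on_gpu(plan: &Plan) -> bool {
    matches!(plan, Plan::Gpu(..) | Plan::Load(..))
}

/// Bottom-up rewrite: replace CPU operators and bridge host/device boundaries.
fn rewrite(plan: Plan) -> Plan {
    match plan {
        Plan::Cpu(name, input) => {
            let input = rewrite(*input);
            match gpu_variant(name) {
                Some(gpu_name) => {
                    // A GPU operator needs its input on the device.
                    let input = if on_gpu(&input) { input } else { Plan::Load(Box::new(input)) };
                    Plan::Gpu(gpu_name, Box::new(input))
                }
                None => {
                    // A CPU operator needs its input back on the host.
                    let input = if on_gpu(&input) { Plan::Unload(Box::new(input)) } else { input };
                    Plan::Cpu(name, Box::new(input))
                }
            }
        }
        other => other,
    }
}

/// Rewrites the plan and makes sure the final result ends up on the host.
fn optimize(plan: Plan) -> Plan {
    let plan = rewrite(plan);
    if on_gpu(&plan) { Plan::Unload(Box::new(plan)) } else { plan }
}

/// Renders the plan one node per line, indenting each input by two spaces.
fn render(plan: &Plan, depth: usize, out: &mut Vec<String>) {
    let (label, input) = match plan {
        Plan::DataSource => ("DataSourceExec", None),
        Plan::Cpu(name, input) | Plan::Gpu(name, input) => (*name, Some(input)),
        Plan::Load(input) => ("CuDFLoadExec", Some(input)),
        Plan::Unload(input) => ("CuDFUnloadExec", Some(input)),
    };
    out.push(format!("{}{}", "  ".repeat(depth), label));
    if let Some(input) = input {
        render(input, depth + 1, out);
    }
}

/// Small helper to build CPU nodes without spelling out `Box::new`.
fn cpu(name: &'static str, input: Plan) -> Plan {
    Plan::Cpu(name, Box::new(input))
}

fn main() {
    let plan = cpu("SortExec", cpu("AggregateExec", cpu("FilterExec", Plan::DataSource)));
    let mut lines = Vec::new();
    render(&optimize(plan), 0, &mut lines);
    println!("{}", lines.join("\n"));
}
```

A real rule also has to deal with multi-input operators, partitioning, and operators that can be merged or dropped, none of which this sketch attempts.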
Taking the following query from the TPCH benchmark:
select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    lineitem
where
    l_shipdate <= date '1998-09-02'
group by
    l_returnflag,
    l_linestatus
order by
    l_returnflag,
    l_linestatus;
DataFusion will produce the following executable plan:
SortPreservingMergeExec: [...]
  SortExec: expr=[...], preserve_partitioning=[...]
    ProjectionExec: expr=[...]
      AggregateExec: mode=FinalPartitioned, gby=[...], aggr=[...]
        RepartitionExec: partitioning=Hash([...], 4), input_partitions=4
          AggregateExec: mode=Partial, gby=[...], aggr=[...]
            ProjectionExec: expr=[...]
              FilterExec: <expr>, projection=[...]
                DataSourceExec: file_groups={4 groups: [...]}, projection=[...]
This project inspects the plan and replaces nodes with their cuDF (GPU)-based variants, producing a different executable plan that looks like this:
CuDFUnloadExec, metrics=[...]
  CuDFSortExec: expr=[...], preserve_partitioning=[...]
    CuDFProjectionExec: expr=[...]
      CuDFAggregateExec: mode=Single, group_by=[...], aggr_expr=[...]
        CuDFProjectionExec: expr=[...]
          CuDFFilterExec: l_shipdate@6 <= 1998-09-02, projection=[...]
            CuDFLoadExec, metrics=[...]
              DataSourceExec: file_groups={4 groups: [...]}, projection=[...]
The cuDF-based plan is both cheaper and faster to execute than the pure CPU one. This was measured by comparing execution latency on two different machines:
- m5.4xlarge | 16 vCPU, 64 GB RAM | ~$625 monthly | 906 ms TPCH Q1
- g4dn.xlarge | 4 vCPU, 16 GB RAM, NVIDIA T4 | ~$423 monthly | 813 ms TPCH Q1
Although the GPU-based machine is cheaper, with fewer vCPUs and less RAM, it still executes TPCH Q1 faster. Multiplying the price ratio (625 / 423 ≈ 1.48) by the latency ratio (906 / 813 ≈ 1.11) gives roughly 1.65, so at equal latency, executing on GPU is about 1.65x cheaper with the current state of this project.
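The 1.65x figure can be reproduced from the two rows above. This is a minimal sketch that assumes cost per query is proportional to monthly price multiplied by query latency; the `Machine` struct and `cost_per_query` function are illustrative names, not part of this project.

```rust
/// One benchmark machine: monthly price in USD and measured TPCH Q1 latency in ms.
struct Machine {
    name: &'static str,
    monthly_usd: f64,
    q1_latency_ms: f64,
}

/// Relative cost of one query: price multiplied by latency.
/// The unit is arbitrary, so only ratios between machines are meaningful.
fn cost_per_query(m: &Machine) -> f64 {
    m.monthly_usd * m.q1_latency_ms
}

fn main() {
    let cpu = Machine { name: "m5.4xlarge", monthly_usd: 625.0, q1_latency_ms: 906.0 };
    let gpu = Machine { name: "g4dn.xlarge", monthly_usd: 423.0, q1_latency_ms: 813.0 };

    // (625 * 906) / (423 * 813) is roughly 1.65
    let ratio = cost_per_query(&cpu) / cost_per_query(&gpu);
    println!("{} vs {}: GPU is {:.2}x cheaper per query", cpu.name, gpu.name, ratio);
}
```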
This project is the result of a hackathon that lasted a couple of weeks, and several low-hanging fruits remain that could make GPU execution significantly more performant.
Even though the focus of this project is to make TPCH Q1 run faster and cheaper on GPU than on CPU, it's capable of running the full TPCH suite on GPU. Rather than implementing a wide breadth of features, it focuses on laying the foundations for executing relational algebra on GPUs for a wide variety of use cases.
Follow-up work will bring further performance improvements and support for new relational algebra operations.
The project is organized as a Rust workspace with the following crates:
- libcudf-sys: Low-level FFI bindings to libcudf using cxx
- libcudf-rs: Safe, high-level Rust API wrapping the FFI bindings
- libcudf-datafusion: Integration with Apache DataFusion
Before building this project, you need:
- CUDA Toolkit: Required for GPU operations
  - Install from NVIDIA CUDA Downloads
- libcudf: The cuDF C++ library
  - Build from source: cuDF build instructions
  - Or install via conda:
    conda install -c rapidsai -c conda-forge cudf
- Rust toolchain:
  curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
- C++ compiler: GCC 9+ or Clang that supports C++17
Once dependencies are installed:
# Build the project
cargo build
# Run tests (requires CUDA-capable GPU)
cargo test
# Build with release optimizations
cargo build --release
Add this to your Cargo.toml:
[dependencies]
libcudf-rs = { path = "path/to/libcudf-rs" }