When compared to the row-oriented implementation, Arrow's performance fell short of expectations #14932

jayhan94 · 2025-02-28T10:49:34Z

jayhan94
Feb 28, 2025

When implementing a simple filter-and-sum query using Arrow, I observed that performance fell short of expectations. Compared to a row-oriented implementation, the slowdown appears to come from additional memory allocations. The row-oriented engine performs better because it can avoid deep copies when passing data between operators.

The experimental codebase is here.

xudong963 · 2025-02-28T10:59:21Z

xudong963
Feb 28, 2025
Collaborator

The experimental codebase is at here.

The link returns a 404.

1 reply

jayhan94 Feb 28, 2025
Author

I forgot to make it public. It should work now.

alamb · 2025-03-01T12:12:16Z

alamb
Mar 1, 2025
Collaborator

Here is a good paper on the high level differences (in the background section):
Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask

0 replies

alamb · 2025-03-01T12:17:45Z

alamb
Mar 1, 2025
Collaborator

Compared to the row-oriented implementation, the performance degradation appears to be attributed to additional memory allocations.

I didn't look at your code too closely, but the actual datasource itself also seems to make many allocations
https://github.com/jayhan94/arrow-playground/blob/70dc0ee80f0507e9c4e3041addc0f36152cc4700/src/columnar.rs#L27-L50

As @tustvold said in Discord

I'd also recommend using a CPU profiler, e.g. hotspot, to analyse where your application is spending time. From a quick glance the way you are constructing a StringArray will perform a lot of unnecessary allocations

Here is some documentation on how to do it: https://datafusion.apache.org/library-user-guide/profiling.html
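On the `StringArray` point quoted above, here is a minimal std-only sketch of why per-row `String`s cost more than a flat buffer. `FlatStrings` is a hypothetical, simplified stand-in for Arrow's string layout (one contiguous byte buffer plus an offsets array); it is not the arrow-rs API, and it is not the code from the linked repository. In arrow-rs itself, `StringBuilder` fills the same role.

```rust
/// Hypothetical, simplified stand-in for Arrow's variable-length string
/// layout: all values live in one buffer, delimited by an offsets array.
pub struct FlatStrings {
    offsets: Vec<usize>, // value i is data[offsets[i]..offsets[i + 1]]
    data: String,
}

impl FlatStrings {
    /// Reserve space up front so appends rarely need to reallocate.
    pub fn with_capacity(items: usize, bytes: usize) -> Self {
        let mut offsets = Vec::with_capacity(items + 1);
        offsets.push(0);
        Self { offsets, data: String::with_capacity(bytes) }
    }

    /// Append one value by copying its bytes into the shared buffer.
    pub fn push(&mut self, value: &str) {
        self.data.push_str(value);
        self.offsets.push(self.data.len());
    }

    pub fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    pub fn get(&self, i: usize) -> &str {
        &self.data[self.offsets[i]..self.offsets[i + 1]]
    }
}

/// Allocation-heavy: `format!` creates a separate heap String for every row.
pub fn build_per_row(n: usize) -> Vec<String> {
    (0..n).map(|i| format!("row-{i}")).collect()
}

/// Formats each value directly into the shared buffer, so the number of
/// allocations does not grow with one-per-row.
pub fn build_flat(n: usize) -> FlatStrings {
    use std::fmt::Write;
    let mut col = FlatStrings::with_capacity(n, n * 8);
    for i in 0..n {
        write!(col.data, "row-{i}").unwrap();
        col.offsets.push(col.data.len());
    }
    col
}
```

Both builders produce the same logical values; the difference a profiler would show is in how many times the allocator is called.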

Note that it is possible to reuse the allocations in DataFusion's functions, though most of the built-in ones don't, as we don't normally see allocations as the bottleneck in filter evaluation.

See the example here:

datafusion/datafusion-examples/examples/advanced_udf.rs

Lines 203 to 246 in 4d2e06f

    
           fn maybe_pow_in_place(base: f64, exp_array: ArrayRef) -> Result<ArrayRef> { 
        
               // Calling `unary` creates a new array for the results. Avoiding 
        
               // allocations is a common optimization in performance critical code. 
        
               // arrow-rs allows this optimization via the `unary_mut` 
        
               // and `binary_mut` kernels in certain cases 
        
               // 
        
               // These kernels can only be used if there are no other references to 
        
               // the arrays (exp_array has to be the last remaining reference). 
        
               let owned_array = exp_array 
        
                   // as in the previous example, we first downcast to &Float64Array 
        
                   .as_primitive::<Float64Type>() 
        
                   // non-obviously, we call clone here to get an owned `Float64Array`. 
        
                   // Calling clone() is relatively inexpensive as it increments 
        
                   // some ref counts but doesn't clone the data) 
        
                   // 
        
                   // Once we have the owned Float64Array we can drop the original 
        
                   // exp_array (untyped) reference 
        
                   .clone(); 
        
               // We *MUST* drop the reference to `exp_array` explicitly so that 
        
               // owned_array is the only reference remaining in this function. 
        
               // 
        
               // Note that depending on the query there may still be other references 
        
               // to the underlying buffers, which would prevent reuse. The only way to 
        
               // know for sure is the result of `compute::unary_mut` 
        
               drop(exp_array); 
        
               // If we have the only reference, compute the result directly into the same 
        
               // allocation as was used for the input array 
        
               match compute::unary_mut(owned_array, |exp| base.powf(exp)) { 
        
                   Err(_orig_array) => { 
        
                       // unary_mut will return the original array if there are other 
        
                       // references into the underling buffer (and thus reuse is 
        
                       // impossible) 
        
                       // 
        
                       // In a real implementation, this case should fall back to 
        
                       // calling `unary` and allocate a new array; In this example 
        
                       // we will return an error for demonstration purposes 
        
                       exec_err!("Could not reuse array for maybe_pow_in_place") 
        
                   } 
        
                   // a result of OK means the operation was run successfully 
        
                   Ok(res) => Ok(Arc::new(res)), 
        
               } 
        
           }
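
The "reuse only if we hold the last reference" rule in the example above can be illustrated without Arrow at all. The following is a rough std-only analogy, assuming an `Arc<Vec<f64>>` stands in for a reference-counted Arrow buffer; `pow_in_place` and `pow_or_copy` are hypothetical names, not arrow-rs or DataFusion functions. Unlike the example above, the second function shows the fallback to a fresh allocation.

```rust
use std::sync::Arc;

/// Hypothetical analogue of the in-place kernel: succeeds only when `exps`
/// is the sole reference, otherwise hands the untouched input back in `Err`.
pub fn pow_in_place(
    base: f64,
    exps: Arc<Vec<f64>>,
) -> Result<Arc<Vec<f64>>, Arc<Vec<f64>>> {
    match Arc::try_unwrap(exps) {
        Ok(mut values) => {
            // Sole owner: overwrite the existing element buffer.
            for v in values.iter_mut() {
                *v = base.powf(*v);
            }
            // Re-wrapping allocates a new Arc header, but the Vec's
            // element buffer is the same one that was passed in.
            Ok(Arc::new(values))
        }
        // Someone else still holds a reference, so reuse is impossible.
        Err(shared) => Err(shared),
    }
}

/// Fallback wrapper: try to reuse the buffer, otherwise allocate a new one.
pub fn pow_or_copy(base: f64, exps: Arc<Vec<f64>>) -> Arc<Vec<f64>> {
    pow_in_place(base, exps).unwrap_or_else(|shared| {
        Arc::new(shared.iter().map(|e| base.powf(*e)).collect::<Vec<f64>>())
    })
}
```

As in the real kernel, the caller cannot know in advance whether reuse will succeed; the returned `Result` is the only reliable signal.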


1 reply

jayhan94 Mar 1, 2025
Author

Yes, I've optimized the code, and the Arrow implementation now runs 5 times faster. I believe its performance has reached the expected level.

2010YOUY01 · 2025-03-03T07:43:30Z

2010YOUY01
Mar 3, 2025
Collaborator

One additional point is that real row-based system code can't be as optimized as your demo implementation.
For example, if there is an operation to extract the j-th element of the i-th row, the code will probably look like:

let row = input.get_row(i);
let elem = row.get_col(row.get_schema().get_column_datatype(j), j);

The key issue is that, to extract a single element, the system has to pay the overhead of multiple function calls for every row. If you benchmark an analytical query in a row-based system like PostgreSQL, most of the execution time will be spent on these function calls interpreting each row, rather than on the actual intended computation.

In contrast, vectorized engines like DataFusion only incur this function-calling overhead once per vector, which can contain thousands of elements. This significantly improves efficiency by amortizing function call overhead across many data points.
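
To make the contrast concrete, here is a minimal std-only sketch of the two styles for a filter-and-sum. It is not how PostgreSQL or DataFusion are actually implemented; `Value`, `Row`, `Column`, and both functions are hypothetical names for illustration. The row version re-checks the type tag for every value it touches, while the columnar version dispatches on type once per batch and then runs a tight typed loop.

```rust
/// A dynamically typed cell, as a generic row engine might store it.
pub enum Value {
    Int(i64),
    Float(f64),
}

pub struct Row(pub Vec<Value>);

/// Row-at-a-time: the type of each cell is interpreted again on every row.
pub fn sum_where_rows(rows: &[Row], filter_col: usize, min: i64, sum_col: usize) -> f64 {
    let mut total = 0.0;
    for row in rows {
        // Per-row type check for the filter column.
        let keep = match &row.0[filter_col] {
            Value::Int(v) => *v > min,
            _ => false,
        };
        if keep {
            // Per-row type check for the summed column.
            if let Value::Float(x) = &row.0[sum_col] {
                total += *x;
            }
        }
    }
    total
}

/// A typed column holding many values of the same type.
pub enum Column {
    Int(Vec<i64>),
    Float(Vec<f64>),
}

/// Vectorized: one type dispatch per batch, then a loop over plain slices.
pub fn sum_where_columns(filter: &Column, min: i64, values: &Column) -> f64 {
    match (filter, values) {
        (Column::Int(keys), Column::Float(vals)) => keys
            .iter()
            .zip(vals)
            .filter(|(k, _)| **k > min)
            .map(|(_, v)| *v)
            .sum::<f64>(),
        // Other type combinations are left out of this sketch.
        _ => 0.0,
    }
}
```

With batches of a few thousand values, the single `match` in `sum_where_columns` is paid once per batch, whereas `sum_where_rows` pays its two `match`es on every row.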

To generate such an optimized row-based implementation, there is another technique called compiled execution, which translates SQL queries directly into low-level code for execution. Currently, I think this kind of dark magic is mostly found in academia.

I remember this idea being discussed in https://15721.courses.cs.cmu.edu/spring2023/papers/03-storage/p967-abadi.pdf, or possibly in another paper from the reading list at https://15721.courses.cs.cmu.edu/spring2023/schedule.html.
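
As a rough illustration of what compiled execution aims for, here is a hand-written sketch of the kind of code a query compiler might emit for one specific query. The table, the `LineItem` struct, and the function name are all hypothetical; a real system would generate something like this at query time rather than have it written by hand.

```rust
/// Hypothetical row type specialized to the table's schema: the column
/// types are known at compile time, so no per-value type tag is needed.
pub struct LineItem {
    pub quantity: i64,
    pub price: f64,
}

/// Sketch of generated code for:
///   SELECT SUM(price) FROM line_item WHERE quantity > 10
/// The filter and the aggregation are fused into one loop, with no
/// schema lookups or dynamic dispatch per row.
pub fn query_sum_price_where_quantity_gt_10(rows: &[LineItem]) -> f64 {
    let mut total = 0.0;
    for row in rows {
        if row.quantity > 10 {
            total += row.price;
        }
    }
    total
}
```

The row layout is kept, but the per-row interpretation overhead described above disappears because everything query-specific has been resolved before the loop runs.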

4 replies

jayhan94 Mar 3, 2025
Author

I believe the CompiledProjectStream closely resembles compiled execution. In this implementation, it's aware of each column's data type and eliminates virtual function calls.

jayhan94 Mar 3, 2025
Author

I have re-implemented the row-based system with a more realistic approach that better aligns with real-world implementations. This has resulted in significant performance degradation.
The performance comparison among row-based execution, compiled execution, and Arrow-based execution now stands at approximately 1:5:25. I think this ratio aligns closely with what we would expect in realistic, production-grade scenarios.

alamb Mar 3, 2025
Collaborator

Makes sense -- @jayhan94 do you have a write up (e.g. a blog post)? I think hearing / seeing another perspective on this subject would be broadly interesting to people. The academic papers have a good treatment, but the barrier to entry of finding and reading them is high

A more generally accessible blog post would be really nice.

jayhan94 Mar 3, 2025
Author

I haven't tried yet, but I'm willing to give it a shot.
