Skip to content

[C++] Substrait serialised expression size increases exponentially due to extension URIs being repeated #48761

@ahsanabbas123

Description

@ahsanabbas123

Describe the bug, including details regarding any error messages, version, and platform.

When serialising pyarrow expressions using substrait, the size of the serialised buffer increases exponentially with the number of binary operation clauses in the expression due to extension URIs being repeated again for each nested expression.

Testing with pyarrow 21.0.0:

================================================================================
Testing with 1 OR condition(s)
================================================================================
Expression:  (int_col == 1)
Substrait size: 201 bytes
Total extension URIs: 1

================================================================================
Testing with 2 OR condition(s)
================================================================================
Expression:  ((int_col == 1) or (int_col == 2))
Substrait size: 634 bytes
Total extension URIs: 5

================================================================================
Testing with 4 OR condition(s)
================================================================================
Expression:  ((((int_col == 1) or (int_col == 2)) or (int_col == 3)) or (int_col == 4))
Substrait size: 2967 bytes
Total extension URIs: 29

================================================================================
Testing with 6 OR condition(s)
================================================================================
Expression:  ((((((int_col == 1) or (int_col == 2)) or (int_col == 3)) or (int_col == 4)) or (int_col == 5)) or (int_col == 6))
Substrait size: 12017 bytes
Total extension URIs: 125

================================================================================
Testing with 8 OR condition(s)
================================================================================
Expression:  ((((((((int_col == 1) or (int_col == 2)) or (int_col == 3)) or (int_col == 4)) or (int_col == 5)) or (int_col == 6)) or (int_col == 7)) or (int_col == 8))
Substrait size: 48306 bytes
Total extension URIs: 509

================================================================================
Testing with 10 OR condition(s)
================================================================================
Expression:  ((((((((((int_col == 1) or (int_col == 2)) or (int_col == 3)) or (int_col == 4)) or (int_col == 5)) or (int_col == 6)) or (int_col == 7)) or (int_col == 8)) or (int_col == 9)) or (int_col == 10))
Substrait size: 193172 bytes
Total extension URIs: 2045

================================================================================
Testing with 12 OR condition(s)
================================================================================
Expression:  ((((((((((((int_col == 1) or (int_col == 2)) or (int_col == 3)) or (int_col == 4)) or (int_col == 5)) or (int_col == 6)) or (int_col == 7)) or (int_col == 8)) or (int_col == 9)) or (int_col == 10)) or (int_col == 11)) or (int_col == 12))
Substrait size: 772342 bytes
Total extension URIs: 8189

================================================================================
Testing with 14 OR condition(s)
================================================================================
Expression:  ((((((((((((((int_col == 1) or (int_col == 2)) or (int_col == 3)) or (int_col == 4)) or (int_col == 5)) or (int_col == 6)) or (int_col == 7)) or (int_col == 8)) or (int_col == 9)) or (int_col == 10)) or (int_col == 11)) or (int_col == 12)) or (int_col == 13)) or (int_col == 14))
Substrait size: 3105111 bytes
Total extension URIs: 32765

================================================================================
Summary:
================================================================================
 OR Conditions  |  Size (bytes)   |    Size (KB)    | Extension URIs 
----------------------------------------------------------------------
       1        |       201       |      0.20       |        1       
       2        |       634       |      0.62       |        5       
       4        |      2,967      |      2.90       |       29       
       6        |     12,017      |      11.74      |       125      
       8        |     48,306      |      47.17      |       509      
      10        |     193,172     |     188.64      |      2045      
      12        |     772,342     |     754.24      |      8189      
      14        |    3,105,111    |     3032.33     |      32765     

Serialization was done by simply: buf = pyarrow.substrait.serialize_expressions([expr], ["result"], schema)

Unfortunately, it might not be possible to just use is_in in the expression for my use-case so that might not be a possible workaround.

Would be nice to know if I am somehow misusing the API which is causing this. Thanks!

cc: @westonpace

Component(s)

C++

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions