Description
Is your feature request related to a problem? Please describe.
Analytics backends and data science tools increasingly demand high-performance, binary data transfer protocols. The current REST HTTP API, while flexible and widely compatible, introduces significant overhead for data-intensive workloads:
- JSON serialization/deserialization adds latency
- Text-based protocols are inefficient for large result sets
- No standard binary protocol means each client must implement custom optimizations
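To make the overhead concrete, here is a minimal sketch that encodes the same 20,000-row result set twice: once as row-oriented JSON, and once as two contiguous typed buffers. It uses only the Python standard library and is not Arrow IPC; the column names (`id`, `amount`) and the helper functions are made up for illustration.

```python
import json
from array import array


def encode_json_rows(ids, amounts):
    """Row-oriented JSON, similar in spirit to a REST response body."""
    rows = [{"id": i, "amount": a} for i, a in zip(ids, amounts)]
    return json.dumps(rows).encode("utf-8")


def encode_columnar(ids, amounts):
    """Columnar binary: one contiguous buffer of int64, then one of float64."""
    return array("q", ids).tobytes() + array("d", amounts).tobytes()


def decode_columnar(buf, n_rows):
    """Rebuild both columns by copying raw bytes; no text parsing involved."""
    ids = array("q")
    ids.frombytes(buf[: n_rows * 8])
    amounts = array("d")
    amounts.frombytes(buf[n_rows * 8:])
    return ids.tolist(), amounts.tolist()


# Hypothetical result set: 20,000 rows with two columns.
n = 20_000
ids = list(range(n))
amounts = [i * 1.5 for i in ids]

json_bytes = encode_json_rows(ids, amounts)
col_bytes = encode_columnar(ids, amounts)

print(f"JSON payload:     {len(json_bytes)} bytes")
print(f"Columnar payload: {len(col_bytes)} bytes")
```

The columnar payload is a fixed 16 bytes per row, while the JSON payload repeats every key name in every row and spells each number out as text. Arrow adds schema metadata, validity bitmaps, and alignment on top of this basic idea, so the real numbers will differ.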
Modern analytics ecosystems (Python/pandas, R, Julia, Elixir/Livebook) are converging on Arrow as the standard in-memory columnar format, and ADBC (Arrow Database Connectivity) as the standard database access API. Users expect databases and semantic layers to support these standards natively.
Describe the solution you'd like
Add an Arrow Native server to CubeSQL that:
- Speaks Arrow IPC protocol on a dedicated port (default: 8120)
- Returns Arrow RecordBatches directly - no JSON serialization overhead
- Works with ADBC clients, including Python
- Optionally caches query results for repeated queries
This enables 8-15x faster data transfer compared to the REST API for typical analytics workloads.
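One way the optional result cache could behave is sketched below: a small LRU cache keyed by the SQL text, which returns the stored result on a repeated query instead of executing it again. This is an illustration only, not the proposed CubeSQL implementation; `QueryResultCache`, `get_or_compute`, and `fake_run_query` are hypothetical names.

```python
from collections import OrderedDict


class QueryResultCache:
    """Tiny LRU cache keyed by SQL text (hypothetical design)."""

    def __init__(self, max_entries=128):
        self.max_entries = max_entries
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(sql):
        # Collapse runs of whitespace so trivially reformatted queries share
        # an entry. Caveat: this also collapses whitespace inside string
        # literals; a real server would need a proper normalizer.
        return " ".join(sql.split())

    def get_or_compute(self, sql, run_query):
        key = self._key(sql)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        result = run_query(sql)
        self._entries[key] = result
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used
        return result


# Usage with a stand-in for query execution.
executed = []


def fake_run_query(sql):
    executed.append(sql)
    return b"serialized-record-batches"


cache = QueryResultCache(max_entries=2)
first = cache.get_or_compute("SELECT 1", fake_run_query)
second = cache.get_or_compute("SELECT   1", fake_run_query)  # served from cache
```

A production cache would also need invalidation when the underlying data changes and a bound on memory rather than on entry count; both are left out here.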
Describe alternatives you've considered
- Arrow Flight SQL - More complex protocol, requires gRPC. ADBC is simpler and sufficient for CubeSQL's use case.
- Optimizing REST API - JSON will always have serialization overhead. Binary protocols are fundamentally faster for columnar data.
- Custom binary protocol - Would require custom clients. ADBC is an emerging standard with growing ecosystem support.
Additional context
The ADBC ecosystem is maturing rapidly:
- Elixir/Livebook: The https://github.com/livebook-dev/adbc library provides ADBC bindings for the Elixir ecosystem. A working CubeSQL client extension is available in borodark/adbc#2 ("feat: Add Cube ADBC Driver for CubeSQL").
- Real-world usage: The power_of_three library (see borodark/power_of_three#5, "DataFrame from ADBC Client of Cube") demonstrates ADBC integration with Cube, showing 8-15x performance improvements over REST in production-like scenarios.
- Python/pandas: ADBC is becoming the recommended way to fetch data into DataFrames, replacing older approaches.
Having options is good, especially when one option is significantly faster. BI tools that connect via the PostgreSQL protocol keep working. Clients that call the REST API keep working. But users who need maximum performance now have a path: ADBC on port 8120.
Performance comparison (cached, 20K rows):
- REST HTTP API: 2133ms
- Arrow Native: 8ms (about 267x faster)